* [PATCH -mmotm 0/4] memcg: per cgroup dirty limit (v4)
@ 2010-03-04 10:40 ` Andrea Righi
  0 siblings, 0 replies; 68+ messages in thread
From: Andrea Righi @ 2010-03-04 10:40 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, Balbir Singh
  Cc: Vivek Goyal, Peter Zijlstra, Trond Myklebust, Suleiman Souhlal,
	Greg Thelen, Daisuke Nishimura, Kirill A. Shutemov,
	Andrew Morton, containers, linux-kernel, linux-mm

Control the maximum number of dirty pages a cgroup can have at any given time.

The per-cgroup dirty limit caps the amount of dirty (hard to reclaim) page cache
used by any cgroup. So, with multiple cgroups writing concurrently, no cgroup
can consume more than its designated share of dirty pages, and a cgroup that
crosses its limit is forced to perform write-out.

The overall design is the following:

 - account dirty pages per cgroup
 - limit the number of dirty pages via memory.dirty_ratio / memory.dirty_bytes
   and memory.dirty_background_ratio / memory.dirty_background_bytes in
   cgroupfs
 - start to write-out (background or actively) when the cgroup limits are
   exceeded

This feature is meant to work together with any underlying IO controller
implementation, so that we can stop the growth of dirty pages at the VM layer
and enforce write-out before any single cgroup consumes the global amount of
dirty pages defined by the /proc/sys/vm/dirty_ratio|dirty_bytes and
/proc/sys/vm/dirty_background_ratio|dirty_background_bytes limits.
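
For illustration, a minimal userspace sketch of how these limits can be set
through the proposed cgroupfs files; the mount point (/cgroup/memory) and the
cgroup name ("foo") below are assumptions, not something mandated by the
patchset:

#include <stdio.h>

/* Illustrative only: write a value into a memcg control file. */
static int write_memcg_file(const char *path, const char *value)
{
	FILE *f = fopen(path, "w");
	int ret;

	if (!f)
		return -1;
	ret = fprintf(f, "%s\n", value) < 0 ? -1 : 0;
	if (fclose(f))
		ret = -1;
	return ret;
}

int main(void)
{
	/* allow "foo" to dirty at most 10% of its memory limit... */
	write_memcg_file("/cgroup/memory/foo/memory.dirty_ratio", "10");
	/* ...and start background write-out when it reaches 5% */
	write_memcg_file("/cgroup/memory/foo/memory.dirty_background_ratio", "5");
	return 0;
}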

Changelog (v3 -> v4)
~~~~~~~~~~~~~~~~~~~~~~
 * handle the migration of tasks across different cgroups
   NOTE: at the moment we don't move charges of file cache pages, so this
   functionality is not immediately necessary. However, since the migration of
   file cache pages is planned, it is better to start handling file pages
   anyway.
 * properly account dirty pages in nilfs2
   (thanks to Kirill A. Shutemov <kirill@shutemov.name>)
 * lockless access to dirty memory parameters
 * fix: page_cgroup lock must not be acquired under mapping->tree_lock
   (thanks to Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> and
    KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>)
 * code restyling

-Andrea

* [PATCH -mmotm 1/4] memcg: dirty memory documentation
  2010-03-04 10:40 ` Andrea Righi
@ 2010-03-04 10:40   ` Andrea Righi
  -1 siblings, 0 replies; 68+ messages in thread
From: Andrea Righi @ 2010-03-04 10:40 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, Balbir Singh
  Cc: Vivek Goyal, Peter Zijlstra, Trond Myklebust, Suleiman Souhlal,
	Greg Thelen, Daisuke Nishimura, Kirill A. Shutemov,
	Andrew Morton, containers, linux-kernel, linux-mm, Andrea Righi

Document cgroup dirty memory interfaces and statistics.

Signed-off-by: Andrea Righi <arighi@develer.com>
---
 Documentation/cgroups/memory.txt |   36 ++++++++++++++++++++++++++++++++++++
 1 files changed, 36 insertions(+), 0 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 49f86f3..38ca499 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -310,6 +310,11 @@ cache		- # of bytes of page cache memory.
 rss		- # of bytes of anonymous and swap cache memory.
 pgpgin		- # of pages paged in (equivalent to # of charging events).
 pgpgout		- # of pages paged out (equivalent to # of uncharging events).
+filedirty	- # of pages that are waiting to get written back to the disk.
+writeback	- # of pages that are actively being written back to the disk.
+writeback_tmp	- # of pages used by FUSE for temporary writeback buffers.
+nfs		- # of NFS pages sent to the server, but not yet committed to
+		  the actual storage.
 active_anon	- # of bytes of anonymous and  swap cache memory on active
 		  lru list.
 inactive_anon	- # of bytes of anonymous memory and swap cache memory on
@@ -345,6 +350,37 @@ Note:
   - a cgroup which uses hierarchy and it has child cgroup.
   - a cgroup which uses hierarchy and not the root of hierarchy.
 
+5.4 dirty memory
+
+  Control the maximum amount of dirty pages a cgroup can have at any given time.
+
+  Limiting dirty memory caps the amount of dirty (hard to reclaim) page cache
+  used by any cgroup. So, with multiple cgroups writing concurrently, no
+  cgroup can consume more than its designated share of dirty pages, and a
+  cgroup that crosses its limit is forced to perform write-out.
+
+  The interface is equivalent to the procfs interface: /proc/sys/vm/dirty_*.
+  It is possible to configure a limit to trigger either a direct writeback or a
+  background writeback performed by per-bdi flusher threads.
+
+  Per-cgroup dirty limits can be set using the following files in the cgroupfs:
+
+  - memory.dirty_ratio: contains, as a percentage of cgroup memory, the
+    amount of dirty memory at which a process generating disk writes inside
+    the cgroup will itself start writing out dirty data.
+
+  - memory.dirty_bytes: the amount of dirty memory of the cgroup (expressed in
+    bytes) at which a process generating disk writes will start itself writing
+    out dirty data.
+
+  - memory.dirty_background_ratio: contains, as a percentage of the cgroup
+    memory, the amount of dirty memory at which background writeback kernel
+    threads will start writing out dirty data.
+
+  - memory.dirty_background_bytes: the amount of dirty memory of the cgroup (in
+    bytes) at which background writeback kernel threads will start writing out
+    dirty data.
+
 
 6. Hierarchy support
 
-- 
1.6.3.3
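
For illustration, a minimal userspace sketch that reads the statistics
documented above from memory.stat; the mount point (/cgroup/memory) and the
cgroup name ("foo") are assumptions:

#include <stdio.h>
#include <string.h>

int main(void)
{
	/* Illustrative only: dump the dirty-memory counters of cgroup "foo". */
	FILE *f = fopen("/cgroup/memory/foo/memory.stat", "r");
	char key[64];
	unsigned long long val;

	if (!f) {
		perror("memory.stat");
		return 1;
	}
	while (fscanf(f, "%63s %llu", key, &val) == 2) {
		if (!strcmp(key, "filedirty") || !strcmp(key, "writeback") ||
		    !strcmp(key, "writeback_tmp") || !strcmp(key, "nfs"))
			printf("%-14s %llu pages\n", key, val);
	}
	fclose(f);
	return 0;
}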


* [PATCH -mmotm 2/4] page_cgroup: introduce file cache flags
  2010-03-04 10:40 ` Andrea Righi
@ 2010-03-04 10:40   ` Andrea Righi
  -1 siblings, 0 replies; 68+ messages in thread
From: Andrea Righi @ 2010-03-04 10:40 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, Balbir Singh
  Cc: Vivek Goyal, Peter Zijlstra, Trond Myklebust, Suleiman Souhlal,
	Greg Thelen, Daisuke Nishimura, Kirill A. Shutemov,
	Andrew Morton, containers, linux-kernel, linux-mm, Andrea Righi,
	KAMEZAWA Hiroyuki

Introduce page_cgroup flags to keep track of file cache pages.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrea Righi <arighi@develer.com>
---
 include/linux/page_cgroup.h |   49 +++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 49 insertions(+), 0 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 30b0813..1b79ded 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -39,6 +39,12 @@ enum {
 	PCG_CACHE, /* charged as cache */
 	PCG_USED, /* this object is in use. */
 	PCG_ACCT_LRU, /* page has been accounted for */
+	PCG_MIGRATE_LOCK, /* used for mutual execution of account migration */
+	PCG_ACCT_FILE_MAPPED, /* page is accounted as file rss*/
+	PCG_ACCT_DIRTY, /* page is dirty */
+	PCG_ACCT_WRITEBACK, /* page is being written back to disk */
+	PCG_ACCT_WRITEBACK_TEMP, /* page is used as temporary buffer for FUSE */
+	PCG_ACCT_UNSTABLE_NFS, /* NFS page not yet committed to the server */
 };
 
 #define TESTPCGFLAG(uname, lname)			\
@@ -73,6 +79,27 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
 TESTPCGFLAG(AcctLRU, ACCT_LRU)
 TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)
 
+/* File cache and dirty memory flags */
+TESTPCGFLAG(FileMapped, ACCT_FILE_MAPPED)
+SETPCGFLAG(FileMapped, ACCT_FILE_MAPPED)
+CLEARPCGFLAG(FileMapped, ACCT_FILE_MAPPED)
+
+TESTPCGFLAG(Dirty, ACCT_DIRTY)
+SETPCGFLAG(Dirty, ACCT_DIRTY)
+CLEARPCGFLAG(Dirty, ACCT_DIRTY)
+
+TESTPCGFLAG(Writeback, ACCT_WRITEBACK)
+SETPCGFLAG(Writeback, ACCT_WRITEBACK)
+CLEARPCGFLAG(Writeback, ACCT_WRITEBACK)
+
+TESTPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP)
+SETPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP)
+CLEARPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP)
+
+TESTPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS)
+SETPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS)
+CLEARPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS)
+
 static inline int page_cgroup_nid(struct page_cgroup *pc)
 {
 	return page_to_nid(pc->page);
@@ -83,6 +110,9 @@ static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
 	return page_zonenum(pc->page);
 }
 
+/*
+ * lock_page_cgroup() should not be held under mapping->tree_lock
+ */
 static inline void lock_page_cgroup(struct page_cgroup *pc)
 {
 	bit_spin_lock(PCG_LOCK, &pc->flags);
@@ -93,6 +123,25 @@ static inline void unlock_page_cgroup(struct page_cgroup *pc)
 	bit_spin_unlock(PCG_LOCK, &pc->flags);
 }
 
+/*
+ * Lock order is
+ *     lock_page_cgroup()
+ *             lock_page_cgroup_migrate()
+ *
+ * This lock is not be lock for charge/uncharge but for account moving.
+ * i.e. overwrite pc->mem_cgroup. The lock owner should guarantee by itself
+ * the page is uncharged while we hold this.
+ */
+static inline void lock_page_cgroup_migrate(struct page_cgroup *pc)
+{
+	bit_spin_lock(PCG_MIGRATE_LOCK, &pc->flags);
+}
+
+static inline void unlock_page_cgroup_migrate(struct page_cgroup *pc)
+{
+	bit_spin_unlock(PCG_MIGRATE_LOCK, &pc->flags);
+}
+
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct page_cgroup;
 
-- 
1.6.3.3
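
For illustration, a sketch of how accounting code might pair these flags with
the migrate lock so that a per-page flag and the corresponding per-memcg
counter stay consistent; memcg_dirty_counter_add() is a hypothetical helper,
not something introduced by this patch:

/* Illustrative sketch only, not part of the patch. */
static void memcg_set_page_dirty(struct page_cgroup *pc, bool dirty)
{
	/* serialize against account migration so pc->mem_cgroup is stable */
	lock_page_cgroup_migrate(pc);
	if (dirty && !PageCgroupDirty(pc)) {
		SetPageCgroupDirty(pc);
		memcg_dirty_counter_add(pc->mem_cgroup, 1);	/* hypothetical */
	} else if (!dirty && PageCgroupDirty(pc)) {
		ClearPageCgroupDirty(pc);
		memcg_dirty_counter_add(pc->mem_cgroup, -1);	/* hypothetical */
	}
	unlock_page_cgroup_migrate(pc);
}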


* [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure
  2010-03-04 10:40 ` Andrea Righi
@ 2010-03-04 10:40   ` Andrea Righi
  -1 siblings, 0 replies; 68+ messages in thread
From: Andrea Righi @ 2010-03-04 10:40 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, Balbir Singh
  Cc: Vivek Goyal, Peter Zijlstra, Trond Myklebust, Suleiman Souhlal,
	Greg Thelen, Daisuke Nishimura, Kirill A. Shutemov,
	Andrew Morton, containers, linux-kernel, linux-mm, Andrea Righi

Infrastructure to account dirty pages per cgroup and add dirty limit
interfaces in the cgroupfs:

 - Direct write-out: memory.dirty_ratio, memory.dirty_bytes

 - Background write-out: memory.dirty_background_ratio, memory.dirty_background_bytes
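
For illustration, a sketch of how a write-out path might combine these
interfaces to derive a per-cgroup dirty threshold in pages; the arithmetic
below is a simplified assumption, and the real hook-up belongs to the
following patch in the series ("memcg: dirty pages instrumentation"):

/* Illustrative sketch only. */
static unsigned long memcg_dirty_threshold(void)
{
	struct dirty_param p;
	s64 dirtyable;

	/* memcg-local settings, or the global ones as a fallback */
	get_dirty_param(&p);
	dirtyable = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
	if (dirtyable < 0)
		return 0;	/* no memcg accounting: use the global path */
	if (p.dirty_bytes)
		return p.dirty_bytes / PAGE_SIZE;
	return dirtyable * p.dirty_ratio / 100;
}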

Signed-off-by: Andrea Righi <arighi@develer.com>
---
 include/linux/memcontrol.h |   80 ++++++++-
 mm/memcontrol.c            |  420 +++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 450 insertions(+), 50 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1f9b119..cc3421b 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -19,12 +19,66 @@
 
 #ifndef _LINUX_MEMCONTROL_H
 #define _LINUX_MEMCONTROL_H
+
+#include <linux/writeback.h>
 #include <linux/cgroup.h>
+
 struct mem_cgroup;
 struct page_cgroup;
 struct page;
 struct mm_struct;
 
+/* Cgroup memory statistics items exported to the kernel */
+enum mem_cgroup_page_stat_item {
+	MEMCG_NR_DIRTYABLE_PAGES,
+	MEMCG_NR_RECLAIM_PAGES,
+	MEMCG_NR_WRITEBACK,
+	MEMCG_NR_DIRTY_WRITEBACK_PAGES,
+};
+
+/* Dirty memory parameters */
+struct dirty_param {
+	int dirty_ratio;
+	unsigned long dirty_bytes;
+	int dirty_background_ratio;
+	unsigned long dirty_background_bytes;
+};
+
+/*
+ * Statistics for memory cgroup.
+ */
+enum mem_cgroup_stat_index {
+	/*
+	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
+	 */
+	MEM_CGROUP_STAT_CACHE,	   /* # of pages charged as cache */
+	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
+	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
+	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
+	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
+	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
+	MEM_CGROUP_EVENTS,	/* incremented at every  pagein/pageout */
+	MEM_CGROUP_STAT_FILE_DIRTY,   /* # of dirty pages in page cache */
+	MEM_CGROUP_STAT_WRITEBACK,   /* # of pages under writeback */
+	MEM_CGROUP_STAT_WRITEBACK_TEMP,   /* # of pages under writeback using
+						temporary buffers */
+	MEM_CGROUP_STAT_UNSTABLE_NFS,   /* # of NFS unstable pages */
+
+	MEM_CGROUP_STAT_NSTATS,
+};
+
+/*
+ * TODO: provide a validation check routine. And retry if validation
+ * fails.
+ */
+static inline void get_global_dirty_param(struct dirty_param *param)
+{
+	param->dirty_ratio = vm_dirty_ratio;
+	param->dirty_bytes = vm_dirty_bytes;
+	param->dirty_background_ratio = dirty_background_ratio;
+	param->dirty_background_bytes = dirty_background_bytes;
+}
+
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 /*
  * All "charge" functions with gfp_mask should use GFP_KERNEL or
@@ -117,6 +171,10 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 extern int do_swap_account;
 #endif
 
+extern bool mem_cgroup_has_dirty_limit(void);
+extern void get_dirty_param(struct dirty_param *param);
+extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item);
+
 static inline bool mem_cgroup_disabled(void)
 {
 	if (mem_cgroup_subsys.disabled)
@@ -125,7 +183,8 @@ static inline bool mem_cgroup_disabled(void)
 }
 
 extern bool mem_cgroup_oom_called(struct task_struct *task);
-void mem_cgroup_update_file_mapped(struct page *page, int val);
+void mem_cgroup_update_stat(struct page *page,
+			enum mem_cgroup_stat_index idx, int val);
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask, int nid,
 						int zid);
@@ -300,8 +359,8 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
 {
 }
 
-static inline void mem_cgroup_update_file_mapped(struct page *page,
-							int val)
+static inline void mem_cgroup_update_stat(struct page *page,
+			enum mem_cgroup_stat_index idx, int val)
 {
 }
 
@@ -312,6 +371,21 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 	return 0;
 }
 
+static inline bool mem_cgroup_has_dirty_limit(void)
+{
+	return false;
+}
+
+static inline void get_dirty_param(struct dirty_param *param)
+{
+	get_global_dirty_param(param);
+}
+
+static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
+{
+	return -ENOSYS;
+}
+
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 497b6f7..9842e7b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -73,28 +73,23 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
 #define THRESHOLDS_EVENTS_THRESH (7) /* once in 128 */
 #define SOFTLIMIT_EVENTS_THRESH (10) /* once in 1024 */
 
-/*
- * Statistics for memory cgroup.
- */
-enum mem_cgroup_stat_index {
-	/*
-	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
-	 */
-	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
-	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
-	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
-	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
-	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
-	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
-	MEM_CGROUP_EVENTS,	/* incremented at every  pagein/pageout */
-
-	MEM_CGROUP_STAT_NSTATS,
-};
-
 struct mem_cgroup_stat_cpu {
 	s64 count[MEM_CGROUP_STAT_NSTATS];
 };
 
+/* Per cgroup page statistics */
+struct mem_cgroup_page_stat {
+	enum mem_cgroup_page_stat_item item;
+	s64 value;
+};
+
+enum {
+	MEM_CGROUP_DIRTY_RATIO,
+	MEM_CGROUP_DIRTY_BYTES,
+	MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
+	MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
+};
+
 /*
  * per-zone information in memory controller.
  */
@@ -208,6 +203,9 @@ struct mem_cgroup {
 
 	unsigned int	swappiness;
 
+	/* control memory cgroup dirty pages */
+	struct dirty_param dirty_param;
+
 	/* set when res.limit == memsw.limit */
 	bool		memsw_is_minimum;
 
@@ -1033,6 +1031,156 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
 	return swappiness;
 }
 
+static bool dirty_param_is_valid(struct dirty_param *param)
+{
+	if (param->dirty_ratio && param->dirty_bytes)
+		return false;
+	if (param->dirty_background_ratio && param->dirty_background_bytes)
+		return false;
+	return true;
+}
+
+static void
+__mem_cgroup_get_dirty_param(struct dirty_param *param, struct mem_cgroup *mem)
+{
+	param->dirty_ratio = mem->dirty_param.dirty_ratio;
+	param->dirty_bytes = mem->dirty_param.dirty_bytes;
+	param->dirty_background_ratio = mem->dirty_param.dirty_background_ratio;
+	param->dirty_background_bytes = mem->dirty_param.dirty_background_bytes;
+}
+
+/*
+ * get_dirty_param() - get dirty memory parameters of the current memcg
+ * @param:	the structure to be filled with the dirty memory settings
+ *
+ * The function fills @param with the current memcg dirty memory settings. If
+ * the memory cgroup is disabled, or in case of error, the structure is filled
+ * with the global dirty memory settings.
+ */
+void get_dirty_param(struct dirty_param *param)
+{
+	struct mem_cgroup *memcg;
+
+	if (mem_cgroup_disabled()) {
+		get_global_dirty_param(param);
+		return;
+	}
+	/*
+	 * It's possible that "current" is moved to another cgroup while we
+	 * access its cgroup. But a precise check is meaningless because the
+	 * task can be moved after our access, and writeback tends to take a
+	 * long time. At least, "memcg" will not be freed under rcu_read_lock().
+	 */
+	while (1) {
+		rcu_read_lock();
+		memcg = mem_cgroup_from_task(current);
+		if (likely(memcg))
+			__mem_cgroup_get_dirty_param(param, memcg);
+		else
+			get_global_dirty_param(param);
+		rcu_read_unlock();
+		/*
+		 * Since global and memcg dirty_param are not protected we try
+		 * to speculatively read them and retry if we get inconsistent
+		 * values.
+		 */
+		if (likely(dirty_param_is_valid(param)))
+			break;
+	}
+}
+
+static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
+{
+	if (!do_swap_account)
+		return nr_swap_pages > 0;
+	return !memcg->memsw_is_minimum &&
+		(res_counter_read_u64(&memcg->memsw, RES_LIMIT) > 0);
+}
+
+static s64 mem_cgroup_get_local_page_stat(struct mem_cgroup *memcg,
+				enum mem_cgroup_page_stat_item item)
+{
+	s64 ret;
+
+	switch (item) {
+	case MEMCG_NR_DIRTYABLE_PAGES:
+		ret = res_counter_read_u64(&memcg->res, RES_LIMIT) -
+			res_counter_read_u64(&memcg->res, RES_USAGE);
+		/* Translate free memory from bytes to pages */
+		ret >>= PAGE_SHIFT;
+		ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_FILE) +
+			mem_cgroup_read_stat(memcg, LRU_INACTIVE_FILE);
+		if (mem_cgroup_can_swap(memcg))
+			ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_ANON) +
+				mem_cgroup_read_stat(memcg, LRU_INACTIVE_ANON);
+		break;
+	case MEMCG_NR_RECLAIM_PAGES:
+		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_FILE_DIRTY) +
+			mem_cgroup_read_stat(memcg,
+					MEM_CGROUP_STAT_UNSTABLE_NFS);
+		break;
+	case MEMCG_NR_WRITEBACK:
+		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
+		break;
+	case MEMCG_NR_DIRTY_WRITEBACK_PAGES:
+		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK) +
+			mem_cgroup_read_stat(memcg,
+				MEM_CGROUP_STAT_UNSTABLE_NFS);
+		break;
+	default:
+		BUG_ON(1);
+	}
+	return ret;
+}
+
+static int mem_cgroup_page_stat_cb(struct mem_cgroup *mem, void *data)
+{
+	struct mem_cgroup_page_stat *stat = (struct mem_cgroup_page_stat *)data;
+
+	stat->value += mem_cgroup_get_local_page_stat(mem, stat->item);
+	return 0;
+}
+
+/*
+ * mem_cgroup_has_dirty_limit() - check if current memcg has local dirty limits
+ *
+ * Return true if the current memory cgroup has local dirty memory settings,
+ * false otherwise.
+ */
+bool mem_cgroup_has_dirty_limit(void)
+{
+	if (mem_cgroup_disabled())
+		return false;
+	return mem_cgroup_from_task(current) != NULL;
+}
+
+/*
+ * mem_cgroup_page_stat() - get memory cgroup file cache statistics
+ * @item:	memory statistic item exported to the kernel
+ *
+ * Return the accounted statistic value, or a negative value in case of error.
+ */
+s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
+{
+	struct mem_cgroup_page_stat stat = {};
+	struct mem_cgroup *memcg;
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	if (memcg) {
+		/*
+		 * Recursively evaluate page statistics against all cgroups
+		 * under the hierarchy tree
+		 */
+		stat.item = item;
+		mem_cgroup_walk_tree(memcg, &stat, mem_cgroup_page_stat_cb);
+	} else
+		stat.value = -EINVAL;
+	rcu_read_unlock();
+
+	return stat.value;
+}
+
 static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data)
 {
 	int *val = data;
@@ -1275,34 +1423,70 @@ static void record_last_oom(struct mem_cgroup *mem)
 }
 
 /*
- * Currently used to update mapped file statistics, but the routine can be
- * generalized to update other statistics as well.
+ * Generalized routine to update the file cache's status for a memcg.
+ *
+ * Before calling this, mapping->tree_lock should be held and preemption must
+ * be disabled.  Then it's guaranteed that the page is not uncharged while we
+ * access page_cgroup. We can make use of that.
  */
-void mem_cgroup_update_file_mapped(struct page *page, int val)
+void mem_cgroup_update_stat(struct page *page,
+			enum mem_cgroup_stat_index idx, int val)
 {
 	struct mem_cgroup *mem;
 	struct page_cgroup *pc;
 
+	if (mem_cgroup_disabled())
+		return;
 	pc = lookup_page_cgroup(page);
-	if (unlikely(!pc))
+	if (unlikely(!pc) || !PageCgroupUsed(pc))
 		return;
 
-	lock_page_cgroup(pc);
-	mem = pc->mem_cgroup;
-	if (!mem)
-		goto done;
-
-	if (!PageCgroupUsed(pc))
-		goto done;
-
+	lock_page_cgroup_migrate(pc);
 	/*
-	 * Preemption is already disabled. We can use __this_cpu_xxx
-	 */
-	__this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], val);
-
-done:
-	unlock_page_cgroup(pc);
+	 * It's guaranteed that this page is never uncharged.
+	 * The only race is moving the account among memcgs.
+	 */
+	switch (idx) {
+	case MEM_CGROUP_STAT_FILE_MAPPED:
+		if (val > 0)
+			SetPageCgroupFileMapped(pc);
+		else
+			ClearPageCgroupFileMapped(pc);
+		break;
+	case MEM_CGROUP_STAT_FILE_DIRTY:
+		if (val > 0)
+			SetPageCgroupDirty(pc);
+		else
+			ClearPageCgroupDirty(pc);
+		break;
+	case MEM_CGROUP_STAT_WRITEBACK:
+		if (val > 0)
+			SetPageCgroupWriteback(pc);
+		else
+			ClearPageCgroupWriteback(pc);
+		break;
+	case MEM_CGROUP_STAT_WRITEBACK_TEMP:
+		if (val > 0)
+			SetPageCgroupWritebackTemp(pc);
+		else
+			ClearPageCgroupWritebackTemp(pc);
+		break;
+	case MEM_CGROUP_STAT_UNSTABLE_NFS:
+		if (val > 0)
+			SetPageCgroupUnstableNFS(pc);
+		else
+			ClearPageCgroupUnstableNFS(pc);
+		break;
+	default:
+		BUG();
+		break;
+	}
+	mem = pc->mem_cgroup;
+	if (likely(mem))
+		__this_cpu_add(mem->stat->count[idx], val);
+	unlock_page_cgroup_migrate(pc);
 }
+EXPORT_SYMBOL_GPL(mem_cgroup_update_stat);
 
 /*
  * size of first charge trial. "32" comes from vmscan.c's magic value.
@@ -1701,6 +1885,45 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
 	memcg_check_events(mem, pc->page);
 }
 
+/*
+ * Update the file cache's accounted statistics on task migration.
+ *
+ * TODO: we don't move charges of file (including shmem/tmpfs) pages for now,
+ * so at the moment this function always returns early without updating the
+ * accounted statistics, because only anonymous pages are moved here.
+ */
+static void __mem_cgroup_update_file_stat(struct page_cgroup *pc,
+	struct mem_cgroup *from, struct mem_cgroup *to)
+{
+	struct page *page = pc->page;
+
+	if (!page_mapped(page) || PageAnon(page))
+		return;
+
+	if (PageCgroupFileMapped(pc)) {
+		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
+		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
+	}
+	if (PageCgroupDirty(pc)) {
+		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_DIRTY]);
+		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_DIRTY]);
+	}
+	if (PageCgroupWriteback(pc)) {
+		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_WRITEBACK]);
+		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WRITEBACK]);
+	}
+	if (PageCgroupWritebackTemp(pc)) {
+		__this_cpu_dec(
+			from->stat->count[MEM_CGROUP_STAT_WRITEBACK_TEMP]);
+		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WRITEBACK_TEMP]);
+	}
+	if (PageCgroupUnstableNFS(pc)) {
+		__this_cpu_dec(
+			from->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
+		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
+	}
+}
+
 /**
  * __mem_cgroup_move_account - move account of the page
  * @pc:	page_cgroup of the page.
@@ -1721,22 +1944,16 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
 static void __mem_cgroup_move_account(struct page_cgroup *pc,
 	struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
 {
-	struct page *page;
-
 	VM_BUG_ON(from == to);
 	VM_BUG_ON(PageLRU(pc->page));
 	VM_BUG_ON(!PageCgroupLocked(pc));
 	VM_BUG_ON(!PageCgroupUsed(pc));
 	VM_BUG_ON(pc->mem_cgroup != from);
 
-	page = pc->page;
-	if (page_mapped(page) && !PageAnon(page)) {
-		/* Update mapped_file data for mem_cgroup */
-		preempt_disable();
-		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		preempt_enable();
-	}
+	preempt_disable();
+	lock_page_cgroup_migrate(pc);
+	__mem_cgroup_update_file_stat(pc, from, to);
+
 	mem_cgroup_charge_statistics(from, pc, false);
 	if (uncharge)
 		/* This is not "cancel", but cancel_charge does all we need. */
@@ -1745,6 +1962,8 @@ static void __mem_cgroup_move_account(struct page_cgroup *pc,
 	/* caller should have done css_get */
 	pc->mem_cgroup = to;
 	mem_cgroup_charge_statistics(to, pc, true);
+	unlock_page_cgroup_migrate(pc);
+	preempt_enable();
 	/*
 	 * We charges against "to" which may not have any tasks. Then, "to"
 	 * can be under rmdir(). But in current implementation, caller of
@@ -3042,6 +3261,10 @@ enum {
 	MCS_PGPGIN,
 	MCS_PGPGOUT,
 	MCS_SWAP,
+	MCS_FILE_DIRTY,
+	MCS_WRITEBACK,
+	MCS_WRITEBACK_TEMP,
+	MCS_UNSTABLE_NFS,
 	MCS_INACTIVE_ANON,
 	MCS_ACTIVE_ANON,
 	MCS_INACTIVE_FILE,
@@ -3064,6 +3287,10 @@ struct {
 	{"pgpgin", "total_pgpgin"},
 	{"pgpgout", "total_pgpgout"},
 	{"swap", "total_swap"},
+	{"filedirty", "dirty_pages"},
+	{"writeback", "writeback_pages"},
+	{"writeback_tmp", "writeback_temp_pages"},
+	{"nfs", "nfs_unstable"},
 	{"inactive_anon", "total_inactive_anon"},
 	{"active_anon", "total_active_anon"},
 	{"inactive_file", "total_inactive_file"},
@@ -3092,6 +3319,14 @@ static int mem_cgroup_get_local_stat(struct mem_cgroup *mem, void *data)
 		val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
 		s->stat[MCS_SWAP] += val * PAGE_SIZE;
 	}
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
+	s->stat[MCS_FILE_DIRTY] += val;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK);
+	s->stat[MCS_WRITEBACK] += val;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK_TEMP);
+	s->stat[MCS_WRITEBACK_TEMP] += val;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_UNSTABLE_NFS);
+	s->stat[MCS_UNSTABLE_NFS] += val;
 
 	/* per zone stat */
 	val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
@@ -3453,6 +3688,60 @@ unlock:
 	return ret;
 }
 
+static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+
+	switch (cft->private) {
+	case MEM_CGROUP_DIRTY_RATIO:
+		return memcg->dirty_param.dirty_ratio;
+	case MEM_CGROUP_DIRTY_BYTES:
+		return memcg->dirty_param.dirty_bytes;
+	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
+		return memcg->dirty_param.dirty_background_ratio;
+	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
+		return memcg->dirty_param.dirty_background_bytes;
+	default:
+		BUG();
+	}
+}
+
+static int
+mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	int type = cft->private;
+
+	if (cgrp->parent == NULL)
+		return -EINVAL;
+	if ((type == MEM_CGROUP_DIRTY_RATIO ||
+		type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)
+		return -EINVAL;
+	/*
+	 * TODO: provide a validation check routine. And retry if validation
+	 * fails.
+	 */
+	switch (type) {
+	case MEM_CGROUP_DIRTY_RATIO:
+		memcg->dirty_param.dirty_ratio = val;
+		memcg->dirty_param.dirty_bytes = 0;
+		break;
+	case MEM_CGROUP_DIRTY_BYTES:
+		memcg->dirty_param.dirty_ratio  = 0;
+		memcg->dirty_param.dirty_bytes = val;
+		break;
+	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
+		memcg->dirty_param.dirty_background_ratio = val;
+		memcg->dirty_param.dirty_background_bytes = 0;
+		break;
+	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
+		memcg->dirty_param.dirty_background_ratio = 0;
+		memcg->dirty_param.dirty_background_bytes = val;
+		break;
+	}
+	return 0;
+}
+
 static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
@@ -3504,6 +3793,30 @@ static struct cftype mem_cgroup_files[] = {
 		.write_u64 = mem_cgroup_swappiness_write,
 	},
 	{
+		.name = "dirty_ratio",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_u64 = mem_cgroup_dirty_write,
+		.private = MEM_CGROUP_DIRTY_RATIO,
+	},
+	{
+		.name = "dirty_bytes",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_u64 = mem_cgroup_dirty_write,
+		.private = MEM_CGROUP_DIRTY_BYTES,
+	},
+	{
+		.name = "dirty_background_ratio",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_u64 = mem_cgroup_dirty_write,
+		.private = MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
+	},
+	{
+		.name = "dirty_background_bytes",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_u64 = mem_cgroup_dirty_write,
+		.private = MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
+	},
+	{
 		.name = "move_charge_at_immigrate",
 		.read_u64 = mem_cgroup_move_charge_read,
 		.write_u64 = mem_cgroup_move_charge_write,
@@ -3762,8 +4075,21 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	mem->last_scanned_child = 0;
 	spin_lock_init(&mem->reclaim_param_lock);
 
-	if (parent)
+	if (parent) {
 		mem->swappiness = get_swappiness(parent);
+		mem->dirty_param = parent->dirty_param;
+	} else {
+		while (1) {
+			get_global_dirty_param(&mem->dirty_param);
+			/*
+			 * Since global dirty parameters are not protected we
+			 * try to speculatively read them and retry if we get
+			 * inconsistent values.
+			 */
+			if (likely(dirty_param_is_valid(&mem->dirty_param)))
+				break;
+		}
+	}
 	atomic_set(&mem->refcnt, 1);
 	mem->move_charge_at_immigrate = 0;
 	mutex_init(&mem->thresholds_lock);
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure
@ 2010-03-04 10:40   ` Andrea Righi
  0 siblings, 0 replies; 68+ messages in thread
From: Andrea Righi @ 2010-03-04 10:40 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, Balbir Singh
  Cc: Vivek Goyal, Peter Zijlstra, Trond Myklebust, Suleiman Souhlal,
	Greg Thelen, Daisuke Nishimura, Kirill A. Shutemov,
	Andrew Morton, containers, linux-kernel, linux-mm, Andrea Righi

Infrastructure to account dirty pages per cgroup and to add the dirty limit
interfaces in cgroupfs (see the usage sketch after the list):

 - Direct write-out: memory.dirty_ratio, memory.dirty_bytes

 - Background write-out: memory.dirty_background_ratio, memory.dirty_background_bytes
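
For illustration only (this is not part of the patch): a minimal userspace
sketch of how these files could be driven, assuming the memory cgroup
hierarchy is mounted at /cgroup and a child group "foo" already exists (both
the mount point and the group name are made up here):

#include <stdio.h>

int main(void)
{
	/* hypothetical path: depends on where the memory cgroup is mounted */
	FILE *f = fopen("/cgroup/foo/memory.dirty_ratio", "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	/*
	 * Limit the dirty page cache of this group to 10% of its dirtyable
	 * memory. As in mem_cgroup_dirty_write(), setting a ratio clears the
	 * corresponding *_bytes value, and vice versa.
	 */
	fprintf(f, "10\n");
	return fclose(f) ? 1 : 0;
}

The other three files are written the same way; writes to the root cgroup are
rejected with -EINVAL, since the root simply inherits the global settings at
creation time.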

Signed-off-by: Andrea Righi <arighi@develer.com>
---
 include/linux/memcontrol.h |   80 ++++++++-
 mm/memcontrol.c            |  420 +++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 450 insertions(+), 50 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1f9b119..cc3421b 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -19,12 +19,66 @@
 
 #ifndef _LINUX_MEMCONTROL_H
 #define _LINUX_MEMCONTROL_H
+
+#include <linux/writeback.h>
 #include <linux/cgroup.h>
+
 struct mem_cgroup;
 struct page_cgroup;
 struct page;
 struct mm_struct;
 
+/* Cgroup memory statistics items exported to the kernel */
+enum mem_cgroup_page_stat_item {
+	MEMCG_NR_DIRTYABLE_PAGES,
+	MEMCG_NR_RECLAIM_PAGES,
+	MEMCG_NR_WRITEBACK,
+	MEMCG_NR_DIRTY_WRITEBACK_PAGES,
+};
+
+/* Dirty memory parameters */
+struct dirty_param {
+	int dirty_ratio;
+	unsigned long dirty_bytes;
+	int dirty_background_ratio;
+	unsigned long dirty_background_bytes;
+};
+
+/*
+ * Statistics for memory cgroup.
+ */
+enum mem_cgroup_stat_index {
+	/*
+	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
+	 */
+	MEM_CGROUP_STAT_CACHE,	   /* # of pages charged as cache */
+	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
+	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
+	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
+	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
+	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
+	MEM_CGROUP_EVENTS,	/* incremented at every  pagein/pageout */
+	MEM_CGROUP_STAT_FILE_DIRTY,   /* # of dirty pages in page cache */
+	MEM_CGROUP_STAT_WRITEBACK,   /* # of pages under writeback */
+	MEM_CGROUP_STAT_WRITEBACK_TEMP,   /* # of pages under writeback using
+						temporary buffers */
+	MEM_CGROUP_STAT_UNSTABLE_NFS,   /* # of NFS unstable pages */
+
+	MEM_CGROUP_STAT_NSTATS,
+};
+
+/*
+ * TODO: provide a validation check routine. And retry if validation
+ * fails.
+ */
+static inline void get_global_dirty_param(struct dirty_param *param)
+{
+	param->dirty_ratio = vm_dirty_ratio;
+	param->dirty_bytes = vm_dirty_bytes;
+	param->dirty_background_ratio = dirty_background_ratio;
+	param->dirty_background_bytes = dirty_background_bytes;
+}
+
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
 /*
  * All "charge" functions with gfp_mask should use GFP_KERNEL or
@@ -117,6 +171,10 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 extern int do_swap_account;
 #endif
 
+extern bool mem_cgroup_has_dirty_limit(void);
+extern void get_dirty_param(struct dirty_param *param);
+extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item);
+
 static inline bool mem_cgroup_disabled(void)
 {
 	if (mem_cgroup_subsys.disabled)
@@ -125,7 +183,8 @@ static inline bool mem_cgroup_disabled(void)
 }
 
 extern bool mem_cgroup_oom_called(struct task_struct *task);
-void mem_cgroup_update_file_mapped(struct page *page, int val);
+void mem_cgroup_update_stat(struct page *page,
+			enum mem_cgroup_stat_index idx, int val);
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask, int nid,
 						int zid);
@@ -300,8 +359,8 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
 {
 }
 
-static inline void mem_cgroup_update_file_mapped(struct page *page,
-							int val)
+static inline void mem_cgroup_update_stat(struct page *page,
+			enum mem_cgroup_stat_index idx, int val)
 {
 }
 
@@ -312,6 +371,21 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 	return 0;
 }
 
+static inline bool mem_cgroup_has_dirty_limit(void)
+{
+	return false;
+}
+
+static inline void get_dirty_param(struct dirty_param *param)
+{
+	get_global_dirty_param(param);
+}
+
+static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
+{
+	return -ENOSYS;
+}
+
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 497b6f7..9842e7b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -73,28 +73,23 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
 #define THRESHOLDS_EVENTS_THRESH (7) /* once in 128 */
 #define SOFTLIMIT_EVENTS_THRESH (10) /* once in 1024 */
 
-/*
- * Statistics for memory cgroup.
- */
-enum mem_cgroup_stat_index {
-	/*
-	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
-	 */
-	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
-	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
-	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
-	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
-	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
-	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
-	MEM_CGROUP_EVENTS,	/* incremented at every  pagein/pageout */
-
-	MEM_CGROUP_STAT_NSTATS,
-};
-
 struct mem_cgroup_stat_cpu {
 	s64 count[MEM_CGROUP_STAT_NSTATS];
 };
 
+/* Per cgroup page statistics */
+struct mem_cgroup_page_stat {
+	enum mem_cgroup_page_stat_item item;
+	s64 value;
+};
+
+enum {
+	MEM_CGROUP_DIRTY_RATIO,
+	MEM_CGROUP_DIRTY_BYTES,
+	MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
+	MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
+};
+
 /*
  * per-zone information in memory controller.
  */
@@ -208,6 +203,9 @@ struct mem_cgroup {
 
 	unsigned int	swappiness;
 
+	/* control memory cgroup dirty pages */
+	struct dirty_param dirty_param;
+
 	/* set when res.limit == memsw.limit */
 	bool		memsw_is_minimum;
 
@@ -1033,6 +1031,156 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
 	return swappiness;
 }
 
+static bool dirty_param_is_valid(struct dirty_param *param)
+{
+	if (param->dirty_ratio && param->dirty_bytes)
+		return false;
+	if (param->dirty_background_ratio && param->dirty_background_bytes)
+		return false;
+	return true;
+}
+
+static void
+__mem_cgroup_get_dirty_param(struct dirty_param *param, struct mem_cgroup *mem)
+{
+	param->dirty_ratio = mem->dirty_param.dirty_ratio;
+	param->dirty_bytes = mem->dirty_param.dirty_bytes;
+	param->dirty_background_ratio = mem->dirty_param.dirty_background_ratio;
+	param->dirty_background_bytes = mem->dirty_param.dirty_background_bytes;
+}
+
+/*
+ * get_dirty_param() - get dirty memory parameters of the current memcg
+ * @param:	the structure to be filled with the dirty memory settings
+ *
+ * The function fills @param with the current memcg dirty memory settings. If
+ * the memory cgroup is disabled, or in case of error, the structure is
+ * filled with the global dirty memory settings.
+ */
+void get_dirty_param(struct dirty_param *param)
+{
+	struct mem_cgroup *memcg;
+
+	if (mem_cgroup_disabled()) {
+		get_global_dirty_param(param);
+		return;
+	}
+	/*
+	 * It's possible that "current" is moved to another cgroup while we
+	 * access the cgroup. But a precise check is meaningless because the
+	 * task can be moved right after our access and writeback tends to take
+	 * a long time. At least "memcg" won't be freed under rcu_read_lock().
+	 */
+	while (1) {
+		rcu_read_lock();
+		memcg = mem_cgroup_from_task(current);
+		if (likely(memcg))
+			__mem_cgroup_get_dirty_param(param, memcg);
+		else
+			get_global_dirty_param(param);
+		rcu_read_unlock();
+		/*
+		 * Since global and memcg dirty_param are not protected we try
+		 * to speculatively read them and retry if we get inconsistent
+		 * values.
+		 */
+		if (likely(dirty_param_is_valid(param)))
+			break;
+	}
+}
+
+static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
+{
+	if (!do_swap_account)
+		return nr_swap_pages > 0;
+	return !memcg->memsw_is_minimum &&
+		(res_counter_read_u64(&memcg->memsw, RES_LIMIT) > 0);
+}
+
+static s64 mem_cgroup_get_local_page_stat(struct mem_cgroup *memcg,
+				enum mem_cgroup_page_stat_item item)
+{
+	s64 ret;
+
+	switch (item) {
+	case MEMCG_NR_DIRTYABLE_PAGES:
+		ret = res_counter_read_u64(&memcg->res, RES_LIMIT) -
+			res_counter_read_u64(&memcg->res, RES_USAGE);
+		/* Translate free memory into pages */
+		ret >>= PAGE_SHIFT;
+		ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_FILE) +
+			mem_cgroup_read_stat(memcg, LRU_INACTIVE_FILE);
+		if (mem_cgroup_can_swap(memcg))
+			ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_ANON) +
+				mem_cgroup_read_stat(memcg, LRU_INACTIVE_ANON);
+		break;
+	case MEMCG_NR_RECLAIM_PAGES:
+		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_FILE_DIRTY) +
+			mem_cgroup_read_stat(memcg,
+					MEM_CGROUP_STAT_UNSTABLE_NFS);
+		break;
+	case MEMCG_NR_WRITEBACK:
+		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
+		break;
+	case MEMCG_NR_DIRTY_WRITEBACK_PAGES:
+		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK) +
+			mem_cgroup_read_stat(memcg,
+				MEM_CGROUP_STAT_UNSTABLE_NFS);
+		break;
+	default:
+		BUG_ON(1);
+	}
+	return ret;
+}
+
+static int mem_cgroup_page_stat_cb(struct mem_cgroup *mem, void *data)
+{
+	struct mem_cgroup_page_stat *stat = (struct mem_cgroup_page_stat *)data;
+
+	stat->value += mem_cgroup_get_local_page_stat(mem, stat->item);
+	return 0;
+}
+
+/*
+ * mem_cgroup_has_dirty_limit() - check if current memcg has local dirty limits
+ *
+ * Return true if the current memory cgroup has local dirty memory settings,
+ * false otherwise.
+ */
+bool mem_cgroup_has_dirty_limit(void)
+{
+	if (mem_cgroup_disabled())
+		return false;
+	return mem_cgroup_from_task(current) != NULL;
+}
+
+/*
+ * mem_cgroup_page_stat() - get memory cgroup file cache statistics
+ * @item:	memory statistic item exported to the kernel
+ *
+ * Return the accounted statistic value, or a negative value in case of error.
+ */
+s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
+{
+	struct mem_cgroup_page_stat stat = {};
+	struct mem_cgroup *memcg;
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	if (memcg) {
+		/*
+		 * Recursively evaluate page statistics against all cgroups
+		 * under the hierarchy tree
+		 */
+		stat.item = item;
+		mem_cgroup_walk_tree(memcg, &stat, mem_cgroup_page_stat_cb);
+	} else
+		stat.value = -EINVAL;
+	rcu_read_unlock();
+
+	return stat.value;
+}
+
 static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data)
 {
 	int *val = data;
@@ -1275,34 +1423,70 @@ static void record_last_oom(struct mem_cgroup *mem)
 }
 
 /*
- * Currently used to update mapped file statistics, but the routine can be
- * generalized to update other statistics as well.
+ * Generalized routine to update the file cache's status for a memcg.
+ *
+ * Before calling this, mapping->tree_lock should be held and preemption must
+ * be disabled.  Then it's guaranteed that the page is not uncharged while we
+ * access page_cgroup. We can make use of that.
  */
-void mem_cgroup_update_file_mapped(struct page *page, int val)
+void mem_cgroup_update_stat(struct page *page,
+			enum mem_cgroup_stat_index idx, int val)
 {
 	struct mem_cgroup *mem;
 	struct page_cgroup *pc;
 
+	if (mem_cgroup_disabled())
+		return;
 	pc = lookup_page_cgroup(page);
-	if (unlikely(!pc))
+	if (unlikely(!pc) || !PageCgroupUsed(pc))
 		return;
 
-	lock_page_cgroup(pc);
-	mem = pc->mem_cgroup;
-	if (!mem)
-		goto done;
-
-	if (!PageCgroupUsed(pc))
-		goto done;
-
+	lock_page_cgroup_migrate(pc);
 	/*
-	 * Preemption is already disabled. We can use __this_cpu_xxx
-	 */
-	__this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], val);
-
-done:
-	unlock_page_cgroup(pc);
+	 * It's guaranteed that this page is never uncharged.
+	 * The only race is moving the account among memcgs.
+	 */
+	switch (idx) {
+	case MEM_CGROUP_STAT_FILE_MAPPED:
+		if (val > 0)
+			SetPageCgroupFileMapped(pc);
+		else
+			ClearPageCgroupFileMapped(pc);
+		break;
+	case MEM_CGROUP_STAT_FILE_DIRTY:
+		if (val > 0)
+			SetPageCgroupDirty(pc);
+		else
+			ClearPageCgroupDirty(pc);
+		break;
+	case MEM_CGROUP_STAT_WRITEBACK:
+		if (val > 0)
+			SetPageCgroupWriteback(pc);
+		else
+			ClearPageCgroupWriteback(pc);
+		break;
+	case MEM_CGROUP_STAT_WRITEBACK_TEMP:
+		if (val > 0)
+			SetPageCgroupWritebackTemp(pc);
+		else
+			ClearPageCgroupWritebackTemp(pc);
+		break;
+	case MEM_CGROUP_STAT_UNSTABLE_NFS:
+		if (val > 0)
+			SetPageCgroupUnstableNFS(pc);
+		else
+			ClearPageCgroupUnstableNFS(pc);
+		break;
+	default:
+		BUG();
+		break;
+	}
+	mem = pc->mem_cgroup;
+	if (likely(mem))
+		__this_cpu_add(mem->stat->count[idx], val);
+	unlock_page_cgroup_migrate(pc);
 }
+EXPORT_SYMBOL_GPL(mem_cgroup_update_stat);
 
 /*
  * size of first charge trial. "32" comes from vmscan.c's magic value.
@@ -1701,6 +1885,45 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
 	memcg_check_events(mem, pc->page);
 }
 
+/*
+ * Update the file cache's accounted statistics on task migration.
+ *
+ * TODO: we don't move charges of file (including shmem/tmpfs) pages for now,
+ * so at the moment this function always returns early without updating the
+ * accounted statistics, because only anonymous pages are moved here.
+ */
+static void __mem_cgroup_update_file_stat(struct page_cgroup *pc,
+	struct mem_cgroup *from, struct mem_cgroup *to)
+{
+	struct page *page = pc->page;
+
+	if (!page_mapped(page) || PageAnon(page))
+		return;
+
+	if (PageCgroupFileMapped(pc)) {
+		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
+		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
+	}
+	if (PageCgroupDirty(pc)) {
+		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_DIRTY]);
+		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_DIRTY]);
+	}
+	if (PageCgroupWriteback(pc)) {
+		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_WRITEBACK]);
+		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WRITEBACK]);
+	}
+	if (PageCgroupWritebackTemp(pc)) {
+		__this_cpu_dec(
+			from->stat->count[MEM_CGROUP_STAT_WRITEBACK_TEMP]);
+		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WRITEBACK_TEMP]);
+	}
+	if (PageCgroupUnstableNFS(pc)) {
+		__this_cpu_dec(
+			from->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
+		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
+	}
+}
+
 /**
  * __mem_cgroup_move_account - move account of the page
  * @pc:	page_cgroup of the page.
@@ -1721,22 +1944,16 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
 static void __mem_cgroup_move_account(struct page_cgroup *pc,
 	struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
 {
-	struct page *page;
-
 	VM_BUG_ON(from == to);
 	VM_BUG_ON(PageLRU(pc->page));
 	VM_BUG_ON(!PageCgroupLocked(pc));
 	VM_BUG_ON(!PageCgroupUsed(pc));
 	VM_BUG_ON(pc->mem_cgroup != from);
 
-	page = pc->page;
-	if (page_mapped(page) && !PageAnon(page)) {
-		/* Update mapped_file data for mem_cgroup */
-		preempt_disable();
-		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		preempt_enable();
-	}
+	preempt_disable();
+	lock_page_cgroup_migrate(pc);
+	__mem_cgroup_update_file_stat(pc, from, to);
+
 	mem_cgroup_charge_statistics(from, pc, false);
 	if (uncharge)
 		/* This is not "cancel", but cancel_charge does all we need. */
@@ -1745,6 +1962,8 @@ static void __mem_cgroup_move_account(struct page_cgroup *pc,
 	/* caller should have done css_get */
 	pc->mem_cgroup = to;
 	mem_cgroup_charge_statistics(to, pc, true);
+	unlock_page_cgroup_migrate(pc);
+	preempt_enable();
 	/*
 	 * We charges against "to" which may not have any tasks. Then, "to"
 	 * can be under rmdir(). But in current implementation, caller of
@@ -3042,6 +3261,10 @@ enum {
 	MCS_PGPGIN,
 	MCS_PGPGOUT,
 	MCS_SWAP,
+	MCS_FILE_DIRTY,
+	MCS_WRITEBACK,
+	MCS_WRITEBACK_TEMP,
+	MCS_UNSTABLE_NFS,
 	MCS_INACTIVE_ANON,
 	MCS_ACTIVE_ANON,
 	MCS_INACTIVE_FILE,
@@ -3064,6 +3287,10 @@ struct {
 	{"pgpgin", "total_pgpgin"},
 	{"pgpgout", "total_pgpgout"},
 	{"swap", "total_swap"},
+	{"filedirty", "dirty_pages"},
+	{"writeback", "writeback_pages"},
+	{"writeback_tmp", "writeback_temp_pages"},
+	{"nfs", "nfs_unstable"},
 	{"inactive_anon", "total_inactive_anon"},
 	{"active_anon", "total_active_anon"},
 	{"inactive_file", "total_inactive_file"},
@@ -3092,6 +3319,14 @@ static int mem_cgroup_get_local_stat(struct mem_cgroup *mem, void *data)
 		val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
 		s->stat[MCS_SWAP] += val * PAGE_SIZE;
 	}
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
+	s->stat[MCS_FILE_DIRTY] += val;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK);
+	s->stat[MCS_WRITEBACK] += val;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK_TEMP);
+	s->stat[MCS_WRITEBACK_TEMP] += val;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_UNSTABLE_NFS);
+	s->stat[MCS_UNSTABLE_NFS] += val;
 
 	/* per zone stat */
 	val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
@@ -3453,6 +3688,60 @@ unlock:
 	return ret;
 }
 
+static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+
+	switch (cft->private) {
+	case MEM_CGROUP_DIRTY_RATIO:
+		return memcg->dirty_param.dirty_ratio;
+	case MEM_CGROUP_DIRTY_BYTES:
+		return memcg->dirty_param.dirty_bytes;
+	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
+		return memcg->dirty_param.dirty_background_ratio;
+	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
+		return memcg->dirty_param.dirty_background_bytes;
+	default:
+		BUG();
+	}
+}
+
+static int
+mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	int type = cft->private;
+
+	if (cgrp->parent == NULL)
+		return -EINVAL;
+	if ((type == MEM_CGROUP_DIRTY_RATIO ||
+		type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)
+		return -EINVAL;
+	/*
+	 * TODO: provide a validation check routine. And retry if validation
+	 * fails.
+	 */
+	switch (type) {
+	case MEM_CGROUP_DIRTY_RATIO:
+		memcg->dirty_param.dirty_ratio = val;
+		memcg->dirty_param.dirty_bytes = 0;
+		break;
+	case MEM_CGROUP_DIRTY_BYTES:
+		memcg->dirty_param.dirty_ratio  = 0;
+		memcg->dirty_param.dirty_bytes = val;
+		break;
+	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
+		memcg->dirty_param.dirty_background_ratio = val;
+		memcg->dirty_param.dirty_background_bytes = 0;
+		break;
+	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
+		memcg->dirty_param.dirty_background_ratio = 0;
+		memcg->dirty_param.dirty_background_bytes = val;
+		break;
+	}
+	return 0;
+}
+
 static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
@@ -3504,6 +3793,30 @@ static struct cftype mem_cgroup_files[] = {
 		.write_u64 = mem_cgroup_swappiness_write,
 	},
 	{
+		.name = "dirty_ratio",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_u64 = mem_cgroup_dirty_write,
+		.private = MEM_CGROUP_DIRTY_RATIO,
+	},
+	{
+		.name = "dirty_bytes",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_u64 = mem_cgroup_dirty_write,
+		.private = MEM_CGROUP_DIRTY_BYTES,
+	},
+	{
+		.name = "dirty_background_ratio",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_u64 = mem_cgroup_dirty_write,
+		.private = MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
+	},
+	{
+		.name = "dirty_background_bytes",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_u64 = mem_cgroup_dirty_write,
+		.private = MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
+	},
+	{
 		.name = "move_charge_at_immigrate",
 		.read_u64 = mem_cgroup_move_charge_read,
 		.write_u64 = mem_cgroup_move_charge_write,
@@ -3762,8 +4075,21 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	mem->last_scanned_child = 0;
 	spin_lock_init(&mem->reclaim_param_lock);
 
-	if (parent)
+	if (parent) {
 		mem->swappiness = get_swappiness(parent);
+		mem->dirty_param = parent->dirty_param;
+	} else {
+		while (1) {
+			get_global_dirty_param(&mem->dirty_param);
+			/*
+			 * Since global dirty parameters are not protected we
+			 * try to speculatively read them and retry if we get
+			 * inconsistent values.
+			 */
+			if (likely(dirty_param_is_valid(&mem->dirty_param)))
+				break;
+		}
+	}
 	atomic_set(&mem->refcnt, 1);
 	mem->move_charge_at_immigrate = 0;
 	mutex_init(&mem->thresholds_lock);
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH -mmotm 4/4] memcg: dirty pages instrumentation
       [not found] ` <1267699215-4101-1-git-send-email-arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
                     ` (2 preceding siblings ...)
  2010-03-04 10:40   ` [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure Andrea Righi
@ 2010-03-04 10:40   ` Andrea Righi
  2010-03-04 17:11   ` [PATCH -mmotm 0/4] memcg: per cgroup dirty limit (v4) Balbir Singh
  4 siblings, 0 replies; 68+ messages in thread
From: Andrea Righi @ 2010-03-04 10:40 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, Balbir Singh
  Cc: Andrea Righi, Daisuke Nishimura,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Trond Myklebust,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Vivek Goyal

Apply the cgroup dirty page accounting and limiting infrastructure
to the appropriate kernel functions.
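
As a reading aid (not part of the patch): every hunk below follows the same
pattern, i.e. the memcg counter introduced in the previous patch is updated
next to the existing zone counter, under the same test. A hypothetical helper
for the dirty-page case would look like the sketch below; account_dirty_page()
itself is made up, while the two calls it wraps mirror the ones used in the
hunks.

#include <linux/memcontrol.h>
#include <linux/mm.h>
#include <linux/vmstat.h>

/* illustrative only: pair the memcg stat with the zone stat at each site */
static inline void account_dirty_page(struct page *page, int val)
{
	/* memcg-side accounting added by this series */
	mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, val);
	/* existing global accounting */
	if (val > 0)
		inc_zone_page_state(page, NR_FILE_DIRTY);
	else
		dec_zone_page_state(page, NR_FILE_DIRTY);
}

The writeback, writeback-temp and unstable-NFS counters are handled in the
same way at the corresponding inc/dec_zone_page_state() call sites.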

Signed-off-by: Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
---
 fs/fuse/file.c      |    5 +++
 fs/nfs/write.c      |    4 ++
 fs/nilfs2/segment.c |   11 +++++-
 mm/filemap.c        |    1 +
 mm/page-writeback.c |   91 ++++++++++++++++++++++++++++++++++-----------------
 mm/rmap.c           |    4 +-
 mm/truncate.c       |    2 +
 7 files changed, 84 insertions(+), 34 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index a9f5e13..dbbdd53 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -11,6 +11,7 @@
 #include <linux/pagemap.h>
 #include <linux/slab.h>
 #include <linux/kernel.h>
+#include <linux/memcontrol.h>
 #include <linux/sched.h>
 #include <linux/module.h>
 
@@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
 
 	list_del(&req->writepages_entry);
 	dec_bdi_stat(bdi, BDI_WRITEBACK);
+	mem_cgroup_update_stat(req->pages[0],
+			MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
 	dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
 	bdi_writeout_inc(bdi);
 	wake_up(&fi->page_waitq);
@@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
 	req->inode = inode;
 
 	inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
+	mem_cgroup_update_stat(tmp_page,
+			MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
 	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
 	end_page_writeback(page);
 
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index b753242..7316f7a 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
 			req->wb_index,
 			NFS_PAGE_TAG_COMMIT);
 	spin_unlock(&inode->i_lock);
+	mem_cgroup_update_stat(req->wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1);
 	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
 	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
 	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
@@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
 	struct page *page = req->wb_page;
 
 	if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
 		dec_zone_page_state(page, NR_UNSTABLE_NFS);
 		dec_bdi_stat(page->mapping->backing_dev_info, BDI_UNSTABLE);
 		return 1;
@@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
 		req = nfs_list_entry(head->next);
 		nfs_list_remove_request(req);
 		nfs_mark_request_commit(req);
+		mem_cgroup_update_stat(req->wb_page,
+				MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
 		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
 				BDI_UNSTABLE);
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index ada2f1b..27a01b1 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -24,6 +24,7 @@
 #include <linux/pagemap.h>
 #include <linux/buffer_head.h>
 #include <linux/writeback.h>
+#include <linux/memcontrol.h>
 #include <linux/bio.h>
 #include <linux/completion.h>
 #include <linux/blkdev.h>
@@ -1660,8 +1661,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
 	} while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
 	kunmap_atomic(kaddr, KM_USER0);
 
-	if (!TestSetPageWriteback(clone_page))
+	if (!TestSetPageWriteback(clone_page)) {
+		mem_cgroup_update_stat(clone_page,
+				MEM_CGROUP_STAT_WRITEBACK, 1);
 		inc_zone_page_state(clone_page, NR_WRITEBACK);
+	}
 	unlock_page(clone_page);
 
 	return 0;
@@ -1783,8 +1787,11 @@ static void __nilfs_end_page_io(struct page *page, int err)
 	}
 
 	if (buffer_nilfs_allocated(page_buffers(page))) {
-		if (TestClearPageWriteback(page))
+		if (TestClearPageWriteback(page)) {
+			mem_cgroup_update_stat(page,
+					MEM_CGROUP_STAT_WRITEBACK, -1);
 			dec_zone_page_state(page, NR_WRITEBACK);
+		}
 	} else
 		end_page_writeback(page);
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index fe09e51..f85acae 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
 	 * having removed the page entirely.
 	 */
 	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
 		dec_zone_page_state(page, NR_FILE_DIRTY);
 		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
 	}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 5a0f8f3..c5d14ea 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -137,13 +137,16 @@ static struct prop_descriptor vm_dirties;
  */
 static int calc_period_shift(void)
 {
+	struct dirty_param dirty_param;
 	unsigned long dirty_total;
 
-	if (vm_dirty_bytes)
-		dirty_total = vm_dirty_bytes / PAGE_SIZE;
+	get_dirty_param(&dirty_param);
+
+	if (dirty_param.dirty_bytes)
+		dirty_total = dirty_param.dirty_bytes / PAGE_SIZE;
 	else
-		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
-				100;
+		dirty_total = (dirty_param.dirty_ratio *
+				determine_dirtyable_memory()) / 100;
 	return 2 + ilog2(dirty_total - 1);
 }
 
@@ -408,41 +411,46 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
  */
 unsigned long determine_dirtyable_memory(void)
 {
-	unsigned long x;
-
-	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
+	unsigned long memory;
+	s64 memcg_memory;
 
+	memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
 	if (!vm_highmem_is_dirtyable)
-		x -= highmem_dirtyable_memory(x);
-
-	return x + 1;	/* Ensure that we never return 0 */
+		memory -= highmem_dirtyable_memory(memory);
+	if (mem_cgroup_has_dirty_limit())
+		return memory + 1;
+	memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
+	return min((unsigned long)memcg_memory, memory + 1);
 }
 
 void
 get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
 		 unsigned long *pbdi_dirty, struct backing_dev_info *bdi)
 {
-	unsigned long background;
-	unsigned long dirty;
+	unsigned long dirty, background;
 	unsigned long available_memory = determine_dirtyable_memory();
 	struct task_struct *tsk;
+	struct dirty_param dirty_param;
+
+	get_dirty_param(&dirty_param);
 
-	if (vm_dirty_bytes)
-		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
+	if (dirty_param.dirty_bytes)
+		dirty = DIV_ROUND_UP(dirty_param.dirty_bytes, PAGE_SIZE);
 	else {
 		int dirty_ratio;
 
-		dirty_ratio = vm_dirty_ratio;
+		dirty_ratio = dirty_param.dirty_ratio;
 		if (dirty_ratio < 5)
 			dirty_ratio = 5;
 		dirty = (dirty_ratio * available_memory) / 100;
 	}
 
-	if (dirty_background_bytes)
-		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
+	if (dirty_param.dirty_background_bytes)
+		background = DIV_ROUND_UP(dirty_param.dirty_background_bytes,
+						PAGE_SIZE);
 	else
-		background = (dirty_background_ratio * available_memory) / 100;
-
+		background = (dirty_param.dirty_background_ratio *
+						available_memory) / 100;
 	if (background >= dirty)
 		background = dirty / 2;
 	tsk = current;
@@ -508,9 +516,15 @@ static void balance_dirty_pages(struct address_space *mapping,
 		get_dirty_limits(&background_thresh, &dirty_thresh,
 				&bdi_thresh, bdi);
 
-		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
+		if (mem_cgroup_has_dirty_limit()) {
+			nr_reclaimable =
+				mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
+			nr_writeback = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
+		} else {
+			nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
-		nr_writeback = global_page_state(NR_WRITEBACK);
+			nr_writeback = global_page_state(NR_WRITEBACK);
+		}
 
 		bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY);
 		if (bdi_cap_account_unstable(bdi)) {
@@ -611,10 +625,13 @@ static void balance_dirty_pages(struct address_space *mapping,
 	 * In normal mode, we start background writeout at the lower
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
+	if (mem_cgroup_has_dirty_limit())
+		nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
+	else
+		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
+				global_page_state(NR_UNSTABLE_NFS);
 	if ((laptop_mode && pages_written) ||
-	    (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
-			       + global_page_state(NR_UNSTABLE_NFS))
-					  > background_thresh)))
+	    (!laptop_mode && (nr_reclaimable > background_thresh)))
 		bdi_start_writeback(bdi, NULL, 0);
 }
 
@@ -678,6 +695,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
 	unsigned long dirty_thresh;
 
         for ( ; ; ) {
+		unsigned long dirty;
+
 		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
 
                 /*
@@ -686,10 +705,15 @@ void throttle_vm_writeout(gfp_t gfp_mask)
                  */
                 dirty_thresh += dirty_thresh / 10;      /* wheeee... */
 
-                if (global_page_state(NR_UNSTABLE_NFS) +
-			global_page_state(NR_WRITEBACK) <= dirty_thresh)
-                        	break;
-                congestion_wait(BLK_RW_ASYNC, HZ/10);
+		if (mem_cgroup_has_dirty_limit())
+			dirty = mem_cgroup_page_stat(
+					MEMCG_NR_DIRTY_WRITEBACK_PAGES);
+		else
+			dirty = global_page_state(NR_UNSTABLE_NFS) +
+				global_page_state(NR_WRITEBACK);
+		if (dirty <= dirty_thresh)
+			break;
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/*
 		 * The caller might hold locks which can prevent IO completion
@@ -1096,6 +1120,7 @@ int __set_page_dirty_no_writeback(struct page *page)
 void account_page_dirtied(struct page *page, struct address_space *mapping)
 {
 	if (mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, 1);
 		__inc_zone_page_state(page, NR_FILE_DIRTY);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
 		task_dirty_inc(current);
@@ -1297,6 +1322,8 @@ int clear_page_dirty_for_io(struct page *page)
 		 * for more comments.
 		 */
 		if (TestClearPageDirty(page)) {
+			mem_cgroup_update_stat(page,
+					MEM_CGROUP_STAT_FILE_DIRTY, -1);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_DIRTY);
@@ -1332,8 +1359,10 @@ int test_clear_page_writeback(struct page *page)
 	} else {
 		ret = TestClearPageWriteback(page);
 	}
-	if (ret)
+	if (ret) {
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_WRITEBACK, -1);
 		dec_zone_page_state(page, NR_WRITEBACK);
+	}
 	return ret;
 }
 
@@ -1363,8 +1392,10 @@ int test_set_page_writeback(struct page *page)
 	} else {
 		ret = TestSetPageWriteback(page);
 	}
-	if (!ret)
+	if (!ret) {
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_WRITEBACK, 1);
 		inc_zone_page_state(page, NR_WRITEBACK);
+	}
 	return ret;
 
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index fcd593c..d47c257 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -829,7 +829,7 @@ void page_add_file_rmap(struct page *page)
 {
 	if (atomic_inc_and_test(&page->_mapcount)) {
 		__inc_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_update_file_mapped(page, 1);
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, 1);
 	}
 }
 
@@ -861,7 +861,7 @@ void page_remove_rmap(struct page *page)
 		__dec_zone_page_state(page, NR_ANON_PAGES);
 	} else {
 		__dec_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_update_file_mapped(page, -1);
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, -1);
 	}
 	/*
 	 * It would be tidy to reset the PageAnon mapping here,
diff --git a/mm/truncate.c b/mm/truncate.c
index 2466e0c..5f437e7 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -73,6 +73,8 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
 	if (TestClearPageDirty(page)) {
 		struct address_space *mapping = page->mapping;
 		if (mapping && mapping_cap_account_dirty(mapping)) {
+			mem_cgroup_update_stat(page,
+					MEM_CGROUP_STAT_FILE_DIRTY, -1);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_DIRTY);
-- 
1.6.3.3

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH -mmotm 4/4] memcg: dirty pages instrumentation
  2010-03-04 10:40 ` Andrea Righi
@ 2010-03-04 10:40   ` Andrea Righi
  -1 siblings, 0 replies; 68+ messages in thread
From: Andrea Righi @ 2010-03-04 10:40 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, Balbir Singh
  Cc: Vivek Goyal, Peter Zijlstra, Trond Myklebust, Suleiman Souhlal,
	Greg Thelen, Daisuke Nishimura, Kirill A. Shutemov,
	Andrew Morton, containers, linux-kernel, linux-mm, Andrea Righi

Apply the cgroup dirty page accounting and limiting infrastructure
to the appropriate kernel functions.

Signed-off-by: Andrea Righi <arighi@develer.com>
---
 fs/fuse/file.c      |    5 +++
 fs/nfs/write.c      |    4 ++
 fs/nilfs2/segment.c |   11 +++++-
 mm/filemap.c        |    1 +
 mm/page-writeback.c |   91 ++++++++++++++++++++++++++++++++++-----------------
 mm/rmap.c           |    4 +-
 mm/truncate.c       |    2 +
 7 files changed, 84 insertions(+), 34 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index a9f5e13..dbbdd53 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -11,6 +11,7 @@
 #include <linux/pagemap.h>
 #include <linux/slab.h>
 #include <linux/kernel.h>
+#include <linux/memcontrol.h>
 #include <linux/sched.h>
 #include <linux/module.h>
 
@@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
 
 	list_del(&req->writepages_entry);
 	dec_bdi_stat(bdi, BDI_WRITEBACK);
+	mem_cgroup_update_stat(req->pages[0],
+			MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
 	dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
 	bdi_writeout_inc(bdi);
 	wake_up(&fi->page_waitq);
@@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
 	req->inode = inode;
 
 	inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
+	mem_cgroup_update_stat(tmp_page,
+			MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
 	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
 	end_page_writeback(page);
 
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index b753242..7316f7a 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
 			req->wb_index,
 			NFS_PAGE_TAG_COMMIT);
 	spin_unlock(&inode->i_lock);
+	mem_cgroup_update_stat(req->wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1);
 	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
 	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
 	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
@@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
 	struct page *page = req->wb_page;
 
 	if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
 		dec_zone_page_state(page, NR_UNSTABLE_NFS);
 		dec_bdi_stat(page->mapping->backing_dev_info, BDI_UNSTABLE);
 		return 1;
@@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
 		req = nfs_list_entry(head->next);
 		nfs_list_remove_request(req);
 		nfs_mark_request_commit(req);
+		mem_cgroup_update_stat(req->wb_page,
+				MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
 		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
 				BDI_UNSTABLE);
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index ada2f1b..27a01b1 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -24,6 +24,7 @@
 #include <linux/pagemap.h>
 #include <linux/buffer_head.h>
 #include <linux/writeback.h>
+#include <linux/memcontrol.h>
 #include <linux/bio.h>
 #include <linux/completion.h>
 #include <linux/blkdev.h>
@@ -1660,8 +1661,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
 	} while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
 	kunmap_atomic(kaddr, KM_USER0);
 
-	if (!TestSetPageWriteback(clone_page))
+	if (!TestSetPageWriteback(clone_page)) {
+		mem_cgroup_update_stat(clone_page,
+				MEM_CGROUP_STAT_WRITEBACK, 1);
 		inc_zone_page_state(clone_page, NR_WRITEBACK);
+	}
 	unlock_page(clone_page);
 
 	return 0;
@@ -1783,8 +1787,11 @@ static void __nilfs_end_page_io(struct page *page, int err)
 	}
 
 	if (buffer_nilfs_allocated(page_buffers(page))) {
-		if (TestClearPageWriteback(page))
+		if (TestClearPageWriteback(page)) {
+			mem_cgroup_update_stat(page,
+					MEM_CGROUP_STAT_WRITEBACK, -1);
 			dec_zone_page_state(page, NR_WRITEBACK);
+		}
 	} else
 		end_page_writeback(page);
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index fe09e51..f85acae 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
 	 * having removed the page entirely.
 	 */
 	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
 		dec_zone_page_state(page, NR_FILE_DIRTY);
 		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
 	}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 5a0f8f3..c5d14ea 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -137,13 +137,16 @@ static struct prop_descriptor vm_dirties;
  */
 static int calc_period_shift(void)
 {
+	struct dirty_param dirty_param;
 	unsigned long dirty_total;
 
-	if (vm_dirty_bytes)
-		dirty_total = vm_dirty_bytes / PAGE_SIZE;
+	get_dirty_param(&dirty_param);
+
+	if (dirty_param.dirty_bytes)
+		dirty_total = dirty_param.dirty_bytes / PAGE_SIZE;
 	else
-		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
-				100;
+		dirty_total = (dirty_param.dirty_ratio *
+				determine_dirtyable_memory()) / 100;
 	return 2 + ilog2(dirty_total - 1);
 }
 
@@ -408,41 +411,46 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
  */
 unsigned long determine_dirtyable_memory(void)
 {
-	unsigned long x;
-
-	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
+	unsigned long memory;
+	s64 memcg_memory;
 
+	memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
 	if (!vm_highmem_is_dirtyable)
-		x -= highmem_dirtyable_memory(x);
-
-	return x + 1;	/* Ensure that we never return 0 */
+		memory -= highmem_dirtyable_memory(memory);
+	if (mem_cgroup_has_dirty_limit())
+		return memory + 1;
+	memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
+	return min((unsigned long)memcg_memory, memory + 1);
 }
 
 void
 get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
 		 unsigned long *pbdi_dirty, struct backing_dev_info *bdi)
 {
-	unsigned long background;
-	unsigned long dirty;
+	unsigned long dirty, background;
 	unsigned long available_memory = determine_dirtyable_memory();
 	struct task_struct *tsk;
+	struct dirty_param dirty_param;
+
+	get_dirty_param(&dirty_param);
 
-	if (vm_dirty_bytes)
-		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
+	if (dirty_param.dirty_bytes)
+		dirty = DIV_ROUND_UP(dirty_param.dirty_bytes, PAGE_SIZE);
 	else {
 		int dirty_ratio;
 
-		dirty_ratio = vm_dirty_ratio;
+		dirty_ratio = dirty_param.dirty_ratio;
 		if (dirty_ratio < 5)
 			dirty_ratio = 5;
 		dirty = (dirty_ratio * available_memory) / 100;
 	}
 
-	if (dirty_background_bytes)
-		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
+	if (dirty_param.dirty_background_bytes)
+		background = DIV_ROUND_UP(dirty_param.dirty_background_bytes,
+						PAGE_SIZE);
 	else
-		background = (dirty_background_ratio * available_memory) / 100;
-
+		background = (dirty_param.dirty_background_ratio *
+						available_memory) / 100;
 	if (background >= dirty)
 		background = dirty / 2;
 	tsk = current;
@@ -508,9 +516,15 @@ static void balance_dirty_pages(struct address_space *mapping,
 		get_dirty_limits(&background_thresh, &dirty_thresh,
 				&bdi_thresh, bdi);
 
-		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
+		if (mem_cgroup_has_dirty_limit()) {
+			nr_reclaimable =
+				mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
+			nr_writeback = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
+		} else {
+			nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
-		nr_writeback = global_page_state(NR_WRITEBACK);
+			nr_writeback = global_page_state(NR_WRITEBACK);
+		}
 
 		bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY);
 		if (bdi_cap_account_unstable(bdi)) {
@@ -611,10 +625,13 @@ static void balance_dirty_pages(struct address_space *mapping,
 	 * In normal mode, we start background writeout at the lower
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
+	if (mem_cgroup_has_dirty_limit())
+		nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
+	else
+		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
+				global_page_state(NR_UNSTABLE_NFS);
 	if ((laptop_mode && pages_written) ||
-	    (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
-			       + global_page_state(NR_UNSTABLE_NFS))
-					  > background_thresh)))
+	    (!laptop_mode && (nr_reclaimable > background_thresh)))
 		bdi_start_writeback(bdi, NULL, 0);
 }
 
@@ -678,6 +695,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
 	unsigned long dirty_thresh;
 
         for ( ; ; ) {
+		unsigned long dirty;
+
 		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
 
                 /*
@@ -686,10 +705,15 @@ void throttle_vm_writeout(gfp_t gfp_mask)
                  */
                 dirty_thresh += dirty_thresh / 10;      /* wheeee... */
 
-                if (global_page_state(NR_UNSTABLE_NFS) +
-			global_page_state(NR_WRITEBACK) <= dirty_thresh)
-                        	break;
-                congestion_wait(BLK_RW_ASYNC, HZ/10);
+		if (mem_cgroup_has_dirty_limit())
+			dirty = mem_cgroup_page_stat(
+					MEMCG_NR_DIRTY_WRITEBACK_PAGES);
+		else
+			dirty = global_page_state(NR_UNSTABLE_NFS) +
+				global_page_state(NR_WRITEBACK);
+		if (dirty <= dirty_thresh)
+			break;
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/*
 		 * The caller might hold locks which can prevent IO completion
@@ -1096,6 +1120,7 @@ int __set_page_dirty_no_writeback(struct page *page)
 void account_page_dirtied(struct page *page, struct address_space *mapping)
 {
 	if (mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, 1);
 		__inc_zone_page_state(page, NR_FILE_DIRTY);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
 		task_dirty_inc(current);
@@ -1297,6 +1322,8 @@ int clear_page_dirty_for_io(struct page *page)
 		 * for more comments.
 		 */
 		if (TestClearPageDirty(page)) {
+			mem_cgroup_update_stat(page,
+					MEM_CGROUP_STAT_FILE_DIRTY, -1);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_DIRTY);
@@ -1332,8 +1359,10 @@ int test_clear_page_writeback(struct page *page)
 	} else {
 		ret = TestClearPageWriteback(page);
 	}
-	if (ret)
+	if (ret) {
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_WRITEBACK, -1);
 		dec_zone_page_state(page, NR_WRITEBACK);
+	}
 	return ret;
 }
 
@@ -1363,8 +1392,10 @@ int test_set_page_writeback(struct page *page)
 	} else {
 		ret = TestSetPageWriteback(page);
 	}
-	if (!ret)
+	if (!ret) {
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_WRITEBACK, 1);
 		inc_zone_page_state(page, NR_WRITEBACK);
+	}
 	return ret;
 
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index fcd593c..d47c257 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -829,7 +829,7 @@ void page_add_file_rmap(struct page *page)
 {
 	if (atomic_inc_and_test(&page->_mapcount)) {
 		__inc_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_update_file_mapped(page, 1);
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, 1);
 	}
 }
 
@@ -861,7 +861,7 @@ void page_remove_rmap(struct page *page)
 		__dec_zone_page_state(page, NR_ANON_PAGES);
 	} else {
 		__dec_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_update_file_mapped(page, -1);
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, -1);
 	}
 	/*
 	 * It would be tidy to reset the PageAnon mapping here,
diff --git a/mm/truncate.c b/mm/truncate.c
index 2466e0c..5f437e7 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -73,6 +73,8 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
 	if (TestClearPageDirty(page)) {
 		struct address_space *mapping = page->mapping;
 		if (mapping && mapping_cap_account_dirty(mapping)) {
+			mem_cgroup_update_stat(page,
+					MEM_CGROUP_STAT_FILE_DIRTY, -1);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_DIRTY);
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH -mmotm 4/4] memcg: dirty pages instrumentation
@ 2010-03-04 10:40   ` Andrea Righi
  0 siblings, 0 replies; 68+ messages in thread
From: Andrea Righi @ 2010-03-04 10:40 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, Balbir Singh
  Cc: Vivek Goyal, Peter Zijlstra, Trond Myklebust, Suleiman Souhlal,
	Greg Thelen, Daisuke Nishimura, Kirill A. Shutemov,
	Andrew Morton, containers, linux-kernel, linux-mm, Andrea Righi

Apply the cgroup dirty pages accounting and limiting infrastructure
to the relevant kernel functions.

Signed-off-by: Andrea Righi <arighi@develer.com>
---
 fs/fuse/file.c      |    5 +++
 fs/nfs/write.c      |    4 ++
 fs/nilfs2/segment.c |   11 +++++-
 mm/filemap.c        |    1 +
 mm/page-writeback.c |   91 ++++++++++++++++++++++++++++++++++-----------------
 mm/rmap.c           |    4 +-
 mm/truncate.c       |    2 +
 7 files changed, 84 insertions(+), 34 deletions(-)
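
The accounting pattern applied at each call site below is, schematically
(an illustrative sketch only, not an additional hunk; mem_cgroup_update_stat()
and the MEM_CGROUP_STAT_* indexes come from patch 3/4 of this series):

        /* page becomes dirty: bump the memcg stat next to the zone/bdi stats */
        mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, 1);
        inc_zone_page_state(page, NR_FILE_DIRTY);
        inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);

        /* page is no longer dirty: symmetric decrement */
        mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
        dec_zone_page_state(page, NR_FILE_DIRTY);
        dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);

The same pairing is used for the writeback, writeback_temp and unstable
NFS counters in the hunks that follow.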

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index a9f5e13..dbbdd53 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -11,6 +11,7 @@
 #include <linux/pagemap.h>
 #include <linux/slab.h>
 #include <linux/kernel.h>
+#include <linux/memcontrol.h>
 #include <linux/sched.h>
 #include <linux/module.h>
 
@@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
 
 	list_del(&req->writepages_entry);
 	dec_bdi_stat(bdi, BDI_WRITEBACK);
+	mem_cgroup_update_stat(req->pages[0],
+			MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
 	dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
 	bdi_writeout_inc(bdi);
 	wake_up(&fi->page_waitq);
@@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
 	req->inode = inode;
 
 	inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
+	mem_cgroup_update_stat(tmp_page,
+			MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
 	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
 	end_page_writeback(page);
 
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index b753242..7316f7a 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
 			req->wb_index,
 			NFS_PAGE_TAG_COMMIT);
 	spin_unlock(&inode->i_lock);
+	mem_cgroup_update_stat(req->wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1);
 	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
 	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
 	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
@@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
 	struct page *page = req->wb_page;
 
 	if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
 		dec_zone_page_state(page, NR_UNSTABLE_NFS);
 		dec_bdi_stat(page->mapping->backing_dev_info, BDI_UNSTABLE);
 		return 1;
@@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
 		req = nfs_list_entry(head->next);
 		nfs_list_remove_request(req);
 		nfs_mark_request_commit(req);
+		mem_cgroup_update_stat(req->wb_page,
+				MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
 		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
 				BDI_UNSTABLE);
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index ada2f1b..27a01b1 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -24,6 +24,7 @@
 #include <linux/pagemap.h>
 #include <linux/buffer_head.h>
 #include <linux/writeback.h>
+#include <linux/memcontrol.h>
 #include <linux/bio.h>
 #include <linux/completion.h>
 #include <linux/blkdev.h>
@@ -1660,8 +1661,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
 	} while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
 	kunmap_atomic(kaddr, KM_USER0);
 
-	if (!TestSetPageWriteback(clone_page))
+	if (!TestSetPageWriteback(clone_page)) {
+		mem_cgroup_update_stat(clone_page,
+				MEM_CGROUP_STAT_WRITEBACK, 1);
 		inc_zone_page_state(clone_page, NR_WRITEBACK);
+	}
 	unlock_page(clone_page);
 
 	return 0;
@@ -1783,8 +1787,11 @@ static void __nilfs_end_page_io(struct page *page, int err)
 	}
 
 	if (buffer_nilfs_allocated(page_buffers(page))) {
-		if (TestClearPageWriteback(page))
+		if (TestClearPageWriteback(page)) {
+			mem_cgroup_update_stat(page,
+					MEM_CGROUP_STAT_WRITEBACK, -1);
 			dec_zone_page_state(page, NR_WRITEBACK);
+		}
 	} else
 		end_page_writeback(page);
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index fe09e51..f85acae 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
 	 * having removed the page entirely.
 	 */
 	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
 		dec_zone_page_state(page, NR_FILE_DIRTY);
 		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
 	}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 5a0f8f3..c5d14ea 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -137,13 +137,16 @@ static struct prop_descriptor vm_dirties;
  */
 static int calc_period_shift(void)
 {
+	struct dirty_param dirty_param;
 	unsigned long dirty_total;
 
-	if (vm_dirty_bytes)
-		dirty_total = vm_dirty_bytes / PAGE_SIZE;
+	get_dirty_param(&dirty_param);
+
+	if (dirty_param.dirty_bytes)
+		dirty_total = dirty_param.dirty_bytes / PAGE_SIZE;
 	else
-		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
-				100;
+		dirty_total = (dirty_param.dirty_ratio *
+				determine_dirtyable_memory()) / 100;
 	return 2 + ilog2(dirty_total - 1);
 }
 
@@ -408,41 +411,46 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
  */
 unsigned long determine_dirtyable_memory(void)
 {
-	unsigned long x;
-
-	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
+	unsigned long memory;
+	s64 memcg_memory;
 
+	memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
 	if (!vm_highmem_is_dirtyable)
-		x -= highmem_dirtyable_memory(x);
-
-	return x + 1;	/* Ensure that we never return 0 */
+		memory -= highmem_dirtyable_memory(memory);
+	if (mem_cgroup_has_dirty_limit())
+		return memory + 1;
+	memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
+	return min((unsigned long)memcg_memory, memory + 1);
 }
 
 void
 get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
 		 unsigned long *pbdi_dirty, struct backing_dev_info *bdi)
 {
-	unsigned long background;
-	unsigned long dirty;
+	unsigned long dirty, background;
 	unsigned long available_memory = determine_dirtyable_memory();
 	struct task_struct *tsk;
+	struct dirty_param dirty_param;
+
+	get_dirty_param(&dirty_param);
 
-	if (vm_dirty_bytes)
-		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
+	if (dirty_param.dirty_bytes)
+		dirty = DIV_ROUND_UP(dirty_param.dirty_bytes, PAGE_SIZE);
 	else {
 		int dirty_ratio;
 
-		dirty_ratio = vm_dirty_ratio;
+		dirty_ratio = dirty_param.dirty_ratio;
 		if (dirty_ratio < 5)
 			dirty_ratio = 5;
 		dirty = (dirty_ratio * available_memory) / 100;
 	}
 
-	if (dirty_background_bytes)
-		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
+	if (dirty_param.dirty_background_bytes)
+		background = DIV_ROUND_UP(dirty_param.dirty_background_bytes,
+						PAGE_SIZE);
 	else
-		background = (dirty_background_ratio * available_memory) / 100;
-
+		background = (dirty_param.dirty_background_ratio *
+						available_memory) / 100;
 	if (background >= dirty)
 		background = dirty / 2;
 	tsk = current;
@@ -508,9 +516,15 @@ static void balance_dirty_pages(struct address_space *mapping,
 		get_dirty_limits(&background_thresh, &dirty_thresh,
 				&bdi_thresh, bdi);
 
-		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
+		if (mem_cgroup_has_dirty_limit()) {
+			nr_reclaimable =
+				mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
+			nr_writeback = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
+		} else {
+			nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 					global_page_state(NR_UNSTABLE_NFS);
-		nr_writeback = global_page_state(NR_WRITEBACK);
+			nr_writeback = global_page_state(NR_WRITEBACK);
+		}
 
 		bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY);
 		if (bdi_cap_account_unstable(bdi)) {
@@ -611,10 +625,13 @@ static void balance_dirty_pages(struct address_space *mapping,
 	 * In normal mode, we start background writeout at the lower
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
+	if (mem_cgroup_has_dirty_limit())
+		nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
+	else
+		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
+				global_page_state(NR_UNSTABLE_NFS);
 	if ((laptop_mode && pages_written) ||
-	    (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
-			       + global_page_state(NR_UNSTABLE_NFS))
-					  > background_thresh)))
+	    (!laptop_mode && (nr_reclaimable > background_thresh)))
 		bdi_start_writeback(bdi, NULL, 0);
 }
 
@@ -678,6 +695,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
 	unsigned long dirty_thresh;
 
         for ( ; ; ) {
+		unsigned long dirty;
+
 		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
 
                 /*
@@ -686,10 +705,15 @@ void throttle_vm_writeout(gfp_t gfp_mask)
                  */
                 dirty_thresh += dirty_thresh / 10;      /* wheeee... */
 
-                if (global_page_state(NR_UNSTABLE_NFS) +
-			global_page_state(NR_WRITEBACK) <= dirty_thresh)
-                        	break;
-                congestion_wait(BLK_RW_ASYNC, HZ/10);
+		if (mem_cgroup_has_dirty_limit())
+			dirty = mem_cgroup_page_stat(
+					MEMCG_NR_DIRTY_WRITEBACK_PAGES);
+		else
+			dirty = global_page_state(NR_UNSTABLE_NFS) +
+				global_page_state(NR_WRITEBACK);
+		if (dirty <= dirty_thresh)
+			break;
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/*
 		 * The caller might hold locks which can prevent IO completion
@@ -1096,6 +1120,7 @@ int __set_page_dirty_no_writeback(struct page *page)
 void account_page_dirtied(struct page *page, struct address_space *mapping)
 {
 	if (mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, 1);
 		__inc_zone_page_state(page, NR_FILE_DIRTY);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
 		task_dirty_inc(current);
@@ -1297,6 +1322,8 @@ int clear_page_dirty_for_io(struct page *page)
 		 * for more comments.
 		 */
 		if (TestClearPageDirty(page)) {
+			mem_cgroup_update_stat(page,
+					MEM_CGROUP_STAT_FILE_DIRTY, -1);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_DIRTY);
@@ -1332,8 +1359,10 @@ int test_clear_page_writeback(struct page *page)
 	} else {
 		ret = TestClearPageWriteback(page);
 	}
-	if (ret)
+	if (ret) {
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_WRITEBACK, -1);
 		dec_zone_page_state(page, NR_WRITEBACK);
+	}
 	return ret;
 }
 
@@ -1363,8 +1392,10 @@ int test_set_page_writeback(struct page *page)
 	} else {
 		ret = TestSetPageWriteback(page);
 	}
-	if (!ret)
+	if (!ret) {
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_WRITEBACK, 1);
 		inc_zone_page_state(page, NR_WRITEBACK);
+	}
 	return ret;
 
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index fcd593c..d47c257 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -829,7 +829,7 @@ void page_add_file_rmap(struct page *page)
 {
 	if (atomic_inc_and_test(&page->_mapcount)) {
 		__inc_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_update_file_mapped(page, 1);
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, 1);
 	}
 }
 
@@ -861,7 +861,7 @@ void page_remove_rmap(struct page *page)
 		__dec_zone_page_state(page, NR_ANON_PAGES);
 	} else {
 		__dec_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_update_file_mapped(page, -1);
+		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, -1);
 	}
 	/*
 	 * It would be tidy to reset the PageAnon mapping here,
diff --git a/mm/truncate.c b/mm/truncate.c
index 2466e0c..5f437e7 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -73,6 +73,8 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
 	if (TestClearPageDirty(page)) {
 		struct address_space *mapping = page->mapping;
 		if (mapping && mapping_cap_account_dirty(mapping)) {
+			mem_cgroup_update_stat(page,
+					MEM_CGROUP_STAT_FILE_DIRTY, -1);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_DIRTY);
-- 
1.6.3.3

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure
       [not found]   ` <1267699215-4101-4-git-send-email-arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
@ 2010-03-04 11:54     ` Kirill A. Shutemov
  2010-03-05  1:12     ` Daisuke Nishimura
  1 sibling, 0 replies; 68+ messages in thread
From: Kirill A. Shutemov @ 2010-03-04 11:54 UTC (permalink / raw)
  To: Andrea Righi
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Trond Myklebust, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Suleiman Souhlal, Andrew Morton, Vivek Goyal, Balbir Singh

On Thu, Mar 4, 2010 at 12:40 PM, Andrea Righi <arighi@develer.com> wrote:
> Infrastructure to account dirty pages per cgroup and add dirty limit
> interfaces in the cgroupfs:
>
>  - Direct write-out: memory.dirty_ratio, memory.dirty_bytes
>
>  - Background write-out: memory.dirty_background_ratio, memory.dirty_background_bytes
>
> Signed-off-by: Andrea Righi <arighi@develer.com>
> ---
>  include/linux/memcontrol.h |   80 ++++++++-
>  mm/memcontrol.c            |  420 +++++++++++++++++++++++++++++++++++++++-----
>  2 files changed, 450 insertions(+), 50 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 1f9b119..cc3421b 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -19,12 +19,66 @@
>
>  #ifndef _LINUX_MEMCONTROL_H
>  #define _LINUX_MEMCONTROL_H
> +
> +#include <linux/writeback.h>
>  #include <linux/cgroup.h>
> +
>  struct mem_cgroup;
>  struct page_cgroup;
>  struct page;
>  struct mm_struct;
>
> +/* Cgroup memory statistics items exported to the kernel */
> +enum mem_cgroup_page_stat_item {
> +       MEMCG_NR_DIRTYABLE_PAGES,
> +       MEMCG_NR_RECLAIM_PAGES,
> +       MEMCG_NR_WRITEBACK,
> +       MEMCG_NR_DIRTY_WRITEBACK_PAGES,
> +};
> +
> +/* Dirty memory parameters */
> +struct dirty_param {
> +       int dirty_ratio;
> +       unsigned long dirty_bytes;
> +       int dirty_background_ratio;
> +       unsigned long dirty_background_bytes;
> +};
> +
> +/*
> + * Statistics for memory cgroup.
> + */
> +enum mem_cgroup_stat_index {
> +       /*
> +        * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> +        */
> +       MEM_CGROUP_STAT_CACHE,     /* # of pages charged as cache */
> +       MEM_CGROUP_STAT_RSS,       /* # of pages charged as anon rss */
> +       MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> +       MEM_CGROUP_STAT_PGPGIN_COUNT,   /* # of pages paged in */
> +       MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
> +       MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> +       MEM_CGROUP_EVENTS,      /* incremented at every  pagein/pageout */
> +       MEM_CGROUP_STAT_FILE_DIRTY,   /* # of dirty pages in page cache */
> +       MEM_CGROUP_STAT_WRITEBACK,   /* # of pages under writeback */
> +       MEM_CGROUP_STAT_WRITEBACK_TEMP,   /* # of pages under writeback using
> +                                               temporary buffers */
> +       MEM_CGROUP_STAT_UNSTABLE_NFS,   /* # of NFS unstable pages */
> +
> +       MEM_CGROUP_STAT_NSTATS,
> +};
> +
> +/*
> + * TODO: provide a validation check routine. And retry if validation
> + * fails.
> + */
> +static inline void get_global_dirty_param(struct dirty_param *param)
> +{
> +       param->dirty_ratio = vm_dirty_ratio;
> +       param->dirty_bytes = vm_dirty_bytes;
> +       param->dirty_background_ratio = dirty_background_ratio;
> +       param->dirty_background_bytes = dirty_background_bytes;
> +}
> +
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
>  /*
>  * All "charge" functions with gfp_mask should use GFP_KERNEL or
> @@ -117,6 +171,10 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
>  extern int do_swap_account;
>  #endif
>
> +extern bool mem_cgroup_has_dirty_limit(void);
> +extern void get_dirty_param(struct dirty_param *param);
> +extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item);
> +
>  static inline bool mem_cgroup_disabled(void)
>  {
>        if (mem_cgroup_subsys.disabled)
> @@ -125,7 +183,8 @@ static inline bool mem_cgroup_disabled(void)
>  }
>
>  extern bool mem_cgroup_oom_called(struct task_struct *task);
> -void mem_cgroup_update_file_mapped(struct page *page, int val);
> +void mem_cgroup_update_stat(struct page *page,
> +                       enum mem_cgroup_stat_index idx, int val);
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>                                                gfp_t gfp_mask, int nid,
>                                                int zid);
> @@ -300,8 +359,8 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
>  {
>  }
>
> -static inline void mem_cgroup_update_file_mapped(struct page *page,
> -                                                       int val)
> +static inline void mem_cgroup_update_stat(struct page *page,
> +                       enum mem_cgroup_stat_index idx, int val)
>  {
>  }
>
> @@ -312,6 +371,21 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>        return 0;
>  }
>
> +static inline bool mem_cgroup_has_dirty_limit(void)
> +{
> +       return false;
> +}
> +
> +static inline void get_dirty_param(struct dirty_param *param)
> +{
> +       get_global_dirty_param(param);
> +}
> +
> +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> +{
> +       return -ENOSYS;
> +}
> +
>  #endif /* CONFIG_CGROUP_MEM_CONT */
>
>  #endif /* _LINUX_MEMCONTROL_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 497b6f7..9842e7b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -73,28 +73,23 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
>  #define THRESHOLDS_EVENTS_THRESH (7) /* once in 128 */
>  #define SOFTLIMIT_EVENTS_THRESH (10) /* once in 1024 */
>
> -/*
> - * Statistics for memory cgroup.
> - */
> -enum mem_cgroup_stat_index {
> -       /*
> -        * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> -        */
> -       MEM_CGROUP_STAT_CACHE,     /* # of pages charged as cache */
> -       MEM_CGROUP_STAT_RSS,       /* # of pages charged as anon rss */
> -       MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> -       MEM_CGROUP_STAT_PGPGIN_COUNT,   /* # of pages paged in */
> -       MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
> -       MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> -       MEM_CGROUP_EVENTS,      /* incremented at every  pagein/pageout */
> -
> -       MEM_CGROUP_STAT_NSTATS,
> -};
> -
>  struct mem_cgroup_stat_cpu {
>        s64 count[MEM_CGROUP_STAT_NSTATS];
>  };
>
> +/* Per cgroup page statistics */
> +struct mem_cgroup_page_stat {
> +       enum mem_cgroup_page_stat_item item;
> +       s64 value;
> +};
> +
> +enum {
> +       MEM_CGROUP_DIRTY_RATIO,
> +       MEM_CGROUP_DIRTY_BYTES,
> +       MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> +       MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> +};
> +
>  /*
>  * per-zone information in memory controller.
>  */
> @@ -208,6 +203,9 @@ struct mem_cgroup {
>
>        unsigned int    swappiness;
>
> +       /* control memory cgroup dirty pages */
> +       struct dirty_param dirty_param;
> +
>        /* set when res.limit == memsw.limit */
>        bool            memsw_is_minimum;
>
> @@ -1033,6 +1031,156 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
>        return swappiness;
>  }
>
> +static bool dirty_param_is_valid(struct dirty_param *param)
> +{
> +       if (param->dirty_ratio && param->dirty_bytes)
> +               return false;
> +       if (param->dirty_background_ratio && param->dirty_background_bytes)
> +               return false;
> +       return true;
> +}
> +
> +static void
> +__mem_cgroup_get_dirty_param(struct dirty_param *param, struct mem_cgroup *mem)
> +{
> +       param->dirty_ratio = mem->dirty_param.dirty_ratio;
> +       param->dirty_bytes = mem->dirty_param.dirty_bytes;
> +       param->dirty_background_ratio = mem->dirty_param.dirty_background_ratio;
> +       param->dirty_background_bytes = mem->dirty_param.dirty_background_bytes;
> +}
> +
> +/*
> + * get_dirty_param() - get dirty memory parameters of the current memcg
> + * @param:     a structure is filled with the dirty memory settings
> + *
> + * The function fills @param with the current memcg dirty memory settings. If
> + * memory cgroup is disabled or in case of error the structure is filled with
> + * the global dirty memory settings.
> + */
> +void get_dirty_param(struct dirty_param *param)
> +{
> +       struct mem_cgroup *memcg;
> +
> +       if (mem_cgroup_disabled()) {
> +               get_global_dirty_param(param);
> +               return;
> +       }
> +       /*
> +        * It's possible that "current" may be moved to other cgroup while we
> +        * access cgroup. But precise check is meaningless because the task can
> +        * be moved after our access and writeback tends to take long time.
> +        * At least, "memcg" will not be freed under rcu_read_lock().
> +        */
> +       while (1) {
> +               rcu_read_lock();
> +               memcg = mem_cgroup_from_task(current);
> +               if (likely(memcg))
> +                       __mem_cgroup_get_dirty_param(param, memcg);
> +               else
> +                       get_global_dirty_param(param);
> +               rcu_read_unlock();
> +               /*
> +                * Since global and memcg dirty_param are not protected we try
> +                * to speculatively read them and retry if we get inconsistent
> +                * values.
> +                */
> +               if (likely(dirty_param_is_valid(param)))
> +                       break;
> +       }
> +}
> +
> +static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
> +{
> +       if (!do_swap_account)
> +               return nr_swap_pages > 0;
> +       return !memcg->memsw_is_minimum &&
> +               (res_counter_read_u64(&memcg->memsw, RES_LIMIT) > 0);
> +}
> +
> +static s64 mem_cgroup_get_local_page_stat(struct mem_cgroup *memcg,
> +                               enum mem_cgroup_page_stat_item item)
> +{
> +       s64 ret;
> +
> +       switch (item) {
> +       case MEMCG_NR_DIRTYABLE_PAGES:
> +               ret = res_counter_read_u64(&memcg->res, RES_LIMIT) -
> +                       res_counter_read_u64(&memcg->res, RES_USAGE);
> +               /* Translate free memory in pages */
> +               ret >>= PAGE_SHIFT;
> +               ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_FILE) +
> +                       mem_cgroup_read_stat(memcg, LRU_INACTIVE_FILE);
> +               if (mem_cgroup_can_swap(memcg))
> +                       ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_ANON) +
> +                               mem_cgroup_read_stat(memcg, LRU_INACTIVE_ANON);
> +               break;
> +       case MEMCG_NR_RECLAIM_PAGES:
> +               ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_FILE_DIRTY) +
> +                       mem_cgroup_read_stat(memcg,
> +                                       MEM_CGROUP_STAT_UNSTABLE_NFS);
> +               break;
> +       case MEMCG_NR_WRITEBACK:
> +               ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
> +               break;
> +       case MEMCG_NR_DIRTY_WRITEBACK_PAGES:
> +               ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK) +
> +                       mem_cgroup_read_stat(memcg,
> +                               MEM_CGROUP_STAT_UNSTABLE_NFS);
> +               break;
> +       default:
> +               BUG_ON(1);

Just BUG()?
And add 'break;', please.
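
That is, something along the lines of:

default:
        BUG();
        break;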

> +       }
> +       return ret;
> +}
> +
> +static int mem_cgroup_page_stat_cb(struct mem_cgroup *mem, void *data)
> +{
> +       struct mem_cgroup_page_stat *stat = (struct mem_cgroup_page_stat *)data;
> +
> +       stat->value += mem_cgroup_get_local_page_stat(mem, stat->item);
> +       return 0;
> +}
> +
> +/*
> + * mem_cgroup_has_dirty_limit() - check if current memcg has local dirty limits
> + *
> + * Return true if the current memory cgroup has local dirty memory settings,
> + * false otherwise.
> + */
> +bool mem_cgroup_has_dirty_limit(void)
> +{
> +       if (mem_cgroup_disabled())
> +               return false;
> +       return mem_cgroup_from_task(current) != NULL;
> +}
> +
> +/*
> + * mem_cgroup_page_stat() - get memory cgroup file cache statistics
> + * @item:      memory statistic item exported to the kernel
> + *
> + * Return the accounted statistic value, or a negative value in case of error.
> + */
> +s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> +{
> +       struct mem_cgroup_page_stat stat = {};
> +       struct mem_cgroup *memcg;
> +
> +       rcu_read_lock();
> +       memcg = mem_cgroup_from_task(current);
> +       if (memcg) {
> +               /*
> +                * Recursively evaluate page statistics against all cgroups
> +                * under the hierarchy tree
> +                */
> +               stat.item = item;
> +               mem_cgroup_walk_tree(memcg, &stat, mem_cgroup_page_stat_cb);
> +       } else
> +               stat.value = -EINVAL;
> +       rcu_read_unlock();
> +
> +       return stat.value;
> +}
> +
>  static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data)
>  {
>        int *val = data;
> @@ -1275,34 +1423,70 @@ static void record_last_oom(struct mem_cgroup *mem)
>  }
>
>  /*
> - * Currently used to update mapped file statistics, but the routine can be
> - * generalized to update other statistics as well.
> + * Generalized routine to update file cache's status for memcg.
> + *
> + * Before calling this, mapping->tree_lock should be held and preemption is
> + * disabled.  Then, it's guaranteed that the page is not uncharged while we
> + * access page_cgroup. We can make use of that.
>  */
> -void mem_cgroup_update_file_mapped(struct page *page, int val)
> +void mem_cgroup_update_stat(struct page *page,
> +                       enum mem_cgroup_stat_index idx, int val)
>  {
>        struct mem_cgroup *mem;
>        struct page_cgroup *pc;
>
> +       if (mem_cgroup_disabled())
> +               return;
>        pc = lookup_page_cgroup(page);
> -       if (unlikely(!pc))
> +       if (unlikely(!pc) || !PageCgroupUsed(pc))
>                return;
>
> -       lock_page_cgroup(pc);
> -       mem = pc->mem_cgroup;
> -       if (!mem)
> -               goto done;
> -
> -       if (!PageCgroupUsed(pc))
> -               goto done;
> -
> +       lock_page_cgroup_migrate(pc);
>        /*
> -        * Preemption is already disabled. We can use __this_cpu_xxx
> -        */
> -       __this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], val);
> -
> -done:
> -       unlock_page_cgroup(pc);
> +       * It's guaranteed that this page is never uncharged.
> +       * The only racy problem is moving account among memcgs.
> +       */
> +       switch (idx) {
> +       case MEM_CGROUP_STAT_FILE_MAPPED:
> +               if (val > 0)
> +                       SetPageCgroupFileMapped(pc);
> +               else
> +                       ClearPageCgroupFileMapped(pc);
> +               break;
> +       case MEM_CGROUP_STAT_FILE_DIRTY:
> +               if (val > 0)
> +                       SetPageCgroupDirty(pc);
> +               else
> +                       ClearPageCgroupDirty(pc);
> +               break;
> +       case MEM_CGROUP_STAT_WRITEBACK:
> +               if (val > 0)
> +                       SetPageCgroupWriteback(pc);
> +               else
> +                       ClearPageCgroupWriteback(pc);
> +               break;
> +       case MEM_CGROUP_STAT_WRITEBACK_TEMP:
> +               if (val > 0)
> +                       SetPageCgroupWritebackTemp(pc);
> +               else
> +                       ClearPageCgroupWritebackTemp(pc);
> +               break;
> +       case MEM_CGROUP_STAT_UNSTABLE_NFS:
> +               if (val > 0)
> +                       SetPageCgroupUnstableNFS(pc);
> +               else
> +                       ClearPageCgroupUnstableNFS(pc);
> +               break;
> +       default:
> +               BUG();
> +               break;
> +       }
> +       mem = pc->mem_cgroup;
> +       if (likely(mem))
> +               __this_cpu_add(mem->stat->count[idx], val);
> +       unlock_page_cgroup_migrate(pc);
>  }
> +EXPORT_SYMBOL_GPL(mem_cgroup_update_stat);
>
>  /*
>  * size of first charge trial. "32" comes from vmscan.c's magic value.
> @@ -1701,6 +1885,45 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
>        memcg_check_events(mem, pc->page);
>  }
>
> +/*
> + * Update file cache accounted statistics on task migration.
> + *
> + * TODO: We don't move charges of file (including shmem/tmpfs) pages for now.
> + * So, at the moment this function simply returns without updating accounted
> + * statistics, because we deal only with anonymous pages here.
> + */
> +static void __mem_cgroup_update_file_stat(struct page_cgroup *pc,
> +       struct mem_cgroup *from, struct mem_cgroup *to)
> +{
> +       struct page *page = pc->page;
> +
> +       if (!page_mapped(page) || PageAnon(page))
> +               return;
> +
> +       if (PageCgroupFileMapped(pc)) {
> +               __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> +               __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> +       }
> +       if (PageCgroupDirty(pc)) {
> +               __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_DIRTY]);
> +               __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_DIRTY]);
> +       }
> +       if (PageCgroupWriteback(pc)) {
> +               __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_WRITEBACK]);
> +               __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WRITEBACK]);
> +       }
> +       if (PageCgroupWritebackTemp(pc)) {
> +               __this_cpu_dec(
> +                       from->stat->count[MEM_CGROUP_STAT_WRITEBACK_TEMP]);
> +               __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WRITEBACK_TEMP]);
> +       }
> +       if (PageCgroupUnstableNFS(pc)) {
> +               __this_cpu_dec(
> +                       from->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
> +               __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
> +       }
> +}
> +
>  /**
>  * __mem_cgroup_move_account - move account of the page
>  * @pc:        page_cgroup of the page.
> @@ -1721,22 +1944,16 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
>  static void __mem_cgroup_move_account(struct page_cgroup *pc,
>        struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
>  {
> -       struct page *page;
> -
>        VM_BUG_ON(from == to);
>        VM_BUG_ON(PageLRU(pc->page));
>        VM_BUG_ON(!PageCgroupLocked(pc));
>        VM_BUG_ON(!PageCgroupUsed(pc));
>        VM_BUG_ON(pc->mem_cgroup != from);
>
> -       page = pc->page;
> -       if (page_mapped(page) && !PageAnon(page)) {
> -               /* Update mapped_file data for mem_cgroup */
> -               preempt_disable();
> -               __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -               __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -               preempt_enable();
> -       }
> +       preempt_disable();
> +       lock_page_cgroup_migrate(pc);
> +       __mem_cgroup_update_file_stat(pc, from, to);
> +
>        mem_cgroup_charge_statistics(from, pc, false);
>        if (uncharge)
>                /* This is not "cancel", but cancel_charge does all we need. */
> @@ -1745,6 +1962,8 @@ static void __mem_cgroup_move_account(struct page_cgroup *pc,
>        /* caller should have done css_get */
>        pc->mem_cgroup = to;
>        mem_cgroup_charge_statistics(to, pc, true);
> +       unlock_page_cgroup_migrate(pc);
> +       preempt_enable();
>        /*
>         * We charges against "to" which may not have any tasks. Then, "to"
>         * can be under rmdir(). But in current implementation, caller of
> @@ -3042,6 +3261,10 @@ enum {
>        MCS_PGPGIN,
>        MCS_PGPGOUT,
>        MCS_SWAP,
> +       MCS_FILE_DIRTY,
> +       MCS_WRITEBACK,
> +       MCS_WRITEBACK_TEMP,
> +       MCS_UNSTABLE_NFS,
>        MCS_INACTIVE_ANON,
>        MCS_ACTIVE_ANON,
>        MCS_INACTIVE_FILE,
> @@ -3064,6 +3287,10 @@ struct {
>        {"pgpgin", "total_pgpgin"},
>        {"pgpgout", "total_pgpgout"},
>        {"swap", "total_swap"},
> +       {"filedirty", "dirty_pages"},
> +       {"writeback", "writeback_pages"},
> +       {"writeback_tmp", "writeback_temp_pages"},
> +       {"nfs", "nfs_unstable"},
>        {"inactive_anon", "total_inactive_anon"},
>        {"active_anon", "total_active_anon"},
>        {"inactive_file", "total_inactive_file"},
> @@ -3092,6 +3319,14 @@ static int mem_cgroup_get_local_stat(struct mem_cgroup *mem, void *data)
>                val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
>                s->stat[MCS_SWAP] += val * PAGE_SIZE;
>        }
> +       val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
> +       s->stat[MCS_FILE_DIRTY] += val;
> +       val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK);
> +       s->stat[MCS_WRITEBACK] += val;
> +       val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK_TEMP);
> +       s->stat[MCS_WRITEBACK_TEMP] += val;
> +       val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_UNSTABLE_NFS);
> +       s->stat[MCS_UNSTABLE_NFS] += val;
>
>        /* per zone stat */
>        val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
> @@ -3453,6 +3688,60 @@ unlock:
>        return ret;
>  }
>
> +static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
> +{
> +       struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +
> +       switch (cft->private) {
> +       case MEM_CGROUP_DIRTY_RATIO:
> +               return memcg->dirty_param.dirty_ratio;
> +       case MEM_CGROUP_DIRTY_BYTES:
> +               return memcg->dirty_param.dirty_bytes;
> +       case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
> +               return memcg->dirty_param.dirty_background_ratio;
> +       case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
> +               return memcg->dirty_param.dirty_background_bytes;
> +       default:
> +               BUG();
> +       }
> +}
> +
> +static int
> +mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
> +{
> +       struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +       int type = cft->private;
> +
> +       if (cgrp->parent == NULL)
> +               return -EINVAL;
> +       if ((type == MEM_CGROUP_DIRTY_RATIO ||
> +               type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)
> +               return -EINVAL;
> +       /*
> +        * TODO: provide a validation check routine. And retry if validation
> +        * fails.
> +        */
> +       switch (type) {
> +       case MEM_CGROUP_DIRTY_RATIO:
> +               memcg->dirty_param.dirty_ratio = val;
> +               memcg->dirty_param.dirty_bytes = 0;
> +               break;
> +       case MEM_CGROUP_DIRTY_BYTES:
> +               memcg->dirty_param.dirty_ratio  = 0;
> +               memcg->dirty_param.dirty_bytes = val;
> +               break;
> +       case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
> +               memcg->dirty_param.dirty_background_ratio = val;
> +               memcg->dirty_param.dirty_background_bytes = 0;
> +               break;
> +       case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
> +               memcg->dirty_param.dirty_background_ratio = 0;
> +               memcg->dirty_param.dirty_background_bytes = val;
> +               break;

default:
        BUG();
        break;

> +       }
> +       return 0;
> +}
> +
>  static struct cftype mem_cgroup_files[] = {
>        {
>                .name = "usage_in_bytes",
> @@ -3504,6 +3793,30 @@ static struct cftype mem_cgroup_files[] = {
>                .write_u64 = mem_cgroup_swappiness_write,
>        },
>        {
> +               .name = "dirty_ratio",
> +               .read_u64 = mem_cgroup_dirty_read,
> +               .write_u64 = mem_cgroup_dirty_write,
> +               .private = MEM_CGROUP_DIRTY_RATIO,
> +       },
> +       {
> +               .name = "dirty_bytes",
> +               .read_u64 = mem_cgroup_dirty_read,
> +               .write_u64 = mem_cgroup_dirty_write,
> +               .private = MEM_CGROUP_DIRTY_BYTES,
> +       },
> +       {
> +               .name = "dirty_background_ratio",
> +               .read_u64 = mem_cgroup_dirty_read,
> +               .write_u64 = mem_cgroup_dirty_write,
> +               .private = MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> +       },
> +       {
> +               .name = "dirty_background_bytes",
> +               .read_u64 = mem_cgroup_dirty_read,
> +               .write_u64 = mem_cgroup_dirty_write,
> +               .private = MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> +       },
> +       {
>                .name = "move_charge_at_immigrate",
>                .read_u64 = mem_cgroup_move_charge_read,
>                .write_u64 = mem_cgroup_move_charge_write,
> @@ -3762,8 +4075,21 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>        mem->last_scanned_child = 0;
>        spin_lock_init(&mem->reclaim_param_lock);
>
> -       if (parent)
> +       if (parent) {
>                mem->swappiness = get_swappiness(parent);
> +               mem->dirty_param = parent->dirty_param;
> +       } else {
> +               while (1) {
> +                       get_global_dirty_param(&mem->dirty_param);
> +                       /*
> +                        * Since global dirty parameters are not protected we
> +                        * try to speculatively read them and retry if we get
> +                        * inconsistent values.
> +                        */
> +                       if (likely(dirty_param_is_valid(&mem->dirty_param)))
> +                               break;
> +               }
> +       }
>        atomic_set(&mem->refcnt, 1);
>        mem->move_charge_at_immigrate = 0;
>        mutex_init(&mem->thresholds_lock);
> --
> 1.6.3.3
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting  infrastructure
  2010-03-04 10:40   ` Andrea Righi
@ 2010-03-04 11:54     ` Kirill A. Shutemov
  -1 siblings, 0 replies; 68+ messages in thread
From: Kirill A. Shutemov @ 2010-03-04 11:54 UTC (permalink / raw)
  To: Andrea Righi
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Vivek Goyal, Peter Zijlstra,
	Trond Myklebust, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Andrew Morton, containers, linux-kernel,
	linux-mm

On Thu, Mar 4, 2010 at 12:40 PM, Andrea Righi <arighi@develer.com> wrote:
> Infrastructure to account dirty pages per cgroup and add dirty limit
> interfaces in the cgroupfs:
>
>  - Direct write-out: memory.dirty_ratio, memory.dirty_bytes
>
>  - Background write-out: memory.dirty_background_ratio, memory.dirty_background_bytes
>
> Signed-off-by: Andrea Righi <arighi@develer.com>
> ---
>  include/linux/memcontrol.h |   80 ++++++++-
>  mm/memcontrol.c            |  420 +++++++++++++++++++++++++++++++++++++++-----
>  2 files changed, 450 insertions(+), 50 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 1f9b119..cc3421b 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -19,12 +19,66 @@
>
>  #ifndef _LINUX_MEMCONTROL_H
>  #define _LINUX_MEMCONTROL_H
> +
> +#include <linux/writeback.h>
>  #include <linux/cgroup.h>
> +
>  struct mem_cgroup;
>  struct page_cgroup;
>  struct page;
>  struct mm_struct;
>
> +/* Cgroup memory statistics items exported to the kernel */
> +enum mem_cgroup_page_stat_item {
> +       MEMCG_NR_DIRTYABLE_PAGES,
> +       MEMCG_NR_RECLAIM_PAGES,
> +       MEMCG_NR_WRITEBACK,
> +       MEMCG_NR_DIRTY_WRITEBACK_PAGES,
> +};
> +
> +/* Dirty memory parameters */
> +struct dirty_param {
> +       int dirty_ratio;
> +       unsigned long dirty_bytes;
> +       int dirty_background_ratio;
> +       unsigned long dirty_background_bytes;
> +};
> +
> +/*
> + * Statistics for memory cgroup.
> + */
> +enum mem_cgroup_stat_index {
> +       /*
> +        * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> +        */
> +       MEM_CGROUP_STAT_CACHE,     /* # of pages charged as cache */
> +       MEM_CGROUP_STAT_RSS,       /* # of pages charged as anon rss */
> +       MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> +       MEM_CGROUP_STAT_PGPGIN_COUNT,   /* # of pages paged in */
> +       MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
> +       MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> +       MEM_CGROUP_EVENTS,      /* incremented at every  pagein/pageout */
> +       MEM_CGROUP_STAT_FILE_DIRTY,   /* # of dirty pages in page cache */
> +       MEM_CGROUP_STAT_WRITEBACK,   /* # of pages under writeback */
> +       MEM_CGROUP_STAT_WRITEBACK_TEMP,   /* # of pages under writeback using
> +                                               temporary buffers */
> +       MEM_CGROUP_STAT_UNSTABLE_NFS,   /* # of NFS unstable pages */
> +
> +       MEM_CGROUP_STAT_NSTATS,
> +};
> +
> +/*
> + * TODO: provide a validation check routine. And retry if validation
> + * fails.
> + */
> +static inline void get_global_dirty_param(struct dirty_param *param)
> +{
> +       param->dirty_ratio = vm_dirty_ratio;
> +       param->dirty_bytes = vm_dirty_bytes;
> +       param->dirty_background_ratio = dirty_background_ratio;
> +       param->dirty_background_bytes = dirty_background_bytes;
> +}
> +
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
>  /*
>  * All "charge" functions with gfp_mask should use GFP_KERNEL or
> @@ -117,6 +171,10 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
>  extern int do_swap_account;
>  #endif
>
> +extern bool mem_cgroup_has_dirty_limit(void);
> +extern void get_dirty_param(struct dirty_param *param);
> +extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item);
> +
>  static inline bool mem_cgroup_disabled(void)
>  {
>        if (mem_cgroup_subsys.disabled)
> @@ -125,7 +183,8 @@ static inline bool mem_cgroup_disabled(void)
>  }
>
>  extern bool mem_cgroup_oom_called(struct task_struct *task);
> -void mem_cgroup_update_file_mapped(struct page *page, int val);
> +void mem_cgroup_update_stat(struct page *page,
> +                       enum mem_cgroup_stat_index idx, int val);
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>                                                gfp_t gfp_mask, int nid,
>                                                int zid);
> @@ -300,8 +359,8 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
>  {
>  }
>
> -static inline void mem_cgroup_update_file_mapped(struct page *page,
> -                                                       int val)
> +static inline void mem_cgroup_update_stat(struct page *page,
> +                       enum mem_cgroup_stat_index idx, int val)
>  {
>  }
>
> @@ -312,6 +371,21 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>        return 0;
>  }
>
> +static inline bool mem_cgroup_has_dirty_limit(void)
> +{
> +       return false;
> +}
> +
> +static inline void get_dirty_param(struct dirty_param *param)
> +{
> +       get_global_dirty_param(param);
> +}
> +
> +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> +{
> +       return -ENOSYS;
> +}
> +
>  #endif /* CONFIG_CGROUP_MEM_CONT */
>
>  #endif /* _LINUX_MEMCONTROL_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 497b6f7..9842e7b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -73,28 +73,23 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
>  #define THRESHOLDS_EVENTS_THRESH (7) /* once in 128 */
>  #define SOFTLIMIT_EVENTS_THRESH (10) /* once in 1024 */
>
> -/*
> - * Statistics for memory cgroup.
> - */
> -enum mem_cgroup_stat_index {
> -       /*
> -        * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> -        */
> -       MEM_CGROUP_STAT_CACHE,     /* # of pages charged as cache */
> -       MEM_CGROUP_STAT_RSS,       /* # of pages charged as anon rss */
> -       MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> -       MEM_CGROUP_STAT_PGPGIN_COUNT,   /* # of pages paged in */
> -       MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
> -       MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> -       MEM_CGROUP_EVENTS,      /* incremented at every  pagein/pageout */
> -
> -       MEM_CGROUP_STAT_NSTATS,
> -};
> -
>  struct mem_cgroup_stat_cpu {
>        s64 count[MEM_CGROUP_STAT_NSTATS];
>  };
>
> +/* Per cgroup page statistics */
> +struct mem_cgroup_page_stat {
> +       enum mem_cgroup_page_stat_item item;
> +       s64 value;
> +};
> +
> +enum {
> +       MEM_CGROUP_DIRTY_RATIO,
> +       MEM_CGROUP_DIRTY_BYTES,
> +       MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> +       MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> +};
> +
>  /*
>  * per-zone information in memory controller.
>  */
> @@ -208,6 +203,9 @@ struct mem_cgroup {
>
>        unsigned int    swappiness;
>
> +       /* control memory cgroup dirty pages */
> +       struct dirty_param dirty_param;
> +
>        /* set when res.limit == memsw.limit */
>        bool            memsw_is_minimum;
>
> @@ -1033,6 +1031,156 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
>        return swappiness;
>  }
>
> +static bool dirty_param_is_valid(struct dirty_param *param)
> +{
> +       if (param->dirty_ratio && param->dirty_bytes)
> +               return false;
> +       if (param->dirty_background_ratio && param->dirty_background_bytes)
> +               return false;
> +       return true;
> +}
> +
> +static void
> +__mem_cgroup_get_dirty_param(struct dirty_param *param, struct mem_cgroup *mem)
> +{
> +       param->dirty_ratio = mem->dirty_param.dirty_ratio;
> +       param->dirty_bytes = mem->dirty_param.dirty_bytes;
> +       param->dirty_background_ratio = mem->dirty_param.dirty_background_ratio;
> +       param->dirty_background_bytes = mem->dirty_param.dirty_background_bytes;
> +}
> +
> +/*
> + * get_dirty_param() - get dirty memory parameters of the current memcg
> + * @param:     a structure is filled with the dirty memory settings
> + *
> + * The function fills @param with the current memcg dirty memory settings. If
> + * memory cgroup is disabled or in case of error the structure is filled with
> + * the global dirty memory settings.
> + */
> +void get_dirty_param(struct dirty_param *param)
> +{
> +       struct mem_cgroup *memcg;
> +
> +       if (mem_cgroup_disabled()) {
> +               get_global_dirty_param(param);
> +               return;
> +       }
> +       /*
> +        * It's possible that "current" may be moved to other cgroup while we
> +        * access cgroup. But precise check is meaningless because the task can
> +        * be moved after our access and writeback tends to take long time.
> +        * At least, "memcg" will not be freed under rcu_read_lock().
> +        */
> +       while (1) {
> +               rcu_read_lock();
> +               memcg = mem_cgroup_from_task(current);
> +               if (likely(memcg))
> +                       __mem_cgroup_get_dirty_param(param, memcg);
> +               else
> +                       get_global_dirty_param(param);
> +               rcu_read_unlock();
> +               /*
> +                * Since global and memcg dirty_param are not protected we try
> +                * to speculatively read them and retry if we get inconsistent
> +                * values.
> +                */
> +               if (likely(dirty_param_is_valid(param)))
> +                       break;
> +       }
> +}
> +
> +static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
> +{
> +       if (!do_swap_account)
> +               return nr_swap_pages > 0;
> +       return !memcg->memsw_is_minimum &&
> +               (res_counter_read_u64(&memcg->memsw, RES_LIMIT) > 0);
> +}
> +
> +static s64 mem_cgroup_get_local_page_stat(struct mem_cgroup *memcg,
> +                               enum mem_cgroup_page_stat_item item)
> +{
> +       s64 ret;
> +
> +       switch (item) {
> +       case MEMCG_NR_DIRTYABLE_PAGES:
> +               ret = res_counter_read_u64(&memcg->res, RES_LIMIT) -
> +                       res_counter_read_u64(&memcg->res, RES_USAGE);
> +               /* Translate free memory in pages */
> +               ret >>= PAGE_SHIFT;
> +               ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_FILE) +
> +                       mem_cgroup_read_stat(memcg, LRU_INACTIVE_FILE);
> +               if (mem_cgroup_can_swap(memcg))
> +                       ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_ANON) +
> +                               mem_cgroup_read_stat(memcg, LRU_INACTIVE_ANON);
> +               break;
> +       case MEMCG_NR_RECLAIM_PAGES:
> +               ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_FILE_DIRTY) +
> +                       mem_cgroup_read_stat(memcg,
> +                                       MEM_CGROUP_STAT_UNSTABLE_NFS);
> +               break;
> +       case MEMCG_NR_WRITEBACK:
> +               ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
> +               break;
> +       case MEMCG_NR_DIRTY_WRITEBACK_PAGES:
> +               ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK) +
> +                       mem_cgroup_read_stat(memcg,
> +                               MEM_CGROUP_STAT_UNSTABLE_NFS);
> +               break;
> +       default:
> +               BUG_ON(1);

Just BUG()?
And add 'break;', please.
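
That is, something along the lines of:

default:
        BUG();
        break;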

> +       }
> +       return ret;
> +}
> +
> +static int mem_cgroup_page_stat_cb(struct mem_cgroup *mem, void *data)
> +{
> +       struct mem_cgroup_page_stat *stat = (struct mem_cgroup_page_stat *)data;
> +
> +       stat->value += mem_cgroup_get_local_page_stat(mem, stat->item);
> +       return 0;
> +}
> +
> +/*
> + * mem_cgroup_has_dirty_limit() - check if current memcg has local dirty limits
> + *
> + * Return true if the current memory cgroup has local dirty memory settings,
> + * false otherwise.
> + */
> +bool mem_cgroup_has_dirty_limit(void)
> +{
> +       if (mem_cgroup_disabled())
> +               return false;
> +       return mem_cgroup_from_task(current) != NULL;
> +}
> +
> +/*
> + * mem_cgroup_page_stat() - get memory cgroup file cache statistics
> + * @item:      memory statistic item exported to the kernel
> + *
> + * Return the accounted statistic value, or a negative value in case of error.
> + */
> +s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> +{
> +       struct mem_cgroup_page_stat stat = {};
> +       struct mem_cgroup *memcg;
> +
> +       rcu_read_lock();
> +       memcg = mem_cgroup_from_task(current);
> +       if (memcg) {
> +               /*
> +                * Recursively evaluate page statistics against all cgroups
> +                * under the hierarchy tree
> +                */
> +               stat.item = item;
> +               mem_cgroup_walk_tree(memcg, &stat, mem_cgroup_page_stat_cb);
> +       } else
> +               stat.value = -EINVAL;
> +       rcu_read_unlock();
> +
> +       return stat.value;
> +}
> +
>  static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data)
>  {
>        int *val = data;
> @@ -1275,34 +1423,70 @@ static void record_last_oom(struct mem_cgroup *mem)
>  }
>
>  /*
> - * Currently used to update mapped file statistics, but the routine can be
> - * generalized to update other statistics as well.
> + * Generalized routine to update file cache's status for memcg.
> + *
> + * Before calling this, mapping->tree_lock should be held and preemption is
> + * disabled.  Then, it's guaranteed that the page is not uncharged while we
> + * access page_cgroup. We can make use of that.
>  */
> -void mem_cgroup_update_file_mapped(struct page *page, int val)
> +void mem_cgroup_update_stat(struct page *page,
> +                       enum mem_cgroup_stat_index idx, int val)
>  {
>        struct mem_cgroup *mem;
>        struct page_cgroup *pc;
>
> +       if (mem_cgroup_disabled())
> +               return;
>        pc = lookup_page_cgroup(page);
> -       if (unlikely(!pc))
> +       if (unlikely(!pc) || !PageCgroupUsed(pc))
>                return;
>
> -       lock_page_cgroup(pc);
> -       mem = pc->mem_cgroup;
> -       if (!mem)
> -               goto done;
> -
> -       if (!PageCgroupUsed(pc))
> -               goto done;
> -
> +       lock_page_cgroup_migrate(pc);
>        /*
> -        * Preemption is already disabled. We can use __this_cpu_xxx
> -        */
> -       __this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], val);
> -
> -done:
> -       unlock_page_cgroup(pc);
> +       * It's guaranteed that this page is never uncharged.
> +       * The only racy problem is moving account among memcgs.
> +       */
> +       switch (idx) {
> +       case MEM_CGROUP_STAT_FILE_MAPPED:
> +               if (val > 0)
> +                       SetPageCgroupFileMapped(pc);
> +               else
> +                       ClearPageCgroupFileMapped(pc);
> +               break;
> +       case MEM_CGROUP_STAT_FILE_DIRTY:
> +               if (val > 0)
> +                       SetPageCgroupDirty(pc);
> +               else
> +                       ClearPageCgroupDirty(pc);
> +               break;
> +       case MEM_CGROUP_STAT_WRITEBACK:
> +               if (val > 0)
> +                       SetPageCgroupWriteback(pc);
> +               else
> +                       ClearPageCgroupWriteback(pc);
> +               break;
> +       case MEM_CGROUP_STAT_WRITEBACK_TEMP:
> +               if (val > 0)
> +                       SetPageCgroupWritebackTemp(pc);
> +               else
> +                       ClearPageCgroupWritebackTemp(pc);
> +               break;
> +       case MEM_CGROUP_STAT_UNSTABLE_NFS:
> +               if (val > 0)
> +                       SetPageCgroupUnstableNFS(pc);
> +               else
> +                       ClearPageCgroupUnstableNFS(pc);
> +               break;
> +       default:
> +               BUG();
> +               break;
> +       }
> +       mem = pc->mem_cgroup;
> +       if (likely(mem))
> +               __this_cpu_add(mem->stat->count[idx], val);
> +       unlock_page_cgroup_migrate(pc);
>  }
> +EXPORT_SYMBOL_GPL(mem_cgroup_update_stat);
>
>  /*
>  * size of first charge trial. "32" comes from vmscan.c's magic value.
> @@ -1701,6 +1885,45 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
>        memcg_check_events(mem, pc->page);
>  }
>
> +/*
> + * Update file cache accounted statistics on task migration.
> + *
> + * TODO: We don't move charges of file (including shmem/tmpfs) pages for now.
> + * So, at the moment this function simply returns without updating accounted
> + * statistics, because we deal only with anonymous pages here.
> + */
> +static void __mem_cgroup_update_file_stat(struct page_cgroup *pc,
> +       struct mem_cgroup *from, struct mem_cgroup *to)
> +{
> +       struct page *page = pc->page;
> +
> +       if (!page_mapped(page) || PageAnon(page))
> +               return;
> +
> +       if (PageCgroupFileMapped(pc)) {
> +               __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> +               __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> +       }
> +       if (PageCgroupDirty(pc)) {
> +               __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_DIRTY]);
> +               __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_DIRTY]);
> +       }
> +       if (PageCgroupWriteback(pc)) {
> +               __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_WRITEBACK]);
> +               __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WRITEBACK]);
> +       }
> +       if (PageCgroupWritebackTemp(pc)) {
> +               __this_cpu_dec(
> +                       from->stat->count[MEM_CGROUP_STAT_WRITEBACK_TEMP]);
> +               __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WRITEBACK_TEMP]);
> +       }
> +       if (PageCgroupUnstableNFS(pc)) {
> +               __this_cpu_dec(
> +                       from->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
> +               __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
> +       }
> +}
> +
>  /**
>  * __mem_cgroup_move_account - move account of the page
>  * @pc:        page_cgroup of the page.
> @@ -1721,22 +1944,16 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
>  static void __mem_cgroup_move_account(struct page_cgroup *pc,
>        struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
>  {
> -       struct page *page;
> -
>        VM_BUG_ON(from == to);
>        VM_BUG_ON(PageLRU(pc->page));
>        VM_BUG_ON(!PageCgroupLocked(pc));
>        VM_BUG_ON(!PageCgroupUsed(pc));
>        VM_BUG_ON(pc->mem_cgroup != from);
>
> -       page = pc->page;
> -       if (page_mapped(page) && !PageAnon(page)) {
> -               /* Update mapped_file data for mem_cgroup */
> -               preempt_disable();
> -               __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -               __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -               preempt_enable();
> -       }
> +       preempt_disable();
> +       lock_page_cgroup_migrate(pc);
> +       __mem_cgroup_update_file_stat(pc, from, to);
> +
>        mem_cgroup_charge_statistics(from, pc, false);
>        if (uncharge)
>                /* This is not "cancel", but cancel_charge does all we need. */
> @@ -1745,6 +1962,8 @@ static void __mem_cgroup_move_account(struct page_cgroup *pc,
>        /* caller should have done css_get */
>        pc->mem_cgroup = to;
>        mem_cgroup_charge_statistics(to, pc, true);
> +       unlock_page_cgroup_migrate(pc);
> +       preempt_enable();
>        /*
>         * We charge against "to", which may not have any tasks. Then, "to"
>         * can be under rmdir(). But in current implementation, caller of
> @@ -3042,6 +3261,10 @@ enum {
>        MCS_PGPGIN,
>        MCS_PGPGOUT,
>        MCS_SWAP,
> +       MCS_FILE_DIRTY,
> +       MCS_WRITEBACK,
> +       MCS_WRITEBACK_TEMP,
> +       MCS_UNSTABLE_NFS,
>        MCS_INACTIVE_ANON,
>        MCS_ACTIVE_ANON,
>        MCS_INACTIVE_FILE,
> @@ -3064,6 +3287,10 @@ struct {
>        {"pgpgin", "total_pgpgin"},
>        {"pgpgout", "total_pgpgout"},
>        {"swap", "total_swap"},
> +       {"filedirty", "dirty_pages"},
> +       {"writeback", "writeback_pages"},
> +       {"writeback_tmp", "writeback_temp_pages"},
> +       {"nfs", "nfs_unstable"},
>        {"inactive_anon", "total_inactive_anon"},
>        {"active_anon", "total_active_anon"},
>        {"inactive_file", "total_inactive_file"},
> @@ -3092,6 +3319,14 @@ static int mem_cgroup_get_local_stat(struct mem_cgroup *mem, void *data)
>                val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
>                s->stat[MCS_SWAP] += val * PAGE_SIZE;
>        }
> +       val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
> +       s->stat[MCS_FILE_DIRTY] += val;
> +       val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK);
> +       s->stat[MCS_WRITEBACK] += val;
> +       val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK_TEMP);
> +       s->stat[MCS_WRITEBACK_TEMP] += val;
> +       val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_UNSTABLE_NFS);
> +       s->stat[MCS_UNSTABLE_NFS] += val;
>
>        /* per zone stat */
>        val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
> @@ -3453,6 +3688,60 @@ unlock:
>        return ret;
>  }
>
> +static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
> +{
> +       struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +
> +       switch (cft->private) {
> +       case MEM_CGROUP_DIRTY_RATIO:
> +               return memcg->dirty_param.dirty_ratio;
> +       case MEM_CGROUP_DIRTY_BYTES:
> +               return memcg->dirty_param.dirty_bytes;
> +       case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
> +               return memcg->dirty_param.dirty_background_ratio;
> +       case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
> +               return memcg->dirty_param.dirty_background_bytes;
> +       default:
> +               BUG();
> +       }
> +}
> +
> +static int
> +mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
> +{
> +       struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +       int type = cft->private;
> +
> +       if (cgrp->parent == NULL)
> +               return -EINVAL;
> +       if ((type == MEM_CGROUP_DIRTY_RATIO ||
> +               type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)
> +               return -EINVAL;
> +       /*
> +        * TODO: provide a validation check routine. And retry if validation
> +        * fails.
> +        */
> +       switch (type) {
> +       case MEM_CGROUP_DIRTY_RATIO:
> +               memcg->dirty_param.dirty_ratio = val;
> +               memcg->dirty_param.dirty_bytes = 0;
> +               break;
> +       case MEM_CGROUP_DIRTY_BYTES:
> +               memcg->dirty_param.dirty_ratio  = 0;
> +               memcg->dirty_param.dirty_bytes = val;
> +               break;
> +       case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
> +               memcg->dirty_param.dirty_background_ratio = val;
> +               memcg->dirty_param.dirty_background_bytes = 0;
> +               break;
> +       case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
> +               memcg->dirty_param.dirty_background_ratio = 0;
> +               memcg->dirty_param.dirty_background_bytes = val;
> +               break;

default:
        BUG();
        break;

> +       }
> +       return 0;
> +}
> +
>  static struct cftype mem_cgroup_files[] = {
>        {
>                .name = "usage_in_bytes",
> @@ -3504,6 +3793,30 @@ static struct cftype mem_cgroup_files[] = {
>                .write_u64 = mem_cgroup_swappiness_write,
>        },
>        {
> +               .name = "dirty_ratio",
> +               .read_u64 = mem_cgroup_dirty_read,
> +               .write_u64 = mem_cgroup_dirty_write,
> +               .private = MEM_CGROUP_DIRTY_RATIO,
> +       },
> +       {
> +               .name = "dirty_bytes",
> +               .read_u64 = mem_cgroup_dirty_read,
> +               .write_u64 = mem_cgroup_dirty_write,
> +               .private = MEM_CGROUP_DIRTY_BYTES,
> +       },
> +       {
> +               .name = "dirty_background_ratio",
> +               .read_u64 = mem_cgroup_dirty_read,
> +               .write_u64 = mem_cgroup_dirty_write,
> +               .private = MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> +       },
> +       {
> +               .name = "dirty_background_bytes",
> +               .read_u64 = mem_cgroup_dirty_read,
> +               .write_u64 = mem_cgroup_dirty_write,
> +               .private = MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> +       },
> +       {
>                .name = "move_charge_at_immigrate",
>                .read_u64 = mem_cgroup_move_charge_read,
>                .write_u64 = mem_cgroup_move_charge_write,
> @@ -3762,8 +4075,21 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>        mem->last_scanned_child = 0;
>        spin_lock_init(&mem->reclaim_param_lock);
>
> -       if (parent)
> +       if (parent) {
>                mem->swappiness = get_swappiness(parent);
> +               mem->dirty_param = parent->dirty_param;
> +       } else {
> +               while (1) {
> +                       get_global_dirty_param(&mem->dirty_param);
> +                       /*
> +                        * Since global dirty parameters are not protected we
> +                        * try to speculatively read them and retry if we get
> +                        * inconsistent values.
> +                        */
> +                       if (likely(dirty_param_is_valid(&mem->dirty_param)))
> +                               break;
> +               }
> +       }
>        atomic_set(&mem->refcnt, 1);
>        mem->move_charge_at_immigrate = 0;
>        mutex_init(&mem->thresholds_lock);
> --
> 1.6.3.3
>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure
@ 2010-03-04 11:54     ` Kirill A. Shutemov
  0 siblings, 0 replies; 68+ messages in thread
From: Kirill A. Shutemov @ 2010-03-04 11:54 UTC (permalink / raw)
  To: Andrea Righi
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Vivek Goyal, Peter Zijlstra,
	Trond Myklebust, Suleiman Souhlal, Greg Thelen,
	Daisuke Nishimura, Andrew Morton, containers, linux-kernel,
	linux-mm

On Thu, Mar 4, 2010 at 12:40 PM, Andrea Righi <arighi@develer.com> wrote:
> Infrastructure to account dirty pages per cgroup and add dirty limit
> interfaces in the cgroupfs:
>
>  - Direct write-out: memory.dirty_ratio, memory.dirty_bytes
>
>  - Background write-out: memory.dirty_background_ratio, memory.dirty_background_bytes
>
> Signed-off-by: Andrea Righi <arighi@develer.com>
> ---
>  include/linux/memcontrol.h |   80 ++++++++-
>  mm/memcontrol.c            |  420 +++++++++++++++++++++++++++++++++++++++-----
>  2 files changed, 450 insertions(+), 50 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 1f9b119..cc3421b 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -19,12 +19,66 @@
>
>  #ifndef _LINUX_MEMCONTROL_H
>  #define _LINUX_MEMCONTROL_H
> +
> +#include <linux/writeback.h>
>  #include <linux/cgroup.h>
> +
>  struct mem_cgroup;
>  struct page_cgroup;
>  struct page;
>  struct mm_struct;
>
> +/* Cgroup memory statistics items exported to the kernel */
> +enum mem_cgroup_page_stat_item {
> +       MEMCG_NR_DIRTYABLE_PAGES,
> +       MEMCG_NR_RECLAIM_PAGES,
> +       MEMCG_NR_WRITEBACK,
> +       MEMCG_NR_DIRTY_WRITEBACK_PAGES,
> +};
> +
> +/* Dirty memory parameters */
> +struct dirty_param {
> +       int dirty_ratio;
> +       unsigned long dirty_bytes;
> +       int dirty_background_ratio;
> +       unsigned long dirty_background_bytes;
> +};
> +
> +/*
> + * Statistics for memory cgroup.
> + */
> +enum mem_cgroup_stat_index {
> +       /*
> +        * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> +        */
> +       MEM_CGROUP_STAT_CACHE,     /* # of pages charged as cache */
> +       MEM_CGROUP_STAT_RSS,       /* # of pages charged as anon rss */
> +       MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> +       MEM_CGROUP_STAT_PGPGIN_COUNT,   /* # of pages paged in */
> +       MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
> +       MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> +       MEM_CGROUP_EVENTS,      /* incremented at every  pagein/pageout */
> +       MEM_CGROUP_STAT_FILE_DIRTY,   /* # of dirty pages in page cache */
> +       MEM_CGROUP_STAT_WRITEBACK,   /* # of pages under writeback */
> +       MEM_CGROUP_STAT_WRITEBACK_TEMP,   /* # of pages under writeback using
> +                                               temporary buffers */
> +       MEM_CGROUP_STAT_UNSTABLE_NFS,   /* # of NFS unstable pages */
> +
> +       MEM_CGROUP_STAT_NSTATS,
> +};
> +
> +/*
> + * TODO: provide a validation check routine. And retry if validation
> + * fails.
> + */
> +static inline void get_global_dirty_param(struct dirty_param *param)
> +{
> +       param->dirty_ratio = vm_dirty_ratio;
> +       param->dirty_bytes = vm_dirty_bytes;
> +       param->dirty_background_ratio = dirty_background_ratio;
> +       param->dirty_background_bytes = dirty_background_bytes;
> +}
> +
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
>  /*
>  * All "charge" functions with gfp_mask should use GFP_KERNEL or
> @@ -117,6 +171,10 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
>  extern int do_swap_account;
>  #endif
>
> +extern bool mem_cgroup_has_dirty_limit(void);
> +extern void get_dirty_param(struct dirty_param *param);
> +extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item);
> +
>  static inline bool mem_cgroup_disabled(void)
>  {
>        if (mem_cgroup_subsys.disabled)
> @@ -125,7 +183,8 @@ static inline bool mem_cgroup_disabled(void)
>  }
>
>  extern bool mem_cgroup_oom_called(struct task_struct *task);
> -void mem_cgroup_update_file_mapped(struct page *page, int val);
> +void mem_cgroup_update_stat(struct page *page,
> +                       enum mem_cgroup_stat_index idx, int val);
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>                                                gfp_t gfp_mask, int nid,
>                                                int zid);
> @@ -300,8 +359,8 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
>  {
>  }
>
> -static inline void mem_cgroup_update_file_mapped(struct page *page,
> -                                                       int val)
> +static inline void mem_cgroup_update_stat(struct page *page,
> +                       enum mem_cgroup_stat_index idx, int val)
>  {
>  }
>
> @@ -312,6 +371,21 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>        return 0;
>  }
>
> +static inline bool mem_cgroup_has_dirty_limit(void)
> +{
> +       return false;
> +}
> +
> +static inline void get_dirty_param(struct dirty_param *param)
> +{
> +       get_global_dirty_param(param);
> +}
> +
> +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> +{
> +       return -ENOSYS;
> +}
> +
>  #endif /* CONFIG_CGROUP_MEM_CONT */
>
>  #endif /* _LINUX_MEMCONTROL_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 497b6f7..9842e7b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -73,28 +73,23 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
>  #define THRESHOLDS_EVENTS_THRESH (7) /* once in 128 */
>  #define SOFTLIMIT_EVENTS_THRESH (10) /* once in 1024 */
>
> -/*
> - * Statistics for memory cgroup.
> - */
> -enum mem_cgroup_stat_index {
> -       /*
> -        * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> -        */
> -       MEM_CGROUP_STAT_CACHE,     /* # of pages charged as cache */
> -       MEM_CGROUP_STAT_RSS,       /* # of pages charged as anon rss */
> -       MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> -       MEM_CGROUP_STAT_PGPGIN_COUNT,   /* # of pages paged in */
> -       MEM_CGROUP_STAT_PGPGOUT_COUNT,  /* # of pages paged out */
> -       MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> -       MEM_CGROUP_EVENTS,      /* incremented at every  pagein/pageout */
> -
> -       MEM_CGROUP_STAT_NSTATS,
> -};
> -
>  struct mem_cgroup_stat_cpu {
>        s64 count[MEM_CGROUP_STAT_NSTATS];
>  };
>
> +/* Per cgroup page statistics */
> +struct mem_cgroup_page_stat {
> +       enum mem_cgroup_page_stat_item item;
> +       s64 value;
> +};
> +
> +enum {
> +       MEM_CGROUP_DIRTY_RATIO,
> +       MEM_CGROUP_DIRTY_BYTES,
> +       MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> +       MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> +};
> +
>  /*
>  * per-zone information in memory controller.
>  */
> @@ -208,6 +203,9 @@ struct mem_cgroup {
>
>        unsigned int    swappiness;
>
> +       /* control memory cgroup dirty pages */
> +       struct dirty_param dirty_param;
> +
>        /* set when res.limit == memsw.limit */
>        bool            memsw_is_minimum;
>
> @@ -1033,6 +1031,156 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
>        return swappiness;
>  }
>
> +static bool dirty_param_is_valid(struct dirty_param *param)
> +{
> +       if (param->dirty_ratio && param->dirty_bytes)
> +               return false;
> +       if (param->dirty_background_ratio && param->dirty_background_bytes)
> +               return false;
> +       return true;
> +}
> +
> +static void
> +__mem_cgroup_get_dirty_param(struct dirty_param *param, struct mem_cgroup *mem)
> +{
> +       param->dirty_ratio = mem->dirty_param.dirty_ratio;
> +       param->dirty_bytes = mem->dirty_param.dirty_bytes;
> +       param->dirty_background_ratio = mem->dirty_param.dirty_background_ratio;
> +       param->dirty_background_bytes = mem->dirty_param.dirty_background_bytes;
> +}
> +
> +/*
> + * get_dirty_param() - get dirty memory parameters of the current memcg
> + * @param:     a structure is filled with the dirty memory settings
> + *
> + * The function fills @param with the current memcg dirty memory settings. If
> + * memory cgroup is disabled or in case of error the structure is filled with
> + * the global dirty memory settings.
> + */
> +void get_dirty_param(struct dirty_param *param)
> +{
> +       struct mem_cgroup *memcg;
> +
> +       if (mem_cgroup_disabled()) {
> +               get_global_dirty_param(param);
> +               return;
> +       }
> +       /*
> +        * It's possible that "current" may be moved to other cgroup while we
> +        * access cgroup. But precise check is meaningless because the task can
> +        * be moved after our access and writeback tends to take long time.
> +        * At least, "memcg" will not be freed under rcu_read_lock().
> +        */
> +       while (1) {
> +               rcu_read_lock();
> +               memcg = mem_cgroup_from_task(current);
> +               if (likely(memcg))
> +                       __mem_cgroup_get_dirty_param(param, memcg);
> +               else
> +                       get_global_dirty_param(param);
> +               rcu_read_unlock();
> +               /*
> +                * Since global and memcg dirty_param are not protected we try
> +                * to speculatively read them and retry if we get inconsistent
> +                * values.
> +                */
> +               if (likely(dirty_param_is_valid(param)))
> +                       break;
> +       }
> +}
> +
> +static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
> +{
> +       if (!do_swap_account)
> +               return nr_swap_pages > 0;
> +       return !memcg->memsw_is_minimum &&
> +               (res_counter_read_u64(&memcg->memsw, RES_LIMIT) > 0);
> +}
> +
> +static s64 mem_cgroup_get_local_page_stat(struct mem_cgroup *memcg,
> +                               enum mem_cgroup_page_stat_item item)
> +{
> +       s64 ret;
> +
> +       switch (item) {
> +       case MEMCG_NR_DIRTYABLE_PAGES:
> +               ret = res_counter_read_u64(&memcg->res, RES_LIMIT) -
> +                       res_counter_read_u64(&memcg->res, RES_USAGE);
> +               /* Translate free memory in pages */
> +               ret >>= PAGE_SHIFT;
> +               ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_FILE) +
> +                       mem_cgroup_read_stat(memcg, LRU_INACTIVE_FILE);
> +               if (mem_cgroup_can_swap(memcg))
> +                       ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_ANON) +
> +                               mem_cgroup_read_stat(memcg, LRU_INACTIVE_ANON);
> +               break;
> +       case MEMCG_NR_RECLAIM_PAGES:
> +               ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_FILE_DIRTY) +
> +                       mem_cgroup_read_stat(memcg,
> +                                       MEM_CGROUP_STAT_UNSTABLE_NFS);
> +               break;
> +       case MEMCG_NR_WRITEBACK:
> +               ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
> +               break;
> +       case MEMCG_NR_DIRTY_WRITEBACK_PAGES:
> +               ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK) +
> +                       mem_cgroup_read_stat(memcg,
> +                               MEM_CGROUP_STAT_UNSTABLE_NFS);
> +               break;
> +       default:
> +               BUG_ON(1);

Just BUG()?
And add 'break;', please.
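
As a rough sketch of the change being asked for here (not part of the posted
patch), the default arm would become:

	default:
		BUG();
		break;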

> +       }
> +       return ret;
> +}
> +
> +static int mem_cgroup_page_stat_cb(struct mem_cgroup *mem, void *data)
> +{
> +       struct mem_cgroup_page_stat *stat = (struct mem_cgroup_page_stat *)data;
> +
> +       stat->value += mem_cgroup_get_local_page_stat(mem, stat->item);
> +       return 0;
> +}
> +
> +/*
> + * mem_cgroup_has_dirty_limit() - check if current memcg has local dirty limits
> + *
> + * Return true if the current memory cgroup has local dirty memory settings,
> + * false otherwise.
> + */
> +bool mem_cgroup_has_dirty_limit(void)
> +{
> +       if (mem_cgroup_disabled())
> +               return false;
> +       return mem_cgroup_from_task(current) != NULL;
> +}
> +
> +/*
> + * mem_cgroup_page_stat() - get memory cgroup file cache statistics
> + * @item:      memory statistic item exported to the kernel
> + *
> + * Return the accounted statistic value, or a negative value in case of error.
> + */
> +s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> +{
> +       struct mem_cgroup_page_stat stat = {};
> +       struct mem_cgroup *memcg;
> +
> +       rcu_read_lock();
> +       memcg = mem_cgroup_from_task(current);
> +       if (memcg) {
> +               /*
> +                * Recursively evaluate page statistics against all cgroups
> +                * under the hierarchy tree
> +                */
> +               stat.item = item;
> +               mem_cgroup_walk_tree(memcg, &stat, mem_cgroup_page_stat_cb);
> +       } else
> +               stat.value = -EINVAL;
> +       rcu_read_unlock();
> +
> +       return stat.value;
> +}
> +
>  static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data)
>  {
>        int *val = data;
> @@ -1275,34 +1423,70 @@ static void record_last_oom(struct mem_cgroup *mem)
>  }
>
>  /*
> - * Currently used to update mapped file statistics, but the routine can be
> - * generalized to update other statistics as well.
> + * Generalized routine to update file cache's status for memcg.
> + *
> + * Before calling this, mapping->tree_lock should be held and preemption is
> + * disabled.  Then, it's guaranteed that the page is not uncharged while we
> + * access page_cgroup. We can make use of that.
>  */
> -void mem_cgroup_update_file_mapped(struct page *page, int val)
> +void mem_cgroup_update_stat(struct page *page,
> +                       enum mem_cgroup_stat_index idx, int val)
>  {
>        struct mem_cgroup *mem;
>        struct page_cgroup *pc;
>
> +       if (mem_cgroup_disabled())
> +               return;
>        pc = lookup_page_cgroup(page);
> -       if (unlikely(!pc))
> +       if (unlikely(!pc) || !PageCgroupUsed(pc))
>                return;
>
> -       lock_page_cgroup(pc);
> -       mem = pc->mem_cgroup;
> -       if (!mem)
> -               goto done;
> -
> -       if (!PageCgroupUsed(pc))
> -               goto done;
> -
> +       lock_page_cgroup_migrate(pc);
>        /*
> -        * Preemption is already disabled. We can use __this_cpu_xxx
> -        */
> -       __this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], val);
> -
> -done:
> -       unlock_page_cgroup(pc);
> +       * It's guaranteed that this page is never uncharged.
> +       * The only racy problem is moving account among memcgs.
> +       */
> +       switch (idx) {
> +       case MEM_CGROUP_STAT_FILE_MAPPED:
> +               if (val > 0)
> +                       SetPageCgroupFileMapped(pc);
> +               else
> +                       ClearPageCgroupFileMapped(pc);
> +               break;
> +       case MEM_CGROUP_STAT_FILE_DIRTY:
> +               if (val > 0)
> +                       SetPageCgroupDirty(pc);
> +               else
> +                       ClearPageCgroupDirty(pc);
> +               break;
> +       case MEM_CGROUP_STAT_WRITEBACK:
> +               if (val > 0)
> +                       SetPageCgroupWriteback(pc);
> +               else
> +                       ClearPageCgroupWriteback(pc);
> +               break;
> +       case MEM_CGROUP_STAT_WRITEBACK_TEMP:
> +               if (val > 0)
> +                       SetPageCgroupWritebackTemp(pc);
> +               else
> +                       ClearPageCgroupWritebackTemp(pc);
> +               break;
> +       case MEM_CGROUP_STAT_UNSTABLE_NFS:
> +               if (val > 0)
> +                       SetPageCgroupUnstableNFS(pc);
> +               else
> +                       ClearPageCgroupUnstableNFS(pc);
> +               break;
> +       default:
> +               BUG();
> +               break;
> +       }
> +       mem = pc->mem_cgroup;
> +       if (likely(mem))
> +               __this_cpu_add(mem->stat->count[idx], val);
> +       unlock_page_cgroup_migrate(pc);
>  }
> +EXPORT_SYMBOL_GPL(mem_cgroup_update_stat);
>
>  /*
>  * size of first charge trial. "32" comes from vmscan.c's magic value.
> @@ -1701,6 +1885,45 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
>        memcg_check_events(mem, pc->page);
>  }
>
> +/*
> + * Update file cache accounted statistics on task migration.
> + *
> + * TODO: We don't move charges of file (including shmem/tmpfs) pages for now.
> + * So, at the moment this function simply returns without updating accounted
> + * statistics, because we deal only with anonymous pages here.
> + */
> +static void __mem_cgroup_update_file_stat(struct page_cgroup *pc,
> +       struct mem_cgroup *from, struct mem_cgroup *to)
> +{
> +       struct page *page = pc->page;
> +
> +       if (!page_mapped(page) || PageAnon(page))
> +               return;
> +
> +       if (PageCgroupFileMapped(pc)) {
> +               __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> +               __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> +       }
> +       if (PageCgroupDirty(pc)) {
> +               __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_DIRTY]);
> +               __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_DIRTY]);
> +       }
> +       if (PageCgroupWriteback(pc)) {
> +               __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_WRITEBACK]);
> +               __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WRITEBACK]);
> +       }
> +       if (PageCgroupWritebackTemp(pc)) {
> +               __this_cpu_dec(
> +                       from->stat->count[MEM_CGROUP_STAT_WRITEBACK_TEMP]);
> +               __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WRITEBACK_TEMP]);
> +       }
> +       if (PageCgroupUnstableNFS(pc)) {
> +               __this_cpu_dec(
> +                       from->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
> +               __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
> +       }
> +}
> +
>  /**
>  * __mem_cgroup_move_account - move account of the page
>  * @pc:        page_cgroup of the page.
> @@ -1721,22 +1944,16 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
>  static void __mem_cgroup_move_account(struct page_cgroup *pc,
>        struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
>  {
> -       struct page *page;
> -
>        VM_BUG_ON(from == to);
>        VM_BUG_ON(PageLRU(pc->page));
>        VM_BUG_ON(!PageCgroupLocked(pc));
>        VM_BUG_ON(!PageCgroupUsed(pc));
>        VM_BUG_ON(pc->mem_cgroup != from);
>
> -       page = pc->page;
> -       if (page_mapped(page) && !PageAnon(page)) {
> -               /* Update mapped_file data for mem_cgroup */
> -               preempt_disable();
> -               __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -               __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -               preempt_enable();
> -       }
> +       preempt_disable();
> +       lock_page_cgroup_migrate(pc);
> +       __mem_cgroup_update_file_stat(pc, from, to);
> +
>        mem_cgroup_charge_statistics(from, pc, false);
>        if (uncharge)
>                /* This is not "cancel", but cancel_charge does all we need. */
> @@ -1745,6 +1962,8 @@ static void __mem_cgroup_move_account(struct page_cgroup *pc,
>        /* caller should have done css_get */
>        pc->mem_cgroup = to;
>        mem_cgroup_charge_statistics(to, pc, true);
> +       unlock_page_cgroup_migrate(pc);
> +       preempt_enable();
>        /*
>         * We charge against "to", which may not have any tasks. Then, "to"
>         * can be under rmdir(). But in current implementation, caller of
> @@ -3042,6 +3261,10 @@ enum {
>        MCS_PGPGIN,
>        MCS_PGPGOUT,
>        MCS_SWAP,
> +       MCS_FILE_DIRTY,
> +       MCS_WRITEBACK,
> +       MCS_WRITEBACK_TEMP,
> +       MCS_UNSTABLE_NFS,
>        MCS_INACTIVE_ANON,
>        MCS_ACTIVE_ANON,
>        MCS_INACTIVE_FILE,
> @@ -3064,6 +3287,10 @@ struct {
>        {"pgpgin", "total_pgpgin"},
>        {"pgpgout", "total_pgpgout"},
>        {"swap", "total_swap"},
> +       {"filedirty", "dirty_pages"},
> +       {"writeback", "writeback_pages"},
> +       {"writeback_tmp", "writeback_temp_pages"},
> +       {"nfs", "nfs_unstable"},
>        {"inactive_anon", "total_inactive_anon"},
>        {"active_anon", "total_active_anon"},
>        {"inactive_file", "total_inactive_file"},
> @@ -3092,6 +3319,14 @@ static int mem_cgroup_get_local_stat(struct mem_cgroup *mem, void *data)
>                val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
>                s->stat[MCS_SWAP] += val * PAGE_SIZE;
>        }
> +       val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
> +       s->stat[MCS_FILE_DIRTY] += val;
> +       val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK);
> +       s->stat[MCS_WRITEBACK] += val;
> +       val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK_TEMP);
> +       s->stat[MCS_WRITEBACK_TEMP] += val;
> +       val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_UNSTABLE_NFS);
> +       s->stat[MCS_UNSTABLE_NFS] += val;
>
>        /* per zone stat */
>        val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
> @@ -3453,6 +3688,60 @@ unlock:
>        return ret;
>  }
>
> +static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
> +{
> +       struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +
> +       switch (cft->private) {
> +       case MEM_CGROUP_DIRTY_RATIO:
> +               return memcg->dirty_param.dirty_ratio;
> +       case MEM_CGROUP_DIRTY_BYTES:
> +               return memcg->dirty_param.dirty_bytes;
> +       case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
> +               return memcg->dirty_param.dirty_background_ratio;
> +       case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
> +               return memcg->dirty_param.dirty_background_bytes;
> +       default:
> +               BUG();
> +       }
> +}
> +
> +static int
> +mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
> +{
> +       struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +       int type = cft->private;
> +
> +       if (cgrp->parent == NULL)
> +               return -EINVAL;
> +       if ((type == MEM_CGROUP_DIRTY_RATIO ||
> +               type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)
> +               return -EINVAL;
> +       /*
> +        * TODO: provide a validation check routine. And retry if validation
> +        * fails.
> +        */
> +       switch (type) {
> +       case MEM_CGROUP_DIRTY_RATIO:
> +               memcg->dirty_param.dirty_ratio = val;
> +               memcg->dirty_param.dirty_bytes = 0;
> +               break;
> +       case MEM_CGROUP_DIRTY_BYTES:
> +               memcg->dirty_param.dirty_ratio  = 0;
> +               memcg->dirty_param.dirty_bytes = val;
> +               break;
> +       case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
> +               memcg->dirty_param.dirty_background_ratio = val;
> +               memcg->dirty_param.dirty_background_bytes = 0;
> +               break;
> +       case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
> +               memcg->dirty_param.dirty_background_ratio = 0;
> +               memcg->dirty_param.dirty_background_bytes = val;
> +               break;

default:
        BUG();
        break;

> +       }
> +       return 0;
> +}
> +
>  static struct cftype mem_cgroup_files[] = {
>        {
>                .name = "usage_in_bytes",
> @@ -3504,6 +3793,30 @@ static struct cftype mem_cgroup_files[] = {
>                .write_u64 = mem_cgroup_swappiness_write,
>        },
>        {
> +               .name = "dirty_ratio",
> +               .read_u64 = mem_cgroup_dirty_read,
> +               .write_u64 = mem_cgroup_dirty_write,
> +               .private = MEM_CGROUP_DIRTY_RATIO,
> +       },
> +       {
> +               .name = "dirty_bytes",
> +               .read_u64 = mem_cgroup_dirty_read,
> +               .write_u64 = mem_cgroup_dirty_write,
> +               .private = MEM_CGROUP_DIRTY_BYTES,
> +       },
> +       {
> +               .name = "dirty_background_ratio",
> +               .read_u64 = mem_cgroup_dirty_read,
> +               .write_u64 = mem_cgroup_dirty_write,
> +               .private = MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> +       },
> +       {
> +               .name = "dirty_background_bytes",
> +               .read_u64 = mem_cgroup_dirty_read,
> +               .write_u64 = mem_cgroup_dirty_write,
> +               .private = MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> +       },
> +       {
>                .name = "move_charge_at_immigrate",
>                .read_u64 = mem_cgroup_move_charge_read,
>                .write_u64 = mem_cgroup_move_charge_write,
> @@ -3762,8 +4075,21 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>        mem->last_scanned_child = 0;
>        spin_lock_init(&mem->reclaim_param_lock);
>
> -       if (parent)
> +       if (parent) {
>                mem->swappiness = get_swappiness(parent);
> +               mem->dirty_param = parent->dirty_param;
> +       } else {
> +               while (1) {
> +                       get_global_dirty_param(&mem->dirty_param);
> +                       /*
> +                        * Since global dirty parameters are not protected we
> +                        * try to speculatively read them and retry if we get
> +                        * inconsistent values.
> +                        */
> +                       if (likely(dirty_param_is_valid(&mem->dirty_param)))
> +                               break;
> +               }
> +       }
>        atomic_set(&mem->refcnt, 1);
>        mem->move_charge_at_immigrate = 0;
>        mutex_init(&mem->thresholds_lock);
> --
> 1.6.3.3
>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH -mmotm 4/4] memcg: dirty pages instrumentation
  2010-03-04 10:40   ` Andrea Righi
@ 2010-03-04 16:18     ` Vivek Goyal
  -1 siblings, 0 replies; 68+ messages in thread
From: Vivek Goyal @ 2010-03-04 16:18 UTC (permalink / raw)
  To: Andrea Righi
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Peter Zijlstra, Trond Myklebust,
	Suleiman Souhlal, Greg Thelen, Daisuke Nishimura,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

On Thu, Mar 04, 2010 at 11:40:15AM +0100, Andrea Righi wrote:

[..]
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 5a0f8f3..c5d14ea 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -137,13 +137,16 @@ static struct prop_descriptor vm_dirties;
>   */
>  static int calc_period_shift(void)
>  {
> +	struct dirty_param dirty_param;
>  	unsigned long dirty_total;
>  
> -	if (vm_dirty_bytes)
> -		dirty_total = vm_dirty_bytes / PAGE_SIZE;
> +	get_dirty_param(&dirty_param);
> +
> +	if (dirty_param.dirty_bytes)
> +		dirty_total = dirty_param.dirty_bytes / PAGE_SIZE;
>  	else
> -		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> -				100;
> +		dirty_total = (dirty_param.dirty_ratio *
> +				determine_dirtyable_memory()) / 100;
>  	return 2 + ilog2(dirty_total - 1);
>  }
>  
> @@ -408,41 +411,46 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
>   */
>  unsigned long determine_dirtyable_memory(void)
>  {
> -	unsigned long x;
> -
> -	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> +	unsigned long memory;
> +	s64 memcg_memory;
>  
> +	memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
>  	if (!vm_highmem_is_dirtyable)
> -		x -= highmem_dirtyable_memory(x);
> -
> -	return x + 1;	/* Ensure that we never return 0 */
> +		memory -= highmem_dirtyable_memory(memory);
> +	if (mem_cgroup_has_dirty_limit())
> +		return memory + 1;

Should above be?
	if (!mem_cgroup_has_dirty_limit())
		return memory + 1;

Vivek

> +	memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
> +	return min((unsigned long)memcg_memory, memory + 1);
>  }
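
For reference, with the check inverted as suggested (Andrea agrees in a
follow-up), the resulting function would read roughly as follows; this is a
sketch assembled from the hunk quoted above, not the reposted patch:

unsigned long determine_dirtyable_memory(void)
{
	unsigned long memory;
	s64 memcg_memory;

	memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
	if (!vm_highmem_is_dirtyable)
		memory -= highmem_dirtyable_memory(memory);
	/* No per-cgroup dirty limit: use the global value only */
	if (!mem_cgroup_has_dirty_limit())
		return memory + 1;	/* Ensure that we never return 0 */
	/* Otherwise clamp to the memcg's dirtyable pages */
	memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
	return min((unsigned long)memcg_memory, memory + 1);
}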

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH -mmotm 4/4] memcg: dirty pages instrumentation
  2010-03-04 16:18     ` Vivek Goyal
@ 2010-03-04 16:28       ` Andrea Righi
  -1 siblings, 0 replies; 68+ messages in thread
From: Andrea Righi @ 2010-03-04 16:28 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Peter Zijlstra, Trond Myklebust,
	Suleiman Souhlal, Greg Thelen, Daisuke Nishimura,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

On Thu, Mar 04, 2010 at 11:18:28AM -0500, Vivek Goyal wrote:
> On Thu, Mar 04, 2010 at 11:40:15AM +0100, Andrea Righi wrote:
> 
> [..]
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index 5a0f8f3..c5d14ea 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -137,13 +137,16 @@ static struct prop_descriptor vm_dirties;
> >   */
> >  static int calc_period_shift(void)
> >  {
> > +	struct dirty_param dirty_param;
> >  	unsigned long dirty_total;
> >  
> > -	if (vm_dirty_bytes)
> > -		dirty_total = vm_dirty_bytes / PAGE_SIZE;
> > +	get_dirty_param(&dirty_param);
> > +
> > +	if (dirty_param.dirty_bytes)
> > +		dirty_total = dirty_param.dirty_bytes / PAGE_SIZE;
> >  	else
> > -		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> > -				100;
> > +		dirty_total = (dirty_param.dirty_ratio *
> > +				determine_dirtyable_memory()) / 100;
> >  	return 2 + ilog2(dirty_total - 1);
> >  }
> >  
> > @@ -408,41 +411,46 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
> >   */
> >  unsigned long determine_dirtyable_memory(void)
> >  {
> > -	unsigned long x;
> > -
> > -	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> > +	unsigned long memory;
> > +	s64 memcg_memory;
> >  
> > +	memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> >  	if (!vm_highmem_is_dirtyable)
> > -		x -= highmem_dirtyable_memory(x);
> > -
> > -	return x + 1;	/* Ensure that we never return 0 */
> > +		memory -= highmem_dirtyable_memory(memory);
> > +	if (mem_cgroup_has_dirty_limit())
> > +		return memory + 1;
> 
> Should above be?
> 	if (!mem_cgroup_has_dirty_limit())
> 		return memory + 1;

Very true.

I'll post another patch with this and Kirill's fixes.

Thanks,
-Andrea

> 
> Vivek
> 
> > +	memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
> > +	return min((unsigned long)memcg_memory, memory + 1);
> >  }

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH -mmotm 0/4] memcg: per cgroup dirty limit (v4)
  2010-03-04 10:40 ` Andrea Righi
@ 2010-03-04 17:11   ` Balbir Singh
  -1 siblings, 0 replies; 68+ messages in thread
From: Balbir Singh @ 2010-03-04 17:11 UTC (permalink / raw)
  To: Andrea Righi
  Cc: KAMEZAWA Hiroyuki, Vivek Goyal, Peter Zijlstra, Trond Myklebust,
	Suleiman Souhlal, Greg Thelen, Daisuke Nishimura,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

* Andrea Righi <arighi@develer.com> [2010-03-04 11:40:11]:

> Control the maximum amount of dirty pages a cgroup can have at any given time.
> 
> Per cgroup dirty limit is like fixing the max amount of dirty (hard to reclaim)
> page cache used by any cgroup. So, in case of multiple cgroup writers, they
> will not be able to consume more than their designated share of dirty pages and
> will be forced to perform write-out if they cross that limit.
> 
> The overall design is the following:
> 
>  - account dirty pages per cgroup
>  - limit the number of dirty pages via memory.dirty_ratio / memory.dirty_bytes
>    and memory.dirty_background_ratio / memory.dirty_background_bytes in
>    cgroupfs
>  - start to write-out (background or actively) when the cgroup limits are
>    exceeded
> 
> This feature is supposed to be strictly connected to any underlying IO
> controller implementation, so we can stop increasing dirty pages in VM layer
> and enforce a write-out before any cgroup will consume the global amount of
> dirty pages defined by the /proc/sys/vm/dirty_ratio|dirty_bytes and
> /proc/sys/vm/dirty_background_ratio|dirty_background_bytes limits.
> 
> Changelog (v3 -> v4)
> ~~~~~~~~~~~~~~~~~~~~~~
>  * handle the migration of tasks across different cgroups
>    NOTE: at the moment we don't move charges of file cache pages, so this
>    functionality is not immediately necessary. However, since the migration of
>    file cache pages is in plan, it is better to start handling file pages
>    anyway.
>  * properly account dirty pages in nilfs2
>    (thanks to Kirill A. Shutemov <kirill@shutemov.name>)
>  * lockless access to dirty memory parameters
>  * fix: page_cgroup lock must not be acquired under mapping->tree_lock
>    (thanks to Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> and
>     KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>)
>  * code restyling
>

This seems to be converging. What sort of tests are you running on
this patchset?

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH -mmotm 0/4] memcg: per cgroup dirty limit (v4)
@ 2010-03-04 17:11   ` Balbir Singh
  0 siblings, 0 replies; 68+ messages in thread
From: Balbir Singh @ 2010-03-04 17:11 UTC (permalink / raw)
  To: Andrea Righi
  Cc: KAMEZAWA Hiroyuki, Vivek Goyal, Peter Zijlstra, Trond Myklebust,
	Suleiman Souhlal, Greg Thelen, Daisuke Nishimura,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

* Andrea Righi <arighi@develer.com> [2010-03-04 11:40:11]:

> Control the maximum amount of dirty pages a cgroup can have at any given time.
> 
> Per cgroup dirty limit is like fixing the max amount of dirty (hard to reclaim)
> page cache used by any cgroup. So, in case of multiple cgroup writers, they
> will not be able to consume more than their designated share of dirty pages and
> will be forced to perform write-out if they cross that limit.
> 
> The overall design is the following:
> 
>  - account dirty pages per cgroup
>  - limit the number of dirty pages via memory.dirty_ratio / memory.dirty_bytes
>    and memory.dirty_background_ratio / memory.dirty_background_bytes in
>    cgroupfs
>  - start to write-out (background or actively) when the cgroup limits are
>    exceeded
> 
> This feature is supposed to be strictly connected to any underlying IO
> controller implementation, so we can stop increasing dirty pages in VM layer
> and enforce a write-out before any cgroup will consume the global amount of
> dirty pages defined by the /proc/sys/vm/dirty_ratio|dirty_bytes and
> /proc/sys/vm/dirty_background_ratio|dirty_background_bytes limits.
> 
> Changelog (v3 -> v4)
> ~~~~~~~~~~~~~~~~~~~~~~
>  * handle the migration of tasks across different cgroups
>    NOTE: at the moment we don't move charges of file cache pages, so this
>    functionality is not immediately necessary. However, since the migration of
>    file cache pages is in plan, it is better to start handling file pages
>    anyway.
>  * properly account dirty pages in nilfs2
>    (thanks to Kirill A. Shutemov <kirill@shutemov.name>)
>  * lockless access to dirty memory parameters
>  * fix: page_cgroup lock must not be acquired under mapping->tree_lock
>    (thanks to Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> and
>     KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>)
>  * code restyling
>

This seems to be converging. What sort of tests are you running on
this patchset?

-- 
	Three Cheers,
	Balbir


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH -mmotm 4/4] memcg: dirty pages instrumentation
  2010-03-04 10:40   ` Andrea Righi
  (?)
@ 2010-03-04 19:41       ` Vivek Goyal
  -1 siblings, 0 replies; 68+ messages in thread
From: Vivek Goyal @ 2010-03-04 19:41 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Trond Myklebust, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Suleiman Souhlal, Andrew Morton,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Balbir Singh

On Thu, Mar 04, 2010 at 11:40:15AM +0100, Andrea Righi wrote:

[..]
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 5a0f8f3..c5d14ea 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -137,13 +137,16 @@ static struct prop_descriptor vm_dirties;
>   */
>  static int calc_period_shift(void)
>  {
> +	struct dirty_param dirty_param;
>  	unsigned long dirty_total;
>  
> -	if (vm_dirty_bytes)
> -		dirty_total = vm_dirty_bytes / PAGE_SIZE;
> +	get_dirty_param(&dirty_param);
> +
> +	if (dirty_param.dirty_bytes)
> +		dirty_total = dirty_param.dirty_bytes / PAGE_SIZE;
>  	else
> -		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> -				100;
> +		dirty_total = (dirty_param.dirty_ratio *
> +				determine_dirtyable_memory()) / 100;
>  	return 2 + ilog2(dirty_total - 1);
>  }
>  

Hmm, I have been staring at this for some time and I think something is
wrong. I don't fully understand how the floating proportions work, but
this function seems to be calculating the period over which we need to
measure the proportions (the vm_completions and vm_dirties proportions).

And we recalculate this period (shift) when the admin updates dirty_ratio,
dirty_bytes, etc. In that case we recompute the global dirty limit, take
log2 of it, and use that as the period over which we monitor and calculate
the proportions.

If so, then it should be global and not per cgroup (because all our
accounting of bdi completions is global and not per cgroup).
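
If that is the case, one way to fix it would be to keep calc_period_shift()
looking only at the global parameters, i.e. roughly what the code does
before this patch (untested sketch, just to illustrate the point):

	static int calc_period_shift(void)
	{
		unsigned long dirty_total;

		if (vm_dirty_bytes)
			dirty_total = vm_dirty_bytes / PAGE_SIZE;
		else
			dirty_total = (vm_dirty_ratio *
					determine_dirtyable_memory()) / 100;
		return 2 + ilog2(dirty_total - 1);
	}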

PeterZ can tell us more about it. I am just raising the flag here to be
sure.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH -mmotm 4/4] memcg: dirty pages instrumentation
@ 2010-03-04 19:41       ` Vivek Goyal
  0 siblings, 0 replies; 68+ messages in thread
From: Vivek Goyal @ 2010-03-04 19:41 UTC (permalink / raw)
  To: Andrea Righi
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Peter Zijlstra, Trond Myklebust,
	Suleiman Souhlal, Greg Thelen, Daisuke Nishimura,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

On Thu, Mar 04, 2010 at 11:40:15AM +0100, Andrea Righi wrote:

[..]
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 5a0f8f3..c5d14ea 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -137,13 +137,16 @@ static struct prop_descriptor vm_dirties;
>   */
>  static int calc_period_shift(void)
>  {
> +	struct dirty_param dirty_param;
>  	unsigned long dirty_total;
>  
> -	if (vm_dirty_bytes)
> -		dirty_total = vm_dirty_bytes / PAGE_SIZE;
> +	get_dirty_param(&dirty_param);
> +
> +	if (dirty_param.dirty_bytes)
> +		dirty_total = dirty_param.dirty_bytes / PAGE_SIZE;
>  	else
> -		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> -				100;
> +		dirty_total = (dirty_param.dirty_ratio *
> +				determine_dirtyable_memory()) / 100;
>  	return 2 + ilog2(dirty_total - 1);
>  }
>  

Hmm, I have been staring at this for some time and I think something is
wrong. I don't fully understand how the floating proportions work, but
this function seems to be calculating the period over which we need to
measure the proportions (the vm_completions and vm_dirties proportions).

And we recalculate this period (shift) when the admin updates dirty_ratio,
dirty_bytes, etc. In that case we recompute the global dirty limit, take
log2 of it, and use that as the period over which we monitor and calculate
the proportions.

If so, then it should be global and not per cgroup (because all our
accounting of bdi completions is global and not per cgroup).

PeterZ can tell us more about it. I am just raising the flag here to be
sure.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH -mmotm 4/4] memcg: dirty pages instrumentation
@ 2010-03-04 19:41       ` Vivek Goyal
  0 siblings, 0 replies; 68+ messages in thread
From: Vivek Goyal @ 2010-03-04 19:41 UTC (permalink / raw)
  To: Andrea Righi
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Peter Zijlstra, Trond Myklebust,
	Suleiman Souhlal, Greg Thelen, Daisuke Nishimura,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

On Thu, Mar 04, 2010 at 11:40:15AM +0100, Andrea Righi wrote:

[..]
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 5a0f8f3..c5d14ea 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -137,13 +137,16 @@ static struct prop_descriptor vm_dirties;
>   */
>  static int calc_period_shift(void)
>  {
> +	struct dirty_param dirty_param;
>  	unsigned long dirty_total;
>  
> -	if (vm_dirty_bytes)
> -		dirty_total = vm_dirty_bytes / PAGE_SIZE;
> +	get_dirty_param(&dirty_param);
> +
> +	if (dirty_param.dirty_bytes)
> +		dirty_total = dirty_param.dirty_bytes / PAGE_SIZE;
>  	else
> -		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> -				100;
> +		dirty_total = (dirty_param.dirty_ratio *
> +				determine_dirtyable_memory()) / 100;
>  	return 2 + ilog2(dirty_total - 1);
>  }
>  

Hmm, I have been staring at this for some time and I think something is
wrong. I don't fully understand how the floating proportions work, but
this function seems to be calculating the period over which we need to
measure the proportions (the vm_completions and vm_dirties proportions).

And we recalculate this period (shift) when the admin updates dirty_ratio,
dirty_bytes, etc. In that case we recompute the global dirty limit, take
log2 of it, and use that as the period over which we monitor and calculate
the proportions.

If so, then it should be global and not per cgroup (because all our
accounting of bdi completions is global and not per cgroup).

PeterZ can tell us more about it. I am just raising the flag here to be
sure.

Thanks
Vivek


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH -mmotm 0/4] memcg: per cgroup dirty limit (v4)
       [not found]   ` <20100304171143.GG3073-SINUvgVNF2CyUtPGxGje5AC/G2K4zDHf@public.gmane.org>
@ 2010-03-04 21:37     ` Andrea Righi
  0 siblings, 0 replies; 68+ messages in thread
From: Andrea Righi @ 2010-03-04 21:37 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Trond Myklebust, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Suleiman Souhlal, Andrew Morton,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Vivek Goyal

On Thu, Mar 04, 2010 at 10:41:43PM +0530, Balbir Singh wrote:
> * Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org> [2010-03-04 11:40:11]:
> 
> > Control the maximum amount of dirty pages a cgroup can have at any given time.
> > 
> > Per cgroup dirty limit is like fixing the max amount of dirty (hard to reclaim)
> > page cache used by any cgroup. So, in case of multiple cgroup writers, they
> > will not be able to consume more than their designated share of dirty pages and
> > will be forced to perform write-out if they cross that limit.
> > 
> > The overall design is the following:
> > 
> >  - account dirty pages per cgroup
> >  - limit the number of dirty pages via memory.dirty_ratio / memory.dirty_bytes
> >    and memory.dirty_background_ratio / memory.dirty_background_bytes in
> >    cgroupfs
> >  - start to write-out (background or actively) when the cgroup limits are
> >    exceeded
> > 
> > This feature is supposed to be strictly connected to any underlying IO
> > controller implementation, so we can stop increasing dirty pages in VM layer
> > and enforce a write-out before any cgroup will consume the global amount of
> > dirty pages defined by the /proc/sys/vm/dirty_ratio|dirty_bytes and
> > /proc/sys/vm/dirty_background_ratio|dirty_background_bytes limits.
> > 
> > Changelog (v3 -> v4)
> > ~~~~~~~~~~~~~~~~~~~~~~
> >  * handle the migration of tasks across different cgroups
> >    NOTE: at the moment we don't move charges of file cache pages, so this
> >    functionality is not immediately necessary. However, since the migration of
> >    file cache pages is in plan, it is better to start handling file pages
> >    anyway.
> >  * properly account dirty pages in nilfs2
> >    (thanks to Kirill A. Shutemov <kirill-oKw7cIdHH8eLwutG50LtGA@public.gmane.org>)
> >  * lockless access to dirty memory parameters
> >  * fix: page_cgroup lock must not be acquired under mapping->tree_lock
> >    (thanks to Daisuke Nishimura <nishimura-YQH0OdQVrdy45+QrQBaojngSJqDPrsil@public.gmane.org> and
> >     KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>)
> >  * code restyling
> >
> 
> This seems to be converging, what sort of tests are you running on
> this patchset? 

A very simple test at the moment, just some parallel dd's running in
different cgroups. For example:

 - cgroup A: low dirty limits (writes are almost sync)
   echo 1000 > /cgroups/A/memory.dirty_bytes
   echo 1000 > /cgroups/A/memory.dirty_background_bytes

 - cgroup B: high dirty limits (writes are all buffered in page cache)
   echo 100 > /cgroups/B/memory.dirty_ratio
   echo 50  > /cgroups/B/memory.dirty_background_ratio

Then run the dd's and look at memory.stat:
  - cgroup A: # dd if=/dev/zero of=A bs=1M count=1000
  - cgroup B: # dd if=/dev/zero of=B bs=1M count=1000

A random snapshot during the writes:

# grep "dirty\|writeback" /cgroups/[AB]/memory.stat
/cgroups/A/memory.stat:filedirty 0
/cgroups/A/memory.stat:writeback 0
/cgroups/A/memory.stat:writeback_tmp 0
/cgroups/A/memory.stat:dirty_pages 0
/cgroups/A/memory.stat:writeback_pages 0
/cgroups/A/memory.stat:writeback_temp_pages 0
/cgroups/B/memory.stat:filedirty 67226
/cgroups/B/memory.stat:writeback 136
/cgroups/B/memory.stat:writeback_tmp 0
/cgroups/B/memory.stat:dirty_pages 67226
/cgroups/B/memory.stat:writeback_pages 136
/cgroups/B/memory.stat:writeback_temp_pages 0

I plan to run more detailed IO benchmarks soon.

-Andrea

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH -mmotm 0/4] memcg: per cgroup dirty limit (v4)
  2010-03-04 17:11   ` Balbir Singh
@ 2010-03-04 21:37     ` Andrea Righi
  -1 siblings, 0 replies; 68+ messages in thread
From: Andrea Righi @ 2010-03-04 21:37 UTC (permalink / raw)
  To: Balbir Singh
  Cc: KAMEZAWA Hiroyuki, Vivek Goyal, Peter Zijlstra, Trond Myklebust,
	Suleiman Souhlal, Greg Thelen, Daisuke Nishimura,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

On Thu, Mar 04, 2010 at 10:41:43PM +0530, Balbir Singh wrote:
> * Andrea Righi <arighi@develer.com> [2010-03-04 11:40:11]:
> 
> > Control the maximum amount of dirty pages a cgroup can have at any given time.
> > 
> > Per cgroup dirty limit is like fixing the max amount of dirty (hard to reclaim)
> > page cache used by any cgroup. So, in case of multiple cgroup writers, they
> > will not be able to consume more than their designated share of dirty pages and
> > will be forced to perform write-out if they cross that limit.
> > 
> > The overall design is the following:
> > 
> >  - account dirty pages per cgroup
> >  - limit the number of dirty pages via memory.dirty_ratio / memory.dirty_bytes
> >    and memory.dirty_background_ratio / memory.dirty_background_bytes in
> >    cgroupfs
> >  - start to write-out (background or actively) when the cgroup limits are
> >    exceeded
> > 
> > This feature is supposed to be strictly connected to any underlying IO
> > controller implementation, so we can stop increasing dirty pages in VM layer
> > and enforce a write-out before any cgroup will consume the global amount of
> > dirty pages defined by the /proc/sys/vm/dirty_ratio|dirty_bytes and
> > /proc/sys/vm/dirty_background_ratio|dirty_background_bytes limits.
> > 
> > Changelog (v3 -> v4)
> > ~~~~~~~~~~~~~~~~~~~~~~
> >  * handle the migration of tasks across different cgroups
> >    NOTE: at the moment we don't move charges of file cache pages, so this
> >    functionality is not immediately necessary. However, since the migration of
> >    file cache pages is in plan, it is better to start handling file pages
> >    anyway.
> >  * properly account dirty pages in nilfs2
> >    (thanks to Kirill A. Shutemov <kirill@shutemov.name>)
> >  * lockless access to dirty memory parameters
> >  * fix: page_cgroup lock must not be acquired under mapping->tree_lock
> >    (thanks to Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> and
> >     KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>)
> >  * code restyling
> >
> 
> This seems to be converging, what sort of tests are you running on
> this patchset? 

A very simple test at the moment, just some parallel dd's running in
different cgroups. For example:

 - cgroup A: low dirty limits (writes are almost sync)
   echo 1000 > /cgroups/A/memory.dirty_bytes
   echo 1000 > /cgroups/A/memory.dirty_background_bytes

 - cgroup B: high dirty limits (writes are all buffered in page cache)
   echo 100 > /cgroups/B/memory.dirty_ratio
   echo 50  > /cgroups/B/memory.dirty_background_ratio

Then run the dd's and look at memory.stat:
  - cgroup A: # dd if=/dev/zero of=A bs=1M count=1000
  - cgroup B: # dd if=/dev/zero of=B bs=1M count=1000

A random snapshot during the writes:

# grep "dirty\|writeback" /cgroups/[AB]/memory.stat
/cgroups/A/memory.stat:filedirty 0
/cgroups/A/memory.stat:writeback 0
/cgroups/A/memory.stat:writeback_tmp 0
/cgroups/A/memory.stat:dirty_pages 0
/cgroups/A/memory.stat:writeback_pages 0
/cgroups/A/memory.stat:writeback_temp_pages 0
/cgroups/B/memory.stat:filedirty 67226
/cgroups/B/memory.stat:writeback 136
/cgroups/B/memory.stat:writeback_tmp 0
/cgroups/B/memory.stat:dirty_pages 67226
/cgroups/B/memory.stat:writeback_pages 136
/cgroups/B/memory.stat:writeback_temp_pages 0

I plan to run more detailed IO benchmarks soon.

-Andrea

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH -mmotm 0/4] memcg: per cgroup dirty limit (v4)
@ 2010-03-04 21:37     ` Andrea Righi
  0 siblings, 0 replies; 68+ messages in thread
From: Andrea Righi @ 2010-03-04 21:37 UTC (permalink / raw)
  To: Balbir Singh
  Cc: KAMEZAWA Hiroyuki, Vivek Goyal, Peter Zijlstra, Trond Myklebust,
	Suleiman Souhlal, Greg Thelen, Daisuke Nishimura,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

On Thu, Mar 04, 2010 at 10:41:43PM +0530, Balbir Singh wrote:
> * Andrea Righi <arighi@develer.com> [2010-03-04 11:40:11]:
> 
> > Control the maximum amount of dirty pages a cgroup can have at any given time.
> > 
> > Per cgroup dirty limit is like fixing the max amount of dirty (hard to reclaim)
> > page cache used by any cgroup. So, in case of multiple cgroup writers, they
> > will not be able to consume more than their designated share of dirty pages and
> > will be forced to perform write-out if they cross that limit.
> > 
> > The overall design is the following:
> > 
> >  - account dirty pages per cgroup
> >  - limit the number of dirty pages via memory.dirty_ratio / memory.dirty_bytes
> >    and memory.dirty_background_ratio / memory.dirty_background_bytes in
> >    cgroupfs
> >  - start to write-out (background or actively) when the cgroup limits are
> >    exceeded
> > 
> > This feature is supposed to be strictly connected to any underlying IO
> > controller implementation, so we can stop increasing dirty pages in VM layer
> > and enforce a write-out before any cgroup will consume the global amount of
> > dirty pages defined by the /proc/sys/vm/dirty_ratio|dirty_bytes and
> > /proc/sys/vm/dirty_background_ratio|dirty_background_bytes limits.
> > 
> > Changelog (v3 -> v4)
> > ~~~~~~~~~~~~~~~~~~~~~~
> >  * handle the migration of tasks across different cgroups
> >    NOTE: at the moment we don't move charges of file cache pages, so this
> >    functionality is not immediately necessary. However, since the migration of
> >    file cache pages is in plan, it is better to start handling file pages
> >    anyway.
> >  * properly account dirty pages in nilfs2
> >    (thanks to Kirill A. Shutemov <kirill@shutemov.name>)
> >  * lockless access to dirty memory parameters
> >  * fix: page_cgroup lock must not be acquired under mapping->tree_lock
> >    (thanks to Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> and
> >     KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>)
> >  * code restyling
> >
> 
> This seems to be converging, what sort of tests are you running on
> this patchset? 

A very simple test at the moment, just some parallel dd's running in
different cgroups. For example:

 - cgroup A: low dirty limits (writes are almost sync)
   echo 1000 > /cgroups/A/memory.dirty_bytes
   echo 1000 > /cgroups/A/memory.dirty_background_bytes

 - cgroup B: high dirty limits (writes are all buffered in page cache)
   echo 100 > /cgroups/B/memory.dirty_ratio
   echo 50  > /cgroups/B/memory.dirty_background_ratio

Then run the dd's and look at memory.stat:
  - cgroup A: # dd if=/dev/zero of=A bs=1M count=1000
  - cgroup B: # dd if=/dev/zero of=B bs=1M count=1000

A random snapshot during the writes:

# grep "dirty\|writeback" /cgroups/[AB]/memory.stat
/cgroups/A/memory.stat:filedirty 0
/cgroups/A/memory.stat:writeback 0
/cgroups/A/memory.stat:writeback_tmp 0
/cgroups/A/memory.stat:dirty_pages 0
/cgroups/A/memory.stat:writeback_pages 0
/cgroups/A/memory.stat:writeback_temp_pages 0
/cgroups/B/memory.stat:filedirty 67226
/cgroups/B/memory.stat:writeback 136
/cgroups/B/memory.stat:writeback_tmp 0
/cgroups/B/memory.stat:dirty_pages 67226
/cgroups/B/memory.stat:writeback_pages 136
/cgroups/B/memory.stat:writeback_temp_pages 0

I plan to run more detailed IO benchmarks soon.

-Andrea


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH -mmotm 4/4] memcg: dirty pages instrumentation
  2010-03-04 19:41       ` Vivek Goyal
  (?)
@ 2010-03-04 21:51           ` Andrea Righi
  -1 siblings, 0 replies; 68+ messages in thread
From: Andrea Righi @ 2010-03-04 21:51 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Trond Myklebust, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Suleiman Souhlal, Andrew Morton,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Balbir Singh

On Thu, Mar 04, 2010 at 02:41:44PM -0500, Vivek Goyal wrote:
> On Thu, Mar 04, 2010 at 11:40:15AM +0100, Andrea Righi wrote:
> 
> [..]
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index 5a0f8f3..c5d14ea 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -137,13 +137,16 @@ static struct prop_descriptor vm_dirties;
> >   */
> >  static int calc_period_shift(void)
> >  {
> > +	struct dirty_param dirty_param;
> >  	unsigned long dirty_total;
> >  
> > -	if (vm_dirty_bytes)
> > -		dirty_total = vm_dirty_bytes / PAGE_SIZE;
> > +	get_dirty_param(&dirty_param);
> > +
> > +	if (dirty_param.dirty_bytes)
> > +		dirty_total = dirty_param.dirty_bytes / PAGE_SIZE;
> >  	else
> > -		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> > -				100;
> > +		dirty_total = (dirty_param.dirty_ratio *
> > +				determine_dirtyable_memory()) / 100;
> >  	return 2 + ilog2(dirty_total - 1);
> >  }
> >  
> 
> Hmm.., I have been staring at this for some time and I think something is
> wrong. I don't fully understand the way floating proportions are working
> but this function seems to be calculating the period over which we need
> to measuer the proportions. (vm_completion proportion and vm_dirties
> proportions).
> 
> And we this period (shift), when admin updates dirty_ratio or dirty_bytes
> etc. In that case we recalculate the global dirty limit and take log2 and
> use that as period over which we monitor and calculate proportions.
> 
> If yes, then it should be global and not per cgroup (because all our 
> accouting of bdi completion is global and not per cgroup).
> 
> PeterZ, can tell us more about it. I am just raising the flag here to be
> sure.
> 
> Thanks
> Vivek

Hi Vivek,

I tend to agree: we must use the global dirty values here.

BTW, update_completion_period() is called from the dirty_* handlers, so
using the current memcg there makes no sense: that would be the memcg the
admin happens to be running in (probably the root memcg almost all the
time), which is wrong in principle. In conclusion, this patch shouldn't
touch calc_period_shift().

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH -mmotm 4/4] memcg: dirty pages instrumentation
@ 2010-03-04 21:51           ` Andrea Righi
  0 siblings, 0 replies; 68+ messages in thread
From: Andrea Righi @ 2010-03-04 21:51 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Peter Zijlstra, Trond Myklebust,
	Suleiman Souhlal, Greg Thelen, Daisuke Nishimura,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

On Thu, Mar 04, 2010 at 02:41:44PM -0500, Vivek Goyal wrote:
> On Thu, Mar 04, 2010 at 11:40:15AM +0100, Andrea Righi wrote:
> 
> [..]
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index 5a0f8f3..c5d14ea 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -137,13 +137,16 @@ static struct prop_descriptor vm_dirties;
> >   */
> >  static int calc_period_shift(void)
> >  {
> > +	struct dirty_param dirty_param;
> >  	unsigned long dirty_total;
> >  
> > -	if (vm_dirty_bytes)
> > -		dirty_total = vm_dirty_bytes / PAGE_SIZE;
> > +	get_dirty_param(&dirty_param);
> > +
> > +	if (dirty_param.dirty_bytes)
> > +		dirty_total = dirty_param.dirty_bytes / PAGE_SIZE;
> >  	else
> > -		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> > -				100;
> > +		dirty_total = (dirty_param.dirty_ratio *
> > +				determine_dirtyable_memory()) / 100;
> >  	return 2 + ilog2(dirty_total - 1);
> >  }
> >  
> 
> Hmm.., I have been staring at this for some time and I think something is
> wrong. I don't fully understand the way floating proportions are working
> but this function seems to be calculating the period over which we need
> to measuer the proportions. (vm_completion proportion and vm_dirties
> proportions).
> 
> And we this period (shift), when admin updates dirty_ratio or dirty_bytes
> etc. In that case we recalculate the global dirty limit and take log2 and
> use that as period over which we monitor and calculate proportions.
> 
> If yes, then it should be global and not per cgroup (because all our 
> accouting of bdi completion is global and not per cgroup).
> 
> PeterZ, can tell us more about it. I am just raising the flag here to be
> sure.
> 
> Thanks
> Vivek

Hi Vivek,

I tend to agree: we must use the global dirty values here.

BTW, update_completion_period() is called from the dirty_* handlers, so
using the current memcg there makes no sense: that would be the memcg the
admin happens to be running in (probably the root memcg almost all the
time), which is wrong in principle. In conclusion, this patch shouldn't
touch calc_period_shift().

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH -mmotm 4/4] memcg: dirty pages instrumentation
@ 2010-03-04 21:51           ` Andrea Righi
  0 siblings, 0 replies; 68+ messages in thread
From: Andrea Righi @ 2010-03-04 21:51 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Peter Zijlstra, Trond Myklebust,
	Suleiman Souhlal, Greg Thelen, Daisuke Nishimura,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

On Thu, Mar 04, 2010 at 02:41:44PM -0500, Vivek Goyal wrote:
> On Thu, Mar 04, 2010 at 11:40:15AM +0100, Andrea Righi wrote:
> 
> [..]
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index 5a0f8f3..c5d14ea 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -137,13 +137,16 @@ static struct prop_descriptor vm_dirties;
> >   */
> >  static int calc_period_shift(void)
> >  {
> > +	struct dirty_param dirty_param;
> >  	unsigned long dirty_total;
> >  
> > -	if (vm_dirty_bytes)
> > -		dirty_total = vm_dirty_bytes / PAGE_SIZE;
> > +	get_dirty_param(&dirty_param);
> > +
> > +	if (dirty_param.dirty_bytes)
> > +		dirty_total = dirty_param.dirty_bytes / PAGE_SIZE;
> >  	else
> > -		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> > -				100;
> > +		dirty_total = (dirty_param.dirty_ratio *
> > +				determine_dirtyable_memory()) / 100;
> >  	return 2 + ilog2(dirty_total - 1);
> >  }
> >  
> 
> Hmm.., I have been staring at this for some time and I think something is
> wrong. I don't fully understand the way floating proportions are working
> but this function seems to be calculating the period over which we need
> to measuer the proportions. (vm_completion proportion and vm_dirties
> proportions).
> 
> And we this period (shift), when admin updates dirty_ratio or dirty_bytes
> etc. In that case we recalculate the global dirty limit and take log2 and
> use that as period over which we monitor and calculate proportions.
> 
> If yes, then it should be global and not per cgroup (because all our 
> accouting of bdi completion is global and not per cgroup).
> 
> PeterZ, can tell us more about it. I am just raising the flag here to be
> sure.
> 
> Thanks
> Vivek

Hi Vivek,

I tend to agree: we must use the global dirty values here.

BTW, update_completion_period() is called from the dirty_* handlers, so
using the current memcg there makes no sense: that would be the memcg the
admin happens to be running in (probably the root memcg almost all the
time), which is wrong in principle. In conclusion, this patch shouldn't
touch calc_period_shift().

Thanks,
-Andrea


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure
       [not found]   ` <1267699215-4101-4-git-send-email-arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
  2010-03-04 11:54     ` Kirill A. Shutemov
@ 2010-03-05  1:12     ` Daisuke Nishimura
  1 sibling, 0 replies; 68+ messages in thread
From: Daisuke Nishimura @ 2010-03-05  1:12 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Trond Myklebust, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Suleiman Souhlal, Andrew Morton,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Vivek Goyal, Balbir Singh

On Thu,  4 Mar 2010 11:40:14 +0100, Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org> wrote:
> Infrastructure to account dirty pages per cgroup and add dirty limit
> interfaces in the cgroupfs:
> 
>  - Direct write-out: memory.dirty_ratio, memory.dirty_bytes
> 
>  - Background write-out: memory.dirty_background_ratio, memory.dirty_background_bytes
> 
> Signed-off-by: Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
> ---
>  include/linux/memcontrol.h |   80 ++++++++-
>  mm/memcontrol.c            |  420 +++++++++++++++++++++++++++++++++++++++-----
>  2 files changed, 450 insertions(+), 50 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 1f9b119..cc3421b 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -19,12 +19,66 @@
>  
>  #ifndef _LINUX_MEMCONTROL_H
>  #define _LINUX_MEMCONTROL_H
> +
> +#include <linux/writeback.h>
>  #include <linux/cgroup.h>
> +
>  struct mem_cgroup;
>  struct page_cgroup;
>  struct page;
>  struct mm_struct;
>  
> +/* Cgroup memory statistics items exported to the kernel */
> +enum mem_cgroup_page_stat_item {
> +	MEMCG_NR_DIRTYABLE_PAGES,
> +	MEMCG_NR_RECLAIM_PAGES,
> +	MEMCG_NR_WRITEBACK,
> +	MEMCG_NR_DIRTY_WRITEBACK_PAGES,
> +};
> +
> +/* Dirty memory parameters */
> +struct dirty_param {
> +	int dirty_ratio;
> +	unsigned long dirty_bytes;
> +	int dirty_background_ratio;
> +	unsigned long dirty_background_bytes;
> +};
> +
> +/*
> + * Statistics for memory cgroup.
> + */
> +enum mem_cgroup_stat_index {
> +	/*
> +	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> +	 */
> +	MEM_CGROUP_STAT_CACHE,	   /* # of pages charged as cache */
> +	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
> +	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> +	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
> +	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
> +	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> +	MEM_CGROUP_EVENTS,	/* incremented at every  pagein/pageout */
> +	MEM_CGROUP_STAT_FILE_DIRTY,   /* # of dirty pages in page cache */
> +	MEM_CGROUP_STAT_WRITEBACK,   /* # of pages under writeback */
> +	MEM_CGROUP_STAT_WRITEBACK_TEMP,   /* # of pages under writeback using
> +						temporary buffers */
> +	MEM_CGROUP_STAT_UNSTABLE_NFS,   /* # of NFS unstable pages */
> +
> +	MEM_CGROUP_STAT_NSTATS,
> +};
> +
I think I've said this before, but I don't think exporting all of these
flags is a good idea.
Can you export only mem_cgroup_page_stat_item (adding MEMCG_NR_FILE_MAPPED
to it, of course)? We can translate mem_cgroup_page_stat_item to
mem_cgroup_stat_index with simple arithmetic if
MEM_CGROUP_STAT_FILE_MAPPED..MEM_CGROUP_STAT_UNSTABLE_NFS are defined
sequentially.
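
Something like this is what I mean (only a rough sketch: the MEMCG_NR_*
names below are hypothetical, and it assumes the MEM_CGROUP_STAT_FILE_MAPPED..
MEM_CGROUP_STAT_UNSTABLE_NFS entries are reordered to be consecutive):

	/* exported items, kept in the same order as the MEM_CGROUP_STAT_*
	 * counters they map to */
	enum mem_cgroup_page_stat_item {
		MEMCG_NR_FILE_MAPPED,	/* would need to be added */
		MEMCG_NR_FILE_DIRTY,
		MEMCG_NR_WRITEBACK,
		MEMCG_NR_WRITEBACK_TEMP,
		MEMCG_NR_UNSTABLE_NFS,
	};

	static inline enum mem_cgroup_stat_index
	page_stat_to_stat_index(enum mem_cgroup_page_stat_item item)
	{
		/* valid only while the two enums stay in sync */
		return MEM_CGROUP_STAT_FILE_MAPPED + item;
	}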

> +/*
> + * TODO: provide a validation check routine. And retry if validation
> + * fails.
> + */
> +static inline void get_global_dirty_param(struct dirty_param *param)
> +{
> +	param->dirty_ratio = vm_dirty_ratio;
> +	param->dirty_bytes = vm_dirty_bytes;
> +	param->dirty_background_ratio = dirty_background_ratio;
> +	param->dirty_background_bytes = dirty_background_bytes;
> +}
> +
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
>  /*
>   * All "charge" functions with gfp_mask should use GFP_KERNEL or
> @@ -117,6 +171,10 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
>  extern int do_swap_account;
>  #endif
>  
> +extern bool mem_cgroup_has_dirty_limit(void);
> +extern void get_dirty_param(struct dirty_param *param);
> +extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item);
> +
>  static inline bool mem_cgroup_disabled(void)
>  {
>  	if (mem_cgroup_subsys.disabled)
> @@ -125,7 +183,8 @@ static inline bool mem_cgroup_disabled(void)
>  }
>  
>  extern bool mem_cgroup_oom_called(struct task_struct *task);
> -void mem_cgroup_update_file_mapped(struct page *page, int val);
> +void mem_cgroup_update_stat(struct page *page,
> +			enum mem_cgroup_stat_index idx, int val);
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  						gfp_t gfp_mask, int nid,
>  						int zid);
> @@ -300,8 +359,8 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
>  {
>  }
>  
> -static inline void mem_cgroup_update_file_mapped(struct page *page,
> -							int val)
> +static inline void mem_cgroup_update_stat(struct page *page,
> +			enum mem_cgroup_stat_index idx, int val)
>  {
>  }
>  
> @@ -312,6 +371,21 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  	return 0;
>  }
>  
> +static inline bool mem_cgroup_has_dirty_limit(void)
> +{
> +	return false;
> +}
> +
> +static inline void get_dirty_param(struct dirty_param *param)
> +{
> +	get_global_dirty_param(param);
> +}
> +
> +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> +{
> +	return -ENOSYS;
> +}
> +
>  #endif /* CONFIG_CGROUP_MEM_CONT */
>  
>  #endif /* _LINUX_MEMCONTROL_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 497b6f7..9842e7b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -73,28 +73,23 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
>  #define THRESHOLDS_EVENTS_THRESH (7) /* once in 128 */
>  #define SOFTLIMIT_EVENTS_THRESH (10) /* once in 1024 */
>  
> -/*
> - * Statistics for memory cgroup.
> - */
> -enum mem_cgroup_stat_index {
> -	/*
> -	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> -	 */
> -	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
> -	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
> -	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> -	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
> -	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
> -	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> -	MEM_CGROUP_EVENTS,	/* incremented at every  pagein/pageout */
> -
> -	MEM_CGROUP_STAT_NSTATS,
> -};
> -
>  struct mem_cgroup_stat_cpu {
>  	s64 count[MEM_CGROUP_STAT_NSTATS];
>  };
>  
> +/* Per cgroup page statistics */
> +struct mem_cgroup_page_stat {
> +	enum mem_cgroup_page_stat_item item;
> +	s64 value;
> +};
> +
> +enum {
> +	MEM_CGROUP_DIRTY_RATIO,
> +	MEM_CGROUP_DIRTY_BYTES,
> +	MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> +	MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> +};
> +
>  /*
>   * per-zone information in memory controller.
>   */
> @@ -208,6 +203,9 @@ struct mem_cgroup {
>  
>  	unsigned int	swappiness;
>  
> +	/* control memory cgroup dirty pages */
> +	struct dirty_param dirty_param;
> +
>  	/* set when res.limit == memsw.limit */
>  	bool		memsw_is_minimum;
>  
> @@ -1033,6 +1031,156 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
>  	return swappiness;
>  }
>  
> +static bool dirty_param_is_valid(struct dirty_param *param)
> +{
> +	if (param->dirty_ratio && param->dirty_bytes)
> +		return false;
> +	if (param->dirty_background_ratio && param->dirty_background_bytes)
> +		return false;
> +	return true;
> +}
> +
> +static void
> +__mem_cgroup_get_dirty_param(struct dirty_param *param, struct mem_cgroup *mem)
> +{
> +	param->dirty_ratio = mem->dirty_param.dirty_ratio;
> +	param->dirty_bytes = mem->dirty_param.dirty_bytes;
> +	param->dirty_background_ratio = mem->dirty_param.dirty_background_ratio;
> +	param->dirty_background_bytes = mem->dirty_param.dirty_background_bytes;
> +}
> +
> +/*
> + * get_dirty_param() - get dirty memory parameters of the current memcg
> + * @param:	a structure is filled with the dirty memory settings
> + *
> + * The function fills @param with the current memcg dirty memory settings. If
> + * memory cgroup is disabled or in case of error the structure is filled with
> + * the global dirty memory settings.
> + */
> +void get_dirty_param(struct dirty_param *param)
> +{
> +	struct mem_cgroup *memcg;
> +
> +	if (mem_cgroup_disabled()) {
> +		get_global_dirty_param(param);
> +		return;
> +	}
> +	/*
> +	 * It's possible that "current" may be moved to other cgroup while we
> +	 * access cgroup. But precise check is meaningless because the task can
> +	 * be moved after our access and writeback tends to take long time.
> +	 * At least, "memcg" will not be freed under rcu_read_lock().
> +	 */
> +	while (1) {
> +		rcu_read_lock();
> +		memcg = mem_cgroup_from_task(current);
> +		if (likely(memcg))
> +			__mem_cgroup_get_dirty_param(param, memcg);
> +		else
> +			get_global_dirty_param(param);
> +		rcu_read_unlock();
> +		/*
> +		 * Since global and memcg dirty_param are not protected we try
> +		 * to speculatively read them and retry if we get inconsistent
> +		 * values.
> +		 */
> +		if (likely(dirty_param_is_valid(param)))
> +			break;
> +	}
> +}
> +
> +static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
> +{
> +	if (!do_swap_account)
> +		return nr_swap_pages > 0;
> +	return !memcg->memsw_is_minimum &&
> +		(res_counter_read_u64(&memcg->memsw, RES_LIMIT) > 0);
> +}
> +
> +static s64 mem_cgroup_get_local_page_stat(struct mem_cgroup *memcg,
> +				enum mem_cgroup_page_stat_item item)
> +{
> +	s64 ret;
> +
> +	switch (item) {
> +	case MEMCG_NR_DIRTYABLE_PAGES:
> +		ret = res_counter_read_u64(&memcg->res, RES_LIMIT) -
> +			res_counter_read_u64(&memcg->res, RES_USAGE);
> +		/* Translate free memory in pages */
> +		ret >>= PAGE_SHIFT;
> +		ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_FILE) +
> +			mem_cgroup_read_stat(memcg, LRU_INACTIVE_FILE);
> +		if (mem_cgroup_can_swap(memcg))
> +			ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_ANON) +
> +				mem_cgroup_read_stat(memcg, LRU_INACTIVE_ANON);
> +		break;
> +	case MEMCG_NR_RECLAIM_PAGES:
> +		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_FILE_DIRTY) +
> +			mem_cgroup_read_stat(memcg,
> +					MEM_CGROUP_STAT_UNSTABLE_NFS);
> +		break;
> +	case MEMCG_NR_WRITEBACK:
> +		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
> +		break;
> +	case MEMCG_NR_DIRTY_WRITEBACK_PAGES:
> +		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK) +
> +			mem_cgroup_read_stat(memcg,
> +				MEM_CGROUP_STAT_UNSTABLE_NFS);
> +		break;
> +	default:
> +		BUG_ON(1);
> +	}
> +	return ret;
> +}
> +
> +static int mem_cgroup_page_stat_cb(struct mem_cgroup *mem, void *data)
> +{
> +	struct mem_cgroup_page_stat *stat = (struct mem_cgroup_page_stat *)data;
> +
> +	stat->value += mem_cgroup_get_local_page_stat(mem, stat->item);
> +	return 0;
> +}
> +
> +/*
> + * mem_cgroup_has_dirty_limit() - check if current memcg has local dirty limits
> + *
> + * Return true if the current memory cgroup has local dirty memory settings,
> + * false otherwise.
> + */
> +bool mem_cgroup_has_dirty_limit(void)
> +{
> +	if (mem_cgroup_disabled())
> +		return false;
> +	return mem_cgroup_from_task(current) != NULL;
> +}
> +
> +/*
> + * mem_cgroup_page_stat() - get memory cgroup file cache statistics
> + * @item:	memory statistic item exported to the kernel
> + *
> + * Return the accounted statistic value, or a negative value in case of error.
> + */
> +s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> +{
> +	struct mem_cgroup_page_stat stat = {};
> +	struct mem_cgroup *memcg;
> +
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (memcg) {
> +		/*
> +		 * Recursively evaulate page statistics against all cgroup
> +		 * under hierarchy tree
> +		 */
> +		stat.item = item;
> +		mem_cgroup_walk_tree(memcg, &stat, mem_cgroup_page_stat_cb);
> +	} else
> +		stat.value = -EINVAL;
> +	rcu_read_unlock();
> +
> +	return stat.value;
> +}
> +
>  static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data)
>  {
>  	int *val = data;
> @@ -1275,34 +1423,70 @@ static void record_last_oom(struct mem_cgroup *mem)
>  }
>  
>  /*
> - * Currently used to update mapped file statistics, but the routine can be
> - * generalized to update other statistics as well.
> + * Generalized routine to update file cache's status for memcg.
> + *
> + * Before calling this, mapping->tree_lock should be held and preemption is
> + * disabled.  Then, it's guarnteed that the page is not uncharged while we
> + * access page_cgroup. We can make use of that.
>   */
IIUC, mapping->tree_lock is held with irqs disabled, so I think "mapping->tree_lock
should be held with irqs disabled" would be enough.
And, as far as I can see, the callers of this function do not ensure this yet in [4/4].

how about:

	void mem_cgroup_update_stat_locked(...)
	{
		...
	}

	void mem_cgroup_update_stat_unlocked(mapping, ...)
	{
		spin_lock_irqsave(mapping->tree_lock, ...);
		mem_cgroup_update_stat_locked();
		spin_unlock_irqrestore(...);
	}
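
Fleshed out a little (still just a sketch; the exact argument list is up to
you, and the _locked body would be the current mem_cgroup_update_stat() code):

	void mem_cgroup_update_stat_locked(struct page *page,
				enum mem_cgroup_page_stat_item idx, int val)
	{
		/* caller holds mapping->tree_lock with irqs disabled, so the
		 * page cannot be uncharged while we update the counters */
		/* ... current body of mem_cgroup_update_stat() ... */
	}

	void mem_cgroup_update_stat_unlocked(struct address_space *mapping,
				struct page *page,
				enum mem_cgroup_page_stat_item idx, int val)
	{
		unsigned long flags;

		spin_lock_irqsave(&mapping->tree_lock, flags);
		mem_cgroup_update_stat_locked(page, idx, val);
		spin_unlock_irqrestore(&mapping->tree_lock, flags);
	}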

> -void mem_cgroup_update_file_mapped(struct page *page, int val)
> +void mem_cgroup_update_stat(struct page *page,
> +			enum mem_cgroup_stat_index idx, int val)
>  {
I prefer "void mem_cgroup_update_page_stat(struct page *, enum mem_cgroup_page_stat_item, ...)",
as I said above.

>  	struct mem_cgroup *mem;
>  	struct page_cgroup *pc;
>  
> +	if (mem_cgroup_disabled())
> +		return;
>  	pc = lookup_page_cgroup(page);
> -	if (unlikely(!pc))
> +	if (unlikely(!pc) || !PageCgroupUsed(pc))
>  		return;
>  
> -	lock_page_cgroup(pc);
> -	mem = pc->mem_cgroup;
> -	if (!mem)
> -		goto done;
> -
> -	if (!PageCgroupUsed(pc))
> -		goto done;
> -
> +	lock_page_cgroup_migrate(pc);
>  	/*
> -	 * Preemption is already disabled. We can use __this_cpu_xxx
> -	 */
> -	__this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], val);
> -
> -done:
> -	unlock_page_cgroup(pc);
> +	* It's guarnteed that this page is never uncharged.
> +	* The only racy problem is moving account among memcgs.
> +	*/
> +	switch (idx) {
> +	case MEM_CGROUP_STAT_FILE_MAPPED:
> +		if (val > 0)
> +			SetPageCgroupFileMapped(pc);
> +		else
> +			ClearPageCgroupFileMapped(pc);
> +		break;
> +	case MEM_CGROUP_STAT_FILE_DIRTY:
> +		if (val > 0)
> +			SetPageCgroupDirty(pc);
> +		else
> +			ClearPageCgroupDirty(pc);
> +		break;
> +	case MEM_CGROUP_STAT_WRITEBACK:
> +		if (val > 0)
> +			SetPageCgroupWriteback(pc);
> +		else
> +			ClearPageCgroupWriteback(pc);
> +		break;
> +	case MEM_CGROUP_STAT_WRITEBACK_TEMP:
> +		if (val > 0)
> +			SetPageCgroupWritebackTemp(pc);
> +		else
> +			ClearPageCgroupWritebackTemp(pc);
> +		break;
> +	case MEM_CGROUP_STAT_UNSTABLE_NFS:
> +		if (val > 0)
> +			SetPageCgroupUnstableNFS(pc);
> +		else
> +			ClearPageCgroupUnstableNFS(pc);
> +		break;
> +	default:
> +		BUG();
> +		break;
> +	}
> +	mem = pc->mem_cgroup;
> +	if (likely(mem))
> +		__this_cpu_add(mem->stat->count[idx], val);
> +	unlock_page_cgroup_migrate(pc);
>  }
> +EXPORT_SYMBOL_GPL(mem_cgroup_update_stat);
>  
>  /*
>   * size of first charge trial. "32" comes from vmscan.c's magic value.
> @@ -1701,6 +1885,45 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
>  	memcg_check_events(mem, pc->page);
>  }
>  
> +/*
> + * Update file cache accounted statistics on task migration.
> + *
> + * TODO: We don't move charges of file (including shmem/tmpfs) pages for now.
> + * So, at the moment this function simply returns without updating accounted
> + * statistics, because we deal only with anonymous pages here.
> + */
This function is not unique to task migration. It's called from rmdir() too.
So this comment isn't needed.

> +static void __mem_cgroup_update_file_stat(struct page_cgroup *pc,
> +	struct mem_cgroup *from, struct mem_cgroup *to)
> +{
> +	struct page *page = pc->page;
> +
> +	if (!page_mapped(page) || PageAnon(page))
> +		return;
> +
> +	if (PageCgroupFileMapped(pc)) {
> +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> +	}
> +	if (PageCgroupDirty(pc)) {
> +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_DIRTY]);
> +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_DIRTY]);
> +	}
> +	if (PageCgroupWriteback(pc)) {
> +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_WRITEBACK]);
> +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WRITEBACK]);
> +	}
> +	if (PageCgroupWritebackTemp(pc)) {
> +		__this_cpu_dec(
> +			from->stat->count[MEM_CGROUP_STAT_WRITEBACK_TEMP]);
> +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WRITEBACK_TEMP]);
> +	}
> +	if (PageCgroupUnstableNFS(pc)) {
> +		__this_cpu_dec(
> +			from->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
> +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
> +	}
> +}
> +
>  /**
>   * __mem_cgroup_move_account - move account of the page
>   * @pc:	page_cgroup of the page.
> @@ -1721,22 +1944,16 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
>  static void __mem_cgroup_move_account(struct page_cgroup *pc,
>  	struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
>  {
> -	struct page *page;
> -
>  	VM_BUG_ON(from == to);
>  	VM_BUG_ON(PageLRU(pc->page));
>  	VM_BUG_ON(!PageCgroupLocked(pc));
>  	VM_BUG_ON(!PageCgroupUsed(pc));
>  	VM_BUG_ON(pc->mem_cgroup != from);
>  
> -	page = pc->page;
> -	if (page_mapped(page) && !PageAnon(page)) {
> -		/* Update mapped_file data for mem_cgroup */
> -		preempt_disable();
> -		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -		preempt_enable();
> -	}
> +	preempt_disable();
> +	lock_page_cgroup_migrate(pc);
> +	__mem_cgroup_update_file_stat(pc, from, to);
> +
>  	mem_cgroup_charge_statistics(from, pc, false);
>  	if (uncharge)
>  		/* This is not "cancel", but cancel_charge does all we need. */
> @@ -1745,6 +1962,8 @@ static void __mem_cgroup_move_account(struct page_cgroup *pc,
>  	/* caller should have done css_get */
>  	pc->mem_cgroup = to;
>  	mem_cgroup_charge_statistics(to, pc, true);
> +	unlock_page_cgroup_migrate(pc);
> +	preempt_enable();
Glad to see this cleanup :)
But, hmm, I don't think preempt_disable()/enable() is enough (and
bit_spin_lock()/unlock() already disables preemption anyway).
lock/unlock_page_cgroup_migrate() can be called from irq context
(e.g. end_page_writeback()), so I think we must disable irqs with
local_irq_disable()/enable() here.
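
IOW, something along these lines in __mem_cgroup_move_account() (just a
sketch, keeping the rest of the body as it is; local_irq_save()/restore()
would be needed instead if the caller's irq state is not known):

	local_irq_disable();
	lock_page_cgroup_migrate(pc);
	__mem_cgroup_update_file_stat(pc, from, to);
	/* ... existing charge statistics and uncharge handling ... */
	unlock_page_cgroup_migrate(pc);
	local_irq_enable();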


Thanks,
Daisuke Nishimura.

>  	/*
>  	 * We charges against "to" which may not have any tasks. Then, "to"
>  	 * can be under rmdir(). But in current implementation, caller of
> @@ -3042,6 +3261,10 @@ enum {
>  	MCS_PGPGIN,
>  	MCS_PGPGOUT,
>  	MCS_SWAP,
> +	MCS_FILE_DIRTY,
> +	MCS_WRITEBACK,
> +	MCS_WRITEBACK_TEMP,
> +	MCS_UNSTABLE_NFS,
>  	MCS_INACTIVE_ANON,
>  	MCS_ACTIVE_ANON,
>  	MCS_INACTIVE_FILE,
> @@ -3064,6 +3287,10 @@ struct {
>  	{"pgpgin", "total_pgpgin"},
>  	{"pgpgout", "total_pgpgout"},
>  	{"swap", "total_swap"},
> +	{"filedirty", "dirty_pages"},
> +	{"writeback", "writeback_pages"},
> +	{"writeback_tmp", "writeback_temp_pages"},
> +	{"nfs", "nfs_unstable"},
>  	{"inactive_anon", "total_inactive_anon"},
>  	{"active_anon", "total_active_anon"},
>  	{"inactive_file", "total_inactive_file"},
> @@ -3092,6 +3319,14 @@ static int mem_cgroup_get_local_stat(struct mem_cgroup *mem, void *data)
>  		val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
>  		s->stat[MCS_SWAP] += val * PAGE_SIZE;
>  	}
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
> +	s->stat[MCS_FILE_DIRTY] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK);
> +	s->stat[MCS_WRITEBACK] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK_TEMP);
> +	s->stat[MCS_WRITEBACK_TEMP] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_UNSTABLE_NFS);
> +	s->stat[MCS_UNSTABLE_NFS] += val;
>  
>  	/* per zone stat */
>  	val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
> @@ -3453,6 +3688,60 @@ unlock:
>  	return ret;
>  }
>  
> +static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +
> +	switch (cft->private) {
> +	case MEM_CGROUP_DIRTY_RATIO:
> +		return memcg->dirty_param.dirty_ratio;
> +	case MEM_CGROUP_DIRTY_BYTES:
> +		return memcg->dirty_param.dirty_bytes;
> +	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
> +		return memcg->dirty_param.dirty_background_ratio;
> +	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
> +		return memcg->dirty_param.dirty_background_bytes;
> +	default:
> +		BUG();
> +	}
> +}
> +
> +static int
> +mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +	int type = cft->private;
> +
> +	if (cgrp->parent == NULL)
> +		return -EINVAL;
> +	if ((type == MEM_CGROUP_DIRTY_RATIO ||
> +		type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)
> +		return -EINVAL;
> +	/*
> +	 * TODO: provide a validation check routine. And retry if validation
> +	 * fails.
> +	 */
> +	switch (type) {
> +	case MEM_CGROUP_DIRTY_RATIO:
> +		memcg->dirty_param.dirty_ratio = val;
> +		memcg->dirty_param.dirty_bytes = 0;
> +		break;
> +	case MEM_CGROUP_DIRTY_BYTES:
> +		memcg->dirty_param.dirty_ratio  = 0;
> +		memcg->dirty_param.dirty_bytes = val;
> +		break;
> +	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
> +		memcg->dirty_param.dirty_background_ratio = val;
> +		memcg->dirty_param.dirty_background_bytes = 0;
> +		break;
> +	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
> +		memcg->dirty_param.dirty_background_ratio = 0;
> +		memcg->dirty_param.dirty_background_bytes = val;
> +		break;
> +	}
> +	return 0;
> +}
> +
>  static struct cftype mem_cgroup_files[] = {
>  	{
>  		.name = "usage_in_bytes",
> @@ -3504,6 +3793,30 @@ static struct cftype mem_cgroup_files[] = {
>  		.write_u64 = mem_cgroup_swappiness_write,
>  	},
>  	{
> +		.name = "dirty_ratio",
> +		.read_u64 = mem_cgroup_dirty_read,
> +		.write_u64 = mem_cgroup_dirty_write,
> +		.private = MEM_CGROUP_DIRTY_RATIO,
> +	},
> +	{
> +		.name = "dirty_bytes",
> +		.read_u64 = mem_cgroup_dirty_read,
> +		.write_u64 = mem_cgroup_dirty_write,
> +		.private = MEM_CGROUP_DIRTY_BYTES,
> +	},
> +	{
> +		.name = "dirty_background_ratio",
> +		.read_u64 = mem_cgroup_dirty_read,
> +		.write_u64 = mem_cgroup_dirty_write,
> +		.private = MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> +	},
> +	{
> +		.name = "dirty_background_bytes",
> +		.read_u64 = mem_cgroup_dirty_read,
> +		.write_u64 = mem_cgroup_dirty_write,
> +		.private = MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> +	},
> +	{
>  		.name = "move_charge_at_immigrate",
>  		.read_u64 = mem_cgroup_move_charge_read,
>  		.write_u64 = mem_cgroup_move_charge_write,
> @@ -3762,8 +4075,21 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  	mem->last_scanned_child = 0;
>  	spin_lock_init(&mem->reclaim_param_lock);
>  
> -	if (parent)
> +	if (parent) {
>  		mem->swappiness = get_swappiness(parent);
> +		mem->dirty_param = parent->dirty_param;
> +	} else {
> +		while (1) {
> +			get_global_dirty_param(&mem->dirty_param);
> +			/*
> +			 * Since global dirty parameters are not protected we
> +			 * try to speculatively read them and retry if we get
> +			 * inconsistent values.
> +			 */
> +			if (likely(dirty_param_is_valid(&mem->dirty_param)))
> +				break;
> +		}
> +	}
>  	atomic_set(&mem->refcnt, 1);
>  	mem->move_charge_at_immigrate = 0;
>  	mutex_init(&mem->thresholds_lock);
> -- 
> 1.6.3.3
> 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure
  2010-03-04 10:40   ` Andrea Righi
@ 2010-03-05  1:12     ` Daisuke Nishimura
  -1 siblings, 0 replies; 68+ messages in thread
From: Daisuke Nishimura @ 2010-03-05  1:12 UTC (permalink / raw)
  To: Andrea Righi
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Vivek Goyal, Peter Zijlstra,
	Trond Myklebust, Suleiman Souhlal, Greg Thelen,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm, Daisuke Nishimura

On Thu,  4 Mar 2010 11:40:14 +0100, Andrea Righi <arighi@develer.com> wrote:
> Infrastructure to account dirty pages per cgroup and add dirty limit
> interfaces in the cgroupfs:
> 
>  - Direct write-out: memory.dirty_ratio, memory.dirty_bytes
> 
>  - Background write-out: memory.dirty_background_ratio, memory.dirty_background_bytes
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>
> ---
>  include/linux/memcontrol.h |   80 ++++++++-
>  mm/memcontrol.c            |  420 +++++++++++++++++++++++++++++++++++++++-----
>  2 files changed, 450 insertions(+), 50 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 1f9b119..cc3421b 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -19,12 +19,66 @@
>  
>  #ifndef _LINUX_MEMCONTROL_H
>  #define _LINUX_MEMCONTROL_H
> +
> +#include <linux/writeback.h>
>  #include <linux/cgroup.h>
> +
>  struct mem_cgroup;
>  struct page_cgroup;
>  struct page;
>  struct mm_struct;
>  
> +/* Cgroup memory statistics items exported to the kernel */
> +enum mem_cgroup_page_stat_item {
> +	MEMCG_NR_DIRTYABLE_PAGES,
> +	MEMCG_NR_RECLAIM_PAGES,
> +	MEMCG_NR_WRITEBACK,
> +	MEMCG_NR_DIRTY_WRITEBACK_PAGES,
> +};
> +
> +/* Dirty memory parameters */
> +struct dirty_param {
> +	int dirty_ratio;
> +	unsigned long dirty_bytes;
> +	int dirty_background_ratio;
> +	unsigned long dirty_background_bytes;
> +};
> +
> +/*
> + * Statistics for memory cgroup.
> + */
> +enum mem_cgroup_stat_index {
> +	/*
> +	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> +	 */
> +	MEM_CGROUP_STAT_CACHE,	   /* # of pages charged as cache */
> +	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
> +	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> +	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
> +	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
> +	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> +	MEM_CGROUP_EVENTS,	/* incremented at every  pagein/pageout */
> +	MEM_CGROUP_STAT_FILE_DIRTY,   /* # of dirty pages in page cache */
> +	MEM_CGROUP_STAT_WRITEBACK,   /* # of pages under writeback */
> +	MEM_CGROUP_STAT_WRITEBACK_TEMP,   /* # of pages under writeback using
> +						temporary buffers */
> +	MEM_CGROUP_STAT_UNSTABLE_NFS,   /* # of NFS unstable pages */
> +
> +	MEM_CGROUP_STAT_NSTATS,
> +};
> +
I think I've said this before, but I don't think exporting all of these
flags is a good idea.
Can you export only mem_cgroup_page_stat_item (adding MEMCG_NR_FILE_MAPPED
to it, of course)? We can translate mem_cgroup_page_stat_item to
mem_cgroup_stat_index with simple arithmetic if
MEM_CGROUP_STAT_FILE_MAPPED..MEM_CGROUP_STAT_UNSTABLE_NFS are defined
sequentially.

> +/*
> + * TODO: provide a validation check routine. And retry if validation
> + * fails.
> + */
> +static inline void get_global_dirty_param(struct dirty_param *param)
> +{
> +	param->dirty_ratio = vm_dirty_ratio;
> +	param->dirty_bytes = vm_dirty_bytes;
> +	param->dirty_background_ratio = dirty_background_ratio;
> +	param->dirty_background_bytes = dirty_background_bytes;
> +}
> +
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
>  /*
>   * All "charge" functions with gfp_mask should use GFP_KERNEL or
> @@ -117,6 +171,10 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
>  extern int do_swap_account;
>  #endif
>  
> +extern bool mem_cgroup_has_dirty_limit(void);
> +extern void get_dirty_param(struct dirty_param *param);
> +extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item);
> +
>  static inline bool mem_cgroup_disabled(void)
>  {
>  	if (mem_cgroup_subsys.disabled)
> @@ -125,7 +183,8 @@ static inline bool mem_cgroup_disabled(void)
>  }
>  
>  extern bool mem_cgroup_oom_called(struct task_struct *task);
> -void mem_cgroup_update_file_mapped(struct page *page, int val);
> +void mem_cgroup_update_stat(struct page *page,
> +			enum mem_cgroup_stat_index idx, int val);
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  						gfp_t gfp_mask, int nid,
>  						int zid);
> @@ -300,8 +359,8 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
>  {
>  }
>  
> -static inline void mem_cgroup_update_file_mapped(struct page *page,
> -							int val)
> +static inline void mem_cgroup_update_stat(struct page *page,
> +			enum mem_cgroup_stat_index idx, int val)
>  {
>  }
>  
> @@ -312,6 +371,21 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  	return 0;
>  }
>  
> +static inline bool mem_cgroup_has_dirty_limit(void)
> +{
> +	return false;
> +}
> +
> +static inline void get_dirty_param(struct dirty_param *param)
> +{
> +	get_global_dirty_param(param);
> +}
> +
> +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> +{
> +	return -ENOSYS;
> +}
> +
>  #endif /* CONFIG_CGROUP_MEM_CONT */
>  
>  #endif /* _LINUX_MEMCONTROL_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 497b6f7..9842e7b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -73,28 +73,23 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
>  #define THRESHOLDS_EVENTS_THRESH (7) /* once in 128 */
>  #define SOFTLIMIT_EVENTS_THRESH (10) /* once in 1024 */
>  
> -/*
> - * Statistics for memory cgroup.
> - */
> -enum mem_cgroup_stat_index {
> -	/*
> -	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> -	 */
> -	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
> -	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
> -	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> -	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
> -	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
> -	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> -	MEM_CGROUP_EVENTS,	/* incremented at every  pagein/pageout */
> -
> -	MEM_CGROUP_STAT_NSTATS,
> -};
> -
>  struct mem_cgroup_stat_cpu {
>  	s64 count[MEM_CGROUP_STAT_NSTATS];
>  };
>  
> +/* Per cgroup page statistics */
> +struct mem_cgroup_page_stat {
> +	enum mem_cgroup_page_stat_item item;
> +	s64 value;
> +};
> +
> +enum {
> +	MEM_CGROUP_DIRTY_RATIO,
> +	MEM_CGROUP_DIRTY_BYTES,
> +	MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> +	MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> +};
> +
>  /*
>   * per-zone information in memory controller.
>   */
> @@ -208,6 +203,9 @@ struct mem_cgroup {
>  
>  	unsigned int	swappiness;
>  
> +	/* control memory cgroup dirty pages */
> +	struct dirty_param dirty_param;
> +
>  	/* set when res.limit == memsw.limit */
>  	bool		memsw_is_minimum;
>  
> @@ -1033,6 +1031,156 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
>  	return swappiness;
>  }
>  
> +static bool dirty_param_is_valid(struct dirty_param *param)
> +{
> +	if (param->dirty_ratio && param->dirty_bytes)
> +		return false;
> +	if (param->dirty_background_ratio && param->dirty_background_bytes)
> +		return false;
> +	return true;
> +}
> +
> +static void
> +__mem_cgroup_get_dirty_param(struct dirty_param *param, struct mem_cgroup *mem)
> +{
> +	param->dirty_ratio = mem->dirty_param.dirty_ratio;
> +	param->dirty_bytes = mem->dirty_param.dirty_bytes;
> +	param->dirty_background_ratio = mem->dirty_param.dirty_background_ratio;
> +	param->dirty_background_bytes = mem->dirty_param.dirty_background_bytes;
> +}
> +
> +/*
> + * get_dirty_param() - get dirty memory parameters of the current memcg
> + * @param:	a structure is filled with the dirty memory settings
> + *
> + * The function fills @param with the current memcg dirty memory settings. If
> + * memory cgroup is disabled or in case of error the structure is filled with
> + * the global dirty memory settings.
> + */
> +void get_dirty_param(struct dirty_param *param)
> +{
> +	struct mem_cgroup *memcg;
> +
> +	if (mem_cgroup_disabled()) {
> +		get_global_dirty_param(param);
> +		return;
> +	}
> +	/*
> +	 * It's possible that "current" may be moved to other cgroup while we
> +	 * access cgroup. But precise check is meaningless because the task can
> +	 * be moved after our access and writeback tends to take long time.
> +	 * At least, "memcg" will not be freed under rcu_read_lock().
> +	 */
> +	while (1) {
> +		rcu_read_lock();
> +		memcg = mem_cgroup_from_task(current);
> +		if (likely(memcg))
> +			__mem_cgroup_get_dirty_param(param, memcg);
> +		else
> +			get_global_dirty_param(param);
> +		rcu_read_unlock();
> +		/*
> +		 * Since global and memcg dirty_param are not protected we try
> +		 * to speculatively read them and retry if we get inconsistent
> +		 * values.
> +		 */
> +		if (likely(dirty_param_is_valid(param)))
> +			break;
> +	}
> +}
> +
> +static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
> +{
> +	if (!do_swap_account)
> +		return nr_swap_pages > 0;
> +	return !memcg->memsw_is_minimum &&
> +		(res_counter_read_u64(&memcg->memsw, RES_LIMIT) > 0);
> +}
> +
> +static s64 mem_cgroup_get_local_page_stat(struct mem_cgroup *memcg,
> +				enum mem_cgroup_page_stat_item item)
> +{
> +	s64 ret;
> +
> +	switch (item) {
> +	case MEMCG_NR_DIRTYABLE_PAGES:
> +		ret = res_counter_read_u64(&memcg->res, RES_LIMIT) -
> +			res_counter_read_u64(&memcg->res, RES_USAGE);
> +		/* Translate free memory in pages */
> +		ret >>= PAGE_SHIFT;
> +		ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_FILE) +
> +			mem_cgroup_read_stat(memcg, LRU_INACTIVE_FILE);
> +		if (mem_cgroup_can_swap(memcg))
> +			ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_ANON) +
> +				mem_cgroup_read_stat(memcg, LRU_INACTIVE_ANON);
> +		break;
> +	case MEMCG_NR_RECLAIM_PAGES:
> +		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_FILE_DIRTY) +
> +			mem_cgroup_read_stat(memcg,
> +					MEM_CGROUP_STAT_UNSTABLE_NFS);
> +		break;
> +	case MEMCG_NR_WRITEBACK:
> +		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
> +		break;
> +	case MEMCG_NR_DIRTY_WRITEBACK_PAGES:
> +		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK) +
> +			mem_cgroup_read_stat(memcg,
> +				MEM_CGROUP_STAT_UNSTABLE_NFS);
> +		break;
> +	default:
> +		BUG_ON(1);
> +	}
> +	return ret;
> +}
> +
> +static int mem_cgroup_page_stat_cb(struct mem_cgroup *mem, void *data)
> +{
> +	struct mem_cgroup_page_stat *stat = (struct mem_cgroup_page_stat *)data;
> +
> +	stat->value += mem_cgroup_get_local_page_stat(mem, stat->item);
> +	return 0;
> +}
> +
> +/*
> + * mem_cgroup_has_dirty_limit() - check if current memcg has local dirty limits
> + *
> + * Return true if the current memory cgroup has local dirty memory settings,
> + * false otherwise.
> + */
> +bool mem_cgroup_has_dirty_limit(void)
> +{
> +	if (mem_cgroup_disabled())
> +		return false;
> +	return mem_cgroup_from_task(current) != NULL;
> +}
> +
> +/*
> + * mem_cgroup_page_stat() - get memory cgroup file cache statistics
> + * @item:	memory statistic item exported to the kernel
> + *
> + * Return the accounted statistic value, or a negative value in case of error.
> + */
> +s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> +{
> +	struct mem_cgroup_page_stat stat = {};
> +	struct mem_cgroup *memcg;
> +
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (memcg) {
> +		/*
> +		 * Recursively evaulate page statistics against all cgroup
> +		 * under hierarchy tree
> +		 */
> +		stat.item = item;
> +		mem_cgroup_walk_tree(memcg, &stat, mem_cgroup_page_stat_cb);
> +	} else
> +		stat.value = -EINVAL;
> +	rcu_read_unlock();
> +
> +	return stat.value;
> +}
> +
>  static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data)
>  {
>  	int *val = data;
> @@ -1275,34 +1423,70 @@ static void record_last_oom(struct mem_cgroup *mem)
>  }
>  
>  /*
> - * Currently used to update mapped file statistics, but the routine can be
> - * generalized to update other statistics as well.
> + * Generalized routine to update file cache's status for memcg.
> + *
> + * Before calling this, mapping->tree_lock should be held and preemption is
> + * disabled.  Then, it's guarnteed that the page is not uncharged while we
> + * access page_cgroup. We can make use of that.
>   */
IIUC, mapping->tree_lock is held with irq disabled, so I think saying "mapping->tree_lock
should be held with irq disabled" would be enough.
And, as far as I can see, callers of this function have not ensured this yet in [4/4].

how about:

	void mem_cgroup_update_stat_locked(...)
	{
		...
	}

	void mem_cgroup_update_stat_unlocked(mapping, ...)
	{
		spin_lock_irqsave(mapping->tree_lock, ...);
		mem_cgroup_update_stat_locked();
		spin_unlock_irqrestore(...);
	}

> -void mem_cgroup_update_file_mapped(struct page *page, int val)
> +void mem_cgroup_update_stat(struct page *page,
> +			enum mem_cgroup_stat_index idx, int val)
>  {
I preffer "void mem_cgroup_update_page_stat(struct page *, enum mem_cgroup_page_stat_item, ..)"
as I said above.
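i.e. keep enum mem_cgroup_stat_index private to memcontrol.c and translate at the
entry point, something like (just a sketch; page_stat_to_stat_index() is the
hypothetical helper from my note above):

	void mem_cgroup_update_page_stat(struct page *page,
			enum mem_cgroup_page_stat_item idx, int val)
	{
		enum mem_cgroup_stat_index i = page_stat_to_stat_index(idx);

		/* body as in mem_cgroup_update_stat() in this patch, using 'i' */
	}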

>  	struct mem_cgroup *mem;
>  	struct page_cgroup *pc;
>  
> +	if (mem_cgroup_disabled())
> +		return;
>  	pc = lookup_page_cgroup(page);
> -	if (unlikely(!pc))
> +	if (unlikely(!pc) || !PageCgroupUsed(pc))
>  		return;
>  
> -	lock_page_cgroup(pc);
> -	mem = pc->mem_cgroup;
> -	if (!mem)
> -		goto done;
> -
> -	if (!PageCgroupUsed(pc))
> -		goto done;
> -
> +	lock_page_cgroup_migrate(pc);
>  	/*
> -	 * Preemption is already disabled. We can use __this_cpu_xxx
> -	 */
> -	__this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], val);
> -
> -done:
> -	unlock_page_cgroup(pc);
> +	* It's guarnteed that this page is never uncharged.
> +	* The only racy problem is moving account among memcgs.
> +	*/
> +	switch (idx) {
> +	case MEM_CGROUP_STAT_FILE_MAPPED:
> +		if (val > 0)
> +			SetPageCgroupFileMapped(pc);
> +		else
> +			ClearPageCgroupFileMapped(pc);
> +		break;
> +	case MEM_CGROUP_STAT_FILE_DIRTY:
> +		if (val > 0)
> +			SetPageCgroupDirty(pc);
> +		else
> +			ClearPageCgroupDirty(pc);
> +		break;
> +	case MEM_CGROUP_STAT_WRITEBACK:
> +		if (val > 0)
> +			SetPageCgroupWriteback(pc);
> +		else
> +			ClearPageCgroupWriteback(pc);
> +		break;
> +	case MEM_CGROUP_STAT_WRITEBACK_TEMP:
> +		if (val > 0)
> +			SetPageCgroupWritebackTemp(pc);
> +		else
> +			ClearPageCgroupWritebackTemp(pc);
> +		break;
> +	case MEM_CGROUP_STAT_UNSTABLE_NFS:
> +		if (val > 0)
> +			SetPageCgroupUnstableNFS(pc);
> +		else
> +			ClearPageCgroupUnstableNFS(pc);
> +		break;
> +	default:
> +		BUG();
> +		break;
> +	}
> +	mem = pc->mem_cgroup;
> +	if (likely(mem))
> +		__this_cpu_add(mem->stat->count[idx], val);
> +	unlock_page_cgroup_migrate(pc);
>  }
> +EXPORT_SYMBOL_GPL(mem_cgroup_update_stat);
>  
>  /*
>   * size of first charge trial. "32" comes from vmscan.c's magic value.
> @@ -1701,6 +1885,45 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
>  	memcg_check_events(mem, pc->page);
>  }
>  
> +/*
> + * Update file cache accounted statistics on task migration.
> + *
> + * TODO: We don't move charges of file (including shmem/tmpfs) pages for now.
> + * So, at the moment this function simply returns without updating accounted
> + * statistics, because we deal only with anonymous pages here.
> + */
This function is not unique to task migration. It's called from rmdir() too.
So this comment isn't needed.

> +static void __mem_cgroup_update_file_stat(struct page_cgroup *pc,
> +	struct mem_cgroup *from, struct mem_cgroup *to)
> +{
> +	struct page *page = pc->page;
> +
> +	if (!page_mapped(page) || PageAnon(page))
> +		return;
> +
> +	if (PageCgroupFileMapped(pc)) {
> +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> +	}
> +	if (PageCgroupDirty(pc)) {
> +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_DIRTY]);
> +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_DIRTY]);
> +	}
> +	if (PageCgroupWriteback(pc)) {
> +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_WRITEBACK]);
> +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WRITEBACK]);
> +	}
> +	if (PageCgroupWritebackTemp(pc)) {
> +		__this_cpu_dec(
> +			from->stat->count[MEM_CGROUP_STAT_WRITEBACK_TEMP]);
> +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WRITEBACK_TEMP]);
> +	}
> +	if (PageCgroupUnstableNFS(pc)) {
> +		__this_cpu_dec(
> +			from->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
> +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
> +	}
> +}
> +
>  /**
>   * __mem_cgroup_move_account - move account of the page
>   * @pc:	page_cgroup of the page.
> @@ -1721,22 +1944,16 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
>  static void __mem_cgroup_move_account(struct page_cgroup *pc,
>  	struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
>  {
> -	struct page *page;
> -
>  	VM_BUG_ON(from == to);
>  	VM_BUG_ON(PageLRU(pc->page));
>  	VM_BUG_ON(!PageCgroupLocked(pc));
>  	VM_BUG_ON(!PageCgroupUsed(pc));
>  	VM_BUG_ON(pc->mem_cgroup != from);
>  
> -	page = pc->page;
> -	if (page_mapped(page) && !PageAnon(page)) {
> -		/* Update mapped_file data for mem_cgroup */
> -		preempt_disable();
> -		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -		preempt_enable();
> -	}
> +	preempt_disable();
> +	lock_page_cgroup_migrate(pc);
> +	__mem_cgroup_update_file_stat(pc, from, to);
> +
>  	mem_cgroup_charge_statistics(from, pc, false);
>  	if (uncharge)
>  		/* This is not "cancel", but cancel_charge does all we need. */
> @@ -1745,6 +1962,8 @@ static void __mem_cgroup_move_account(struct page_cgroup *pc,
>  	/* caller should have done css_get */
>  	pc->mem_cgroup = to;
>  	mem_cgroup_charge_statistics(to, pc, true);
> +	unlock_page_cgroup_migrate(pc);
> +	preempt_enable();
Glad to see this cleanup :)
But, hmm, I don't think preempt_disable/enable() is enough (and bit_spin_lock/unlock()
does that anyway). lock/unlock_page_cgroup_migrate() can be called from irq context
(e.g. end_page_writeback()), so I think we must use local_irq_disable()/enable() here.
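IOW, something like this in __mem_cgroup_move_account() (only a sketch, with the
uncharge/css handling abbreviated as a comment):

	unsigned long flags;

	local_irq_save(flags);	/* also disables preemption */
	lock_page_cgroup_migrate(pc);
	__mem_cgroup_update_file_stat(pc, from, to);
	mem_cgroup_charge_statistics(from, pc, false);
	/* uncharge and css handling unchanged from this patch */
	pc->mem_cgroup = to;
	mem_cgroup_charge_statistics(to, pc, true);
	unlock_page_cgroup_migrate(pc);
	local_irq_restore(flags);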


Thanks,
Daisuke Nishimura.

>  	/*
>  	 * We charges against "to" which may not have any tasks. Then, "to"
>  	 * can be under rmdir(). But in current implementation, caller of
> @@ -3042,6 +3261,10 @@ enum {
>  	MCS_PGPGIN,
>  	MCS_PGPGOUT,
>  	MCS_SWAP,
> +	MCS_FILE_DIRTY,
> +	MCS_WRITEBACK,
> +	MCS_WRITEBACK_TEMP,
> +	MCS_UNSTABLE_NFS,
>  	MCS_INACTIVE_ANON,
>  	MCS_ACTIVE_ANON,
>  	MCS_INACTIVE_FILE,
> @@ -3064,6 +3287,10 @@ struct {
>  	{"pgpgin", "total_pgpgin"},
>  	{"pgpgout", "total_pgpgout"},
>  	{"swap", "total_swap"},
> +	{"filedirty", "dirty_pages"},
> +	{"writeback", "writeback_pages"},
> +	{"writeback_tmp", "writeback_temp_pages"},
> +	{"nfs", "nfs_unstable"},
>  	{"inactive_anon", "total_inactive_anon"},
>  	{"active_anon", "total_active_anon"},
>  	{"inactive_file", "total_inactive_file"},
> @@ -3092,6 +3319,14 @@ static int mem_cgroup_get_local_stat(struct mem_cgroup *mem, void *data)
>  		val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_SWAPOUT);
>  		s->stat[MCS_SWAP] += val * PAGE_SIZE;
>  	}
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
> +	s->stat[MCS_FILE_DIRTY] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK);
> +	s->stat[MCS_WRITEBACK] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_WRITEBACK_TEMP);
> +	s->stat[MCS_WRITEBACK_TEMP] += val;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_UNSTABLE_NFS);
> +	s->stat[MCS_UNSTABLE_NFS] += val;
>  
>  	/* per zone stat */
>  	val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
> @@ -3453,6 +3688,60 @@ unlock:
>  	return ret;
>  }
>  
> +static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +
> +	switch (cft->private) {
> +	case MEM_CGROUP_DIRTY_RATIO:
> +		return memcg->dirty_param.dirty_ratio;
> +	case MEM_CGROUP_DIRTY_BYTES:
> +		return memcg->dirty_param.dirty_bytes;
> +	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
> +		return memcg->dirty_param.dirty_background_ratio;
> +	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
> +		return memcg->dirty_param.dirty_background_bytes;
> +	default:
> +		BUG();
> +	}
> +}
> +
> +static int
> +mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +	int type = cft->private;
> +
> +	if (cgrp->parent == NULL)
> +		return -EINVAL;
> +	if ((type == MEM_CGROUP_DIRTY_RATIO ||
> +		type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)
> +		return -EINVAL;
> +	/*
> +	 * TODO: provide a validation check routine. And retry if validation
> +	 * fails.
> +	 */
> +	switch (type) {
> +	case MEM_CGROUP_DIRTY_RATIO:
> +		memcg->dirty_param.dirty_ratio = val;
> +		memcg->dirty_param.dirty_bytes = 0;
> +		break;
> +	case MEM_CGROUP_DIRTY_BYTES:
> +		memcg->dirty_param.dirty_ratio  = 0;
> +		memcg->dirty_param.dirty_bytes = val;
> +		break;
> +	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
> +		memcg->dirty_param.dirty_background_ratio = val;
> +		memcg->dirty_param.dirty_background_bytes = 0;
> +		break;
> +	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
> +		memcg->dirty_param.dirty_background_ratio = 0;
> +		memcg->dirty_param.dirty_background_bytes = val;
> +		break;
> +	}
> +	return 0;
> +}
> +
>  static struct cftype mem_cgroup_files[] = {
>  	{
>  		.name = "usage_in_bytes",
> @@ -3504,6 +3793,30 @@ static struct cftype mem_cgroup_files[] = {
>  		.write_u64 = mem_cgroup_swappiness_write,
>  	},
>  	{
> +		.name = "dirty_ratio",
> +		.read_u64 = mem_cgroup_dirty_read,
> +		.write_u64 = mem_cgroup_dirty_write,
> +		.private = MEM_CGROUP_DIRTY_RATIO,
> +	},
> +	{
> +		.name = "dirty_bytes",
> +		.read_u64 = mem_cgroup_dirty_read,
> +		.write_u64 = mem_cgroup_dirty_write,
> +		.private = MEM_CGROUP_DIRTY_BYTES,
> +	},
> +	{
> +		.name = "dirty_background_ratio",
> +		.read_u64 = mem_cgroup_dirty_read,
> +		.write_u64 = mem_cgroup_dirty_write,
> +		.private = MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> +	},
> +	{
> +		.name = "dirty_background_bytes",
> +		.read_u64 = mem_cgroup_dirty_read,
> +		.write_u64 = mem_cgroup_dirty_write,
> +		.private = MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> +	},
> +	{
>  		.name = "move_charge_at_immigrate",
>  		.read_u64 = mem_cgroup_move_charge_read,
>  		.write_u64 = mem_cgroup_move_charge_write,
> @@ -3762,8 +4075,21 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  	mem->last_scanned_child = 0;
>  	spin_lock_init(&mem->reclaim_param_lock);
>  
> -	if (parent)
> +	if (parent) {
>  		mem->swappiness = get_swappiness(parent);
> +		mem->dirty_param = parent->dirty_param;
> +	} else {
> +		while (1) {
> +			get_global_dirty_param(&mem->dirty_param);
> +			/*
> +			 * Since global dirty parameters are not protected we
> +			 * try to speculatively read them and retry if we get
> +			 * inconsistent values.
> +			 */
> +			if (likely(dirty_param_is_valid(&mem->dirty_param)))
> +				break;
> +		}
> +	}
>  	atomic_set(&mem->refcnt, 1);
>  	mem->move_charge_at_immigrate = 0;
>  	mutex_init(&mem->thresholds_lock);
> -- 
> 1.6.3.3
> 

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure
  2010-03-05  1:12     ` Daisuke Nishimura
@ 2010-03-05  1:58       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 68+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-05  1:58 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: Andrea Righi, Balbir Singh, Vivek Goyal, Peter Zijlstra,
	Trond Myklebust, Suleiman Souhlal, Greg Thelen,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

On Fri, 5 Mar 2010 10:12:34 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Thu,  4 Mar 2010 11:40:14 +0100, Andrea Righi <arighi@develer.com> wrote:
> > Infrastructure to account dirty pages per cgroup and add dirty limit
> >  static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data)
> >  {
> >  	int *val = data;
> > @@ -1275,34 +1423,70 @@ static void record_last_oom(struct mem_cgroup *mem)
> >  }
> >  
> >  /*
> > - * Currently used to update mapped file statistics, but the routine can be
> > - * generalized to update other statistics as well.
> > + * Generalized routine to update file cache's status for memcg.
> > + *
> > + * Before calling this, mapping->tree_lock should be held and preemption is
> > + * disabled.  Then, it's guarnteed that the page is not uncharged while we
> > + * access page_cgroup. We can make use of that.
> >   */
> IIUC, mapping->tree_lock is held with irq disabled, so I think saying "mapping->tree_lock
> should be held with irq disabled" would be enough.
> And, as far as I can see, callers of this function have not ensured this yet in [4/4].
> 
> how about:
> 
> 	void mem_cgroup_update_stat_locked(...)
> 	{
> 		...
> 	}
> 
> 	void mem_cgroup_update_stat_unlocked(mapping, ...)
> 	{
> 		spin_lock_irqsave(mapping->tree_lock, ...);
> 		mem_cgroup_update_stat_locked();
> 		spin_unlock_irqrestore(...);
> 	}
>
Rather than tree_lock, lock_page_cgroup() can be used if tree_lock is not held.

		lock_page_cgroup(pc);
		mem_cgroup_update_stat_locked();
		unlock_page_cgroup(pc);

Andrea-san, FILE_MAPPED, at least, is updated without tree_lock held. You can't depend
on the migration lock for FILE_MAPPED.
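For FILE_MAPPED that would be roughly (a sketch only -- basically what the old
mem_cgroup_update_file_mapped() did, plus the new page_cgroup flag):

	/* rmap caller holds no tree_lock, so use the page_cgroup lock */
	lock_page_cgroup(pc);
	if (PageCgroupUsed(pc)) {
		if (val > 0)
			SetPageCgroupFileMapped(pc);
		else
			ClearPageCgroupFileMapped(pc);
		__this_cpu_add(pc->mem_cgroup->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
				val);
	}
	unlock_page_cgroup(pc);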


 
> > -void mem_cgroup_update_file_mapped(struct page *page, int val)
> > +void mem_cgroup_update_stat(struct page *page,
> > +			enum mem_cgroup_stat_index idx, int val)
> >  {
> I preffer "void mem_cgroup_update_page_stat(struct page *, enum mem_cgroup_page_stat_item, ..)"
> as I said above.
> 
> >  	struct mem_cgroup *mem;
> >  	struct page_cgroup *pc;
> >  
> > +	if (mem_cgroup_disabled())
> > +		return;
> >  	pc = lookup_page_cgroup(page);
> > -	if (unlikely(!pc))
> > +	if (unlikely(!pc) || !PageCgroupUsed(pc))
> >  		return;
> >  
> > -	lock_page_cgroup(pc);
> > -	mem = pc->mem_cgroup;
> > -	if (!mem)
> > -		goto done;
> > -
> > -	if (!PageCgroupUsed(pc))
> > -		goto done;
> > -
> > +	lock_page_cgroup_migrate(pc);
> >  	/*
> > -	 * Preemption is already disabled. We can use __this_cpu_xxx
> > -	 */
> > -	__this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], val);
> > -
> > -done:
> > -	unlock_page_cgroup(pc);
> > +	* It's guarnteed that this page is never uncharged.
> > +	* The only racy problem is moving account among memcgs.
> > +	*/
> > +	switch (idx) {
> > +	case MEM_CGROUP_STAT_FILE_MAPPED:
> > +		if (val > 0)
> > +			SetPageCgroupFileMapped(pc);
> > +		else
> > +			ClearPageCgroupFileMapped(pc);
> > +		break;
> > +	case MEM_CGROUP_STAT_FILE_DIRTY:
> > +		if (val > 0)
> > +			SetPageCgroupDirty(pc);
> > +		else
> > +			ClearPageCgroupDirty(pc);
> > +		break;
> > +	case MEM_CGROUP_STAT_WRITEBACK:
> > +		if (val > 0)
> > +			SetPageCgroupWriteback(pc);
> > +		else
> > +			ClearPageCgroupWriteback(pc);
> > +		break;
> > +	case MEM_CGROUP_STAT_WRITEBACK_TEMP:
> > +		if (val > 0)
> > +			SetPageCgroupWritebackTemp(pc);
> > +		else
> > +			ClearPageCgroupWritebackTemp(pc);
> > +		break;
> > +	case MEM_CGROUP_STAT_UNSTABLE_NFS:
> > +		if (val > 0)
> > +			SetPageCgroupUnstableNFS(pc);
> > +		else
> > +			ClearPageCgroupUnstableNFS(pc);
> > +		break;
> > +	default:
> > +		BUG();
> > +		break;
> > +	}
> > +	mem = pc->mem_cgroup;
> > +	if (likely(mem))
> > +		__this_cpu_add(mem->stat->count[idx], val);
> > +	unlock_page_cgroup_migrate(pc);
> >  }
> > +EXPORT_SYMBOL_GPL(mem_cgroup_update_stat);
> >  
> >  /*
> >   * size of first charge trial. "32" comes from vmscan.c's magic value.
> > @@ -1701,6 +1885,45 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
> >  	memcg_check_events(mem, pc->page);
> >  }
> >  
> > +/*
> > + * Update file cache accounted statistics on task migration.
> > + *
> > + * TODO: We don't move charges of file (including shmem/tmpfs) pages for now.
> > + * So, at the moment this function simply returns without updating accounted
> > + * statistics, because we deal only with anonymous pages here.
> > + */
> This function is not unique to task migration. It's called from rmdir() too.
> So this comment isn't needed.
> 
> > +static void __mem_cgroup_update_file_stat(struct page_cgroup *pc,
> > +	struct mem_cgroup *from, struct mem_cgroup *to)
> > +{
> > +	struct page *page = pc->page;
> > +
> > +	if (!page_mapped(page) || PageAnon(page))
> > +		return;
> > +
> > +	if (PageCgroupFileMapped(pc)) {
> > +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > +	}
> > +	if (PageCgroupDirty(pc)) {
> > +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_DIRTY]);
> > +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_DIRTY]);
> > +	}
> > +	if (PageCgroupWriteback(pc)) {
> > +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_WRITEBACK]);
> > +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WRITEBACK]);
> > +	}
> > +	if (PageCgroupWritebackTemp(pc)) {
> > +		__this_cpu_dec(
> > +			from->stat->count[MEM_CGROUP_STAT_WRITEBACK_TEMP]);
> > +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WRITEBACK_TEMP]);
> > +	}
> > +	if (PageCgroupUnstableNFS(pc)) {
> > +		__this_cpu_dec(
> > +			from->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
> > +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
> > +	}
> > +}
> > +
> >  /**
> >   * __mem_cgroup_move_account - move account of the page
> >   * @pc:	page_cgroup of the page.
> > @@ -1721,22 +1944,16 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
> >  static void __mem_cgroup_move_account(struct page_cgroup *pc,
> >  	struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
> >  {
> > -	struct page *page;
> > -
> >  	VM_BUG_ON(from == to);
> >  	VM_BUG_ON(PageLRU(pc->page));
> >  	VM_BUG_ON(!PageCgroupLocked(pc));
> >  	VM_BUG_ON(!PageCgroupUsed(pc));
> >  	VM_BUG_ON(pc->mem_cgroup != from);
> >  
> > -	page = pc->page;
> > -	if (page_mapped(page) && !PageAnon(page)) {
> > -		/* Update mapped_file data for mem_cgroup */
> > -		preempt_disable();
> > -		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > -		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > -		preempt_enable();
> > -	}
> > +	preempt_disable();
> > +	lock_page_cgroup_migrate(pc);
> > +	__mem_cgroup_update_file_stat(pc, from, to);
> > +
> >  	mem_cgroup_charge_statistics(from, pc, false);
> >  	if (uncharge)
> >  		/* This is not "cancel", but cancel_charge does all we need. */
> > @@ -1745,6 +1962,8 @@ static void __mem_cgroup_move_account(struct page_cgroup *pc,
> >  	/* caller should have done css_get */
> >  	pc->mem_cgroup = to;
> >  	mem_cgroup_charge_statistics(to, pc, true);
> > +	unlock_page_cgroup_migrate(pc);
> > +	preempt_enable();
> Glad to see this cleanup :)
> But, hmm, I don't think preempt_disable/enable() is enough (and bit_spin_lock/unlock()
> does it anyway). lock/unlock_page_cgroup_migrate() can be called in irq context
> (e.g. end_page_writeback()), so I think we must use local_irq_disable()/enable() here.
> 
Ah, hmm, yes. irq-disable is required.

Thanks,
-Kame

>


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure
@ 2010-03-05  1:58       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 68+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-05  1:58 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: Andrea Righi, Balbir Singh, Vivek Goyal, Peter Zijlstra,
	Trond Myklebust, Suleiman Souhlal, Greg Thelen,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

On Fri, 5 Mar 2010 10:12:34 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Thu,  4 Mar 2010 11:40:14 +0100, Andrea Righi <arighi@develer.com> wrote:
> > Infrastructure to account dirty pages per cgroup and add dirty limit
> >  static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data)
> >  {
> >  	int *val = data;
> > @@ -1275,34 +1423,70 @@ static void record_last_oom(struct mem_cgroup *mem)
> >  }
> >  
> >  /*
> > - * Currently used to update mapped file statistics, but the routine can be
> > - * generalized to update other statistics as well.
> > + * Generalized routine to update file cache's status for memcg.
> > + *
> > + * Before calling this, mapping->tree_lock should be held and preemption is
> > + * disabled.  Then, it's guaranteed that the page is not uncharged while we
> > + * access page_cgroup. We can make use of that.
> >   */
> IIUC, mapping->tree_lock is held with irq disabled, so I think "mapping->tree_lock
> should be held with irq disabled" would be enough.
> And, as far as I can see, callers of this function have not ensured this yet in [4/4].
> 
> how about:
> 
> 	void mem_cgroup_update_stat_locked(...)
> 	{
> 		...
> 	}
> 
> 	void mem_cgroup_update_stat_unlocked(mapping, ...)
> 	{
> 		spin_lock_irqsave(mapping->tree_lock, ...);
> 		mem_cgroup_update_stat_locked();
> 		spin_unlock_irqrestore(...);
> 	}
>
Rather than tree_lock, lock_page_cgroup() can be used if tree_lock is not held.

		lock_page_cgroup();
		mem_cgroup_update_stat_locked();
		unlock_page_cgroup();

Andrea-san, FILE_MAPPED is updated without treelock, at least. You can't depend
on migration_lock about FILE_MAPPED.
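
For illustration, a rough sketch of that split, reusing Nishimura-san's
proposed names but taking lock_page_cgroup() in the unlocked variant
(untested, only to show the shape):

	/* caller must hold lock_page_cgroup() or give an equivalent guarantee */
	static void mem_cgroup_update_stat_locked(struct page_cgroup *pc,
				enum mem_cgroup_stat_index idx, int val)
	{
		struct mem_cgroup *mem = pc->mem_cgroup;

		/* flag updates (SetPageCgroupDirty() etc.) would go here too */
		if (mem && PageCgroupUsed(pc))
			__this_cpu_add(mem->stat->count[idx], val);
	}

	void mem_cgroup_update_stat_unlocked(struct page *page,
				enum mem_cgroup_stat_index idx, int val)
	{
		struct page_cgroup *pc = lookup_page_cgroup(page);

		if (unlikely(!pc))
			return;
		lock_page_cgroup(pc);
		mem_cgroup_update_stat_locked(pc, idx, val);
		unlock_page_cgroup(pc);
	}

Callers that already hold mapping->tree_lock (irqs off) or lock_page_cgroup()
would use the _locked variant directly, everybody else the _unlocked one.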


 
> > -void mem_cgroup_update_file_mapped(struct page *page, int val)
> > +void mem_cgroup_update_stat(struct page *page,
> > +			enum mem_cgroup_stat_index idx, int val)
> >  {
> I prefer "void mem_cgroup_update_page_stat(struct page *, enum mem_cgroup_page_stat_item, ..)"
> as I said above.
> 
> >  	struct mem_cgroup *mem;
> >  	struct page_cgroup *pc;
> >  
> > +	if (mem_cgroup_disabled())
> > +		return;
> >  	pc = lookup_page_cgroup(page);
> > -	if (unlikely(!pc))
> > +	if (unlikely(!pc) || !PageCgroupUsed(pc))
> >  		return;
> >  
> > -	lock_page_cgroup(pc);
> > -	mem = pc->mem_cgroup;
> > -	if (!mem)
> > -		goto done;
> > -
> > -	if (!PageCgroupUsed(pc))
> > -		goto done;
> > -
> > +	lock_page_cgroup_migrate(pc);
> >  	/*
> > -	 * Preemption is already disabled. We can use __this_cpu_xxx
> > -	 */
> > -	__this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], val);
> > -
> > -done:
> > -	unlock_page_cgroup(pc);
> > +	* It's guaranteed that this page is never uncharged.
> > +	* The only racy problem is moving account among memcgs.
> > +	*/
> > +	switch (idx) {
> > +	case MEM_CGROUP_STAT_FILE_MAPPED:
> > +		if (val > 0)
> > +			SetPageCgroupFileMapped(pc);
> > +		else
> > +			ClearPageCgroupFileMapped(pc);
> > +		break;
> > +	case MEM_CGROUP_STAT_FILE_DIRTY:
> > +		if (val > 0)
> > +			SetPageCgroupDirty(pc);
> > +		else
> > +			ClearPageCgroupDirty(pc);
> > +		break;
> > +	case MEM_CGROUP_STAT_WRITEBACK:
> > +		if (val > 0)
> > +			SetPageCgroupWriteback(pc);
> > +		else
> > +			ClearPageCgroupWriteback(pc);
> > +		break;
> > +	case MEM_CGROUP_STAT_WRITEBACK_TEMP:
> > +		if (val > 0)
> > +			SetPageCgroupWritebackTemp(pc);
> > +		else
> > +			ClearPageCgroupWritebackTemp(pc);
> > +		break;
> > +	case MEM_CGROUP_STAT_UNSTABLE_NFS:
> > +		if (val > 0)
> > +			SetPageCgroupUnstableNFS(pc);
> > +		else
> > +			ClearPageCgroupUnstableNFS(pc);
> > +		break;
> > +	default:
> > +		BUG();
> > +		break;
> > +	}
> > +	mem = pc->mem_cgroup;
> > +	if (likely(mem))
> > +		__this_cpu_add(mem->stat->count[idx], val);
> > +	unlock_page_cgroup_migrate(pc);
> >  }
> > +EXPORT_SYMBOL_GPL(mem_cgroup_update_stat);
> >  
> >  /*
> >   * size of first charge trial. "32" comes from vmscan.c's magic value.
> > @@ -1701,6 +1885,45 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
> >  	memcg_check_events(mem, pc->page);
> >  }
> >  
> > +/*
> > + * Update file cache accounted statistics on task migration.
> > + *
> > + * TODO: We don't move charges of file (including shmem/tmpfs) pages for now.
> > + * So, at the moment this function simply returns without updating accounted
> > + * statistics, because we deal only with anonymous pages here.
> > + */
> This function is not unique to task migration. It's called from rmdir() too.
> So this comment isn't needed.
> 
> > +static void __mem_cgroup_update_file_stat(struct page_cgroup *pc,
> > +	struct mem_cgroup *from, struct mem_cgroup *to)
> > +{
> > +	struct page *page = pc->page;
> > +
> > +	if (!page_mapped(page) || PageAnon(page))
> > +		return;
> > +
> > +	if (PageCgroupFileMapped(pc)) {
> > +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > +	}
> > +	if (PageCgroupDirty(pc)) {
> > +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_DIRTY]);
> > +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_DIRTY]);
> > +	}
> > +	if (PageCgroupWriteback(pc)) {
> > +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_WRITEBACK]);
> > +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WRITEBACK]);
> > +	}
> > +	if (PageCgroupWritebackTemp(pc)) {
> > +		__this_cpu_dec(
> > +			from->stat->count[MEM_CGROUP_STAT_WRITEBACK_TEMP]);
> > +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WRITEBACK_TEMP]);
> > +	}
> > +	if (PageCgroupUnstableNFS(pc)) {
> > +		__this_cpu_dec(
> > +			from->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
> > +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
> > +	}
> > +}
> > +
> >  /**
> >   * __mem_cgroup_move_account - move account of the page
> >   * @pc:	page_cgroup of the page.
> > @@ -1721,22 +1944,16 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
> >  static void __mem_cgroup_move_account(struct page_cgroup *pc,
> >  	struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
> >  {
> > -	struct page *page;
> > -
> >  	VM_BUG_ON(from == to);
> >  	VM_BUG_ON(PageLRU(pc->page));
> >  	VM_BUG_ON(!PageCgroupLocked(pc));
> >  	VM_BUG_ON(!PageCgroupUsed(pc));
> >  	VM_BUG_ON(pc->mem_cgroup != from);
> >  
> > -	page = pc->page;
> > -	if (page_mapped(page) && !PageAnon(page)) {
> > -		/* Update mapped_file data for mem_cgroup */
> > -		preempt_disable();
> > -		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > -		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > -		preempt_enable();
> > -	}
> > +	preempt_disable();
> > +	lock_page_cgroup_migrate(pc);
> > +	__mem_cgroup_update_file_stat(pc, from, to);
> > +
> >  	mem_cgroup_charge_statistics(from, pc, false);
> >  	if (uncharge)
> >  		/* This is not "cancel", but cancel_charge does all we need. */
> > @@ -1745,6 +1962,8 @@ static void __mem_cgroup_move_account(struct page_cgroup *pc,
> >  	/* caller should have done css_get */
> >  	pc->mem_cgroup = to;
> >  	mem_cgroup_charge_statistics(to, pc, true);
> > +	unlock_page_cgroup_migrate(pc);
> > +	preempt_enable();
> Glad to see this cleanup :)
> But, hmm, I don't think preempt_disable/enable() is enough (and bit_spin_lock/unlock()
> does it anyway). lock/unlock_page_cgroup_migrate() can be called in irq context
> (e.g. end_page_writeback()), so I think we must use local_irq_disable()/enable() here.
> 
Ah, hmm, yes. irq-disable is required.
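
E.g. something along these lines for __mem_cgroup_move_account(), keeping the
body as in the patch and only changing the locking shape (untested sketch):

	static void __mem_cgroup_move_account(struct page_cgroup *pc,
		struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
	{
		unsigned long flags;

		/* VM_BUG_ON() checks as before ... */

		local_irq_save(flags);
		lock_page_cgroup_migrate(pc);
		__mem_cgroup_update_file_stat(pc, from, to);
		/* ... charge statistics, uncharge handling, pc->mem_cgroup = to ... */
		unlock_page_cgroup_migrate(pc);
		local_irq_restore(flags);
	}

local_irq_save()/restore() keeps this safe when the migrate lock is also
taken from irq context, e.g. via end_page_writeback().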

Thanks,
-Kame

>


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH -mmotm 2/4] page_cgroup: introduce file cache flags
       [not found]   ` <1267699215-4101-3-git-send-email-arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
@ 2010-03-05  6:32     ` Balbir Singh
  0 siblings, 0 replies; 68+ messages in thread
From: Balbir Singh @ 2010-03-05  6:32 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Trond Myklebust, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Suleiman Souhlal, Andrew Morton,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Vivek Goyal

* Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org> [2010-03-04 11:40:13]:

> Introduce page_cgroup flags to keep track of file cache pages.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> Signed-off-by: Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
> ---

Looks good


Acked-by: Balbir Singh <balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
 

>  include/linux/page_cgroup.h |   49 +++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 49 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index 30b0813..1b79ded 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -39,6 +39,12 @@ enum {
>  	PCG_CACHE, /* charged as cache */
>  	PCG_USED, /* this object is in use. */
>  	PCG_ACCT_LRU, /* page has been accounted for */
> +	PCG_MIGRATE_LOCK, /* used for mutual exclusion of account migration */
> +	PCG_ACCT_FILE_MAPPED, /* page is accounted as file rss */
> +	PCG_ACCT_DIRTY, /* page is dirty */
> +	PCG_ACCT_WRITEBACK, /* page is being written back to disk */
> +	PCG_ACCT_WRITEBACK_TEMP, /* page is used as temporary buffer for FUSE */
> +	PCG_ACCT_UNSTABLE_NFS, /* NFS page not yet committed to the server */
>  };
> 
>  #define TESTPCGFLAG(uname, lname)			\
> @@ -73,6 +79,27 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
>  TESTPCGFLAG(AcctLRU, ACCT_LRU)
>  TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)
> 
> +/* File cache and dirty memory flags */
> +TESTPCGFLAG(FileMapped, ACCT_FILE_MAPPED)
> +SETPCGFLAG(FileMapped, ACCT_FILE_MAPPED)
> +CLEARPCGFLAG(FileMapped, ACCT_FILE_MAPPED)
> +
> +TESTPCGFLAG(Dirty, ACCT_DIRTY)
> +SETPCGFLAG(Dirty, ACCT_DIRTY)
> +CLEARPCGFLAG(Dirty, ACCT_DIRTY)
> +
> +TESTPCGFLAG(Writeback, ACCT_WRITEBACK)
> +SETPCGFLAG(Writeback, ACCT_WRITEBACK)
> +CLEARPCGFLAG(Writeback, ACCT_WRITEBACK)
> +
> +TESTPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP)
> +SETPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP)
> +CLEARPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP)
> +
> +TESTPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS)
> +SETPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS)
> +CLEARPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS)
> +
>  static inline int page_cgroup_nid(struct page_cgroup *pc)
>  {
>  	return page_to_nid(pc->page);
> @@ -83,6 +110,9 @@ static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
>  	return page_zonenum(pc->page);
>  }
> 
> +/*
> + * lock_page_cgroup() should not be held under mapping->tree_lock
> + */

Maybe a DEBUG WARN_ON would be appropriate here?

>  static inline void lock_page_cgroup(struct page_cgroup *pc)
>  {
>  	bit_spin_lock(PCG_LOCK, &pc->flags);
> @@ -93,6 +123,25 @@ static inline void unlock_page_cgroup(struct page_cgroup *pc)
>  	bit_spin_unlock(PCG_LOCK, &pc->flags);
>  }
> 
> +/*
> + * Lock order is
> + *     lock_page_cgroup()
> + *             lock_page_cgroup_migrate()
> + *
> + * This lock is not a lock for charge/uncharge but for account moving,
> + * i.e. overwriting pc->mem_cgroup. The lock owner should guarantee by itself
> + * that the page is not uncharged while this lock is held.
> + */
> +static inline void lock_page_cgroup_migrate(struct page_cgroup *pc)
> +{
> +	bit_spin_lock(PCG_MIGRATE_LOCK, &pc->flags);
> +}
> +
> +static inline void unlock_page_cgroup_migrate(struct page_cgroup *pc)
> +{
> +	bit_spin_unlock(PCG_MIGRATE_LOCK, &pc->flags);
> +}
> +
>  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
>  struct page_cgroup;
> 
> -- 
> 1.6.3.3
> 

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH -mmotm 2/4] page_cgroup: introduce file cache flags
  2010-03-04 10:40   ` Andrea Righi
@ 2010-03-05  6:32     ` Balbir Singh
  -1 siblings, 0 replies; 68+ messages in thread
From: Balbir Singh @ 2010-03-05  6:32 UTC (permalink / raw)
  To: Andrea Righi
  Cc: KAMEZAWA Hiroyuki, Vivek Goyal, Peter Zijlstra, Trond Myklebust,
	Suleiman Souhlal, Greg Thelen, Daisuke Nishimura,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

* Andrea Righi <arighi@develer.com> [2010-03-04 11:40:13]:

> Introduce page_cgroup flags to keep track of file cache pages.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: Andrea Righi <arighi@develer.com>
> ---

Looks good


Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
 

>  include/linux/page_cgroup.h |   49 +++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 49 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index 30b0813..1b79ded 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -39,6 +39,12 @@ enum {
>  	PCG_CACHE, /* charged as cache */
>  	PCG_USED, /* this object is in use. */
>  	PCG_ACCT_LRU, /* page has been accounted for */
> +	PCG_MIGRATE_LOCK, /* used for mutual exclusion of account migration */
> +	PCG_ACCT_FILE_MAPPED, /* page is accounted as file rss */
> +	PCG_ACCT_DIRTY, /* page is dirty */
> +	PCG_ACCT_WRITEBACK, /* page is being written back to disk */
> +	PCG_ACCT_WRITEBACK_TEMP, /* page is used as temporary buffer for FUSE */
> +	PCG_ACCT_UNSTABLE_NFS, /* NFS page not yet committed to the server */
>  };
> 
>  #define TESTPCGFLAG(uname, lname)			\
> @@ -73,6 +79,27 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
>  TESTPCGFLAG(AcctLRU, ACCT_LRU)
>  TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)
> 
> +/* File cache and dirty memory flags */
> +TESTPCGFLAG(FileMapped, ACCT_FILE_MAPPED)
> +SETPCGFLAG(FileMapped, ACCT_FILE_MAPPED)
> +CLEARPCGFLAG(FileMapped, ACCT_FILE_MAPPED)
> +
> +TESTPCGFLAG(Dirty, ACCT_DIRTY)
> +SETPCGFLAG(Dirty, ACCT_DIRTY)
> +CLEARPCGFLAG(Dirty, ACCT_DIRTY)
> +
> +TESTPCGFLAG(Writeback, ACCT_WRITEBACK)
> +SETPCGFLAG(Writeback, ACCT_WRITEBACK)
> +CLEARPCGFLAG(Writeback, ACCT_WRITEBACK)
> +
> +TESTPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP)
> +SETPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP)
> +CLEARPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP)
> +
> +TESTPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS)
> +SETPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS)
> +CLEARPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS)
> +
>  static inline int page_cgroup_nid(struct page_cgroup *pc)
>  {
>  	return page_to_nid(pc->page);
> @@ -83,6 +110,9 @@ static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
>  	return page_zonenum(pc->page);
>  }
> 
> +/*
> + * lock_page_cgroup() should not be held under mapping->tree_lock
> + */

Maybe a DEBUG WARN_ON would be appropriate here?
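
For example, a rough (and only heuristic) check could piggyback on the fact
that mapping->tree_lock is always taken with irqs disabled; this assumes the
remaining lock_page_cgroup() callers normally run with irqs enabled, which
would need auditing before doing it for real:

	static inline void lock_page_cgroup(struct page_cgroup *pc)
	{
	#ifdef CONFIG_DEBUG_VM
		/*
		 * tree_lock is taken with irqs disabled, so warn if we are
		 * about to take this lock from an irq-disabled context.
		 * Heuristic only; may produce false positives.
		 */
		WARN_ON_ONCE(irqs_disabled());
	#endif
		bit_spin_lock(PCG_LOCK, &pc->flags);
	}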

>  static inline void lock_page_cgroup(struct page_cgroup *pc)
>  {
>  	bit_spin_lock(PCG_LOCK, &pc->flags);
> @@ -93,6 +123,25 @@ static inline void unlock_page_cgroup(struct page_cgroup *pc)
>  	bit_spin_unlock(PCG_LOCK, &pc->flags);
>  }
> 
> +/*
> + * Lock order is
> + *     lock_page_cgroup()
> + *             lock_page_cgroup_migrate()
> + *
> + * This lock is not a lock for charge/uncharge but for account moving,
> + * i.e. overwriting pc->mem_cgroup. The lock owner should guarantee by itself
> + * that the page is not uncharged while this lock is held.
> + */
> +static inline void lock_page_cgroup_migrate(struct page_cgroup *pc)
> +{
> +	bit_spin_lock(PCG_MIGRATE_LOCK, &pc->flags);
> +}
> +
> +static inline void unlock_page_cgroup_migrate(struct page_cgroup *pc)
> +{
> +	bit_spin_unlock(PCG_MIGRATE_LOCK, &pc->flags);
> +}
> +
>  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
>  struct page_cgroup;
> 
> -- 
> 1.6.3.3
> 

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH -mmotm 2/4] page_cgroup: introduce file cache flags
@ 2010-03-05  6:32     ` Balbir Singh
  0 siblings, 0 replies; 68+ messages in thread
From: Balbir Singh @ 2010-03-05  6:32 UTC (permalink / raw)
  To: Andrea Righi
  Cc: KAMEZAWA Hiroyuki, Vivek Goyal, Peter Zijlstra, Trond Myklebust,
	Suleiman Souhlal, Greg Thelen, Daisuke Nishimura,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

* Andrea Righi <arighi@develer.com> [2010-03-04 11:40:13]:

> Introduce page_cgroup flags to keep track of file cache pages.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: Andrea Righi <arighi@develer.com>
> ---

Looks good


Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
 

>  include/linux/page_cgroup.h |   49 +++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 49 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index 30b0813..1b79ded 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -39,6 +39,12 @@ enum {
>  	PCG_CACHE, /* charged as cache */
>  	PCG_USED, /* this object is in use. */
>  	PCG_ACCT_LRU, /* page has been accounted for */
> +	PCG_MIGRATE_LOCK, /* used for mutual exclusion of account migration */
> +	PCG_ACCT_FILE_MAPPED, /* page is accounted as file rss */
> +	PCG_ACCT_DIRTY, /* page is dirty */
> +	PCG_ACCT_WRITEBACK, /* page is being written back to disk */
> +	PCG_ACCT_WRITEBACK_TEMP, /* page is used as temporary buffer for FUSE */
> +	PCG_ACCT_UNSTABLE_NFS, /* NFS page not yet committed to the server */
>  };
> 
>  #define TESTPCGFLAG(uname, lname)			\
> @@ -73,6 +79,27 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
>  TESTPCGFLAG(AcctLRU, ACCT_LRU)
>  TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)
> 
> +/* File cache and dirty memory flags */
> +TESTPCGFLAG(FileMapped, ACCT_FILE_MAPPED)
> +SETPCGFLAG(FileMapped, ACCT_FILE_MAPPED)
> +CLEARPCGFLAG(FileMapped, ACCT_FILE_MAPPED)
> +
> +TESTPCGFLAG(Dirty, ACCT_DIRTY)
> +SETPCGFLAG(Dirty, ACCT_DIRTY)
> +CLEARPCGFLAG(Dirty, ACCT_DIRTY)
> +
> +TESTPCGFLAG(Writeback, ACCT_WRITEBACK)
> +SETPCGFLAG(Writeback, ACCT_WRITEBACK)
> +CLEARPCGFLAG(Writeback, ACCT_WRITEBACK)
> +
> +TESTPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP)
> +SETPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP)
> +CLEARPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP)
> +
> +TESTPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS)
> +SETPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS)
> +CLEARPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS)
> +
>  static inline int page_cgroup_nid(struct page_cgroup *pc)
>  {
>  	return page_to_nid(pc->page);
> @@ -83,6 +110,9 @@ static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
>  	return page_zonenum(pc->page);
>  }
> 
> +/*
> + * lock_page_cgroup() should not be held under mapping->tree_lock
> + */

Maybe a DEBUG WARN_ON would be appropriate here?

>  static inline void lock_page_cgroup(struct page_cgroup *pc)
>  {
>  	bit_spin_lock(PCG_LOCK, &pc->flags);
> @@ -93,6 +123,25 @@ static inline void unlock_page_cgroup(struct page_cgroup *pc)
>  	bit_spin_unlock(PCG_LOCK, &pc->flags);
>  }
> 
> +/*
> + * Lock order is
> + *     lock_page_cgroup()
> + *             lock_page_cgroup_migrate()
> + *
> + * This lock is not a lock for charge/uncharge but for account moving,
> + * i.e. overwriting pc->mem_cgroup. The lock owner should guarantee by itself
> + * that the page is not uncharged while this lock is held.
> + */
> +static inline void lock_page_cgroup_migrate(struct page_cgroup *pc)
> +{
> +	bit_spin_lock(PCG_MIGRATE_LOCK, &pc->flags);
> +}
> +
> +static inline void unlock_page_cgroup_migrate(struct page_cgroup *pc)
> +{
> +	bit_spin_unlock(PCG_MIGRATE_LOCK, &pc->flags);
> +}
> +
>  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
>  struct page_cgroup;
> 
> -- 
> 1.6.3.3
> 

-- 
	Three Cheers,
	Balbir


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH -mmotm 4/4] memcg: dirty pages instrumentation
       [not found]   ` <1267699215-4101-5-git-send-email-arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
  2010-03-04 16:18     ` Vivek Goyal
  2010-03-04 19:41       ` Vivek Goyal
@ 2010-03-05  6:38     ` Balbir Singh
  2 siblings, 0 replies; 68+ messages in thread
From: Balbir Singh @ 2010-03-05  6:38 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Daisuke Nishimura, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Trond Myklebust, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Suleiman Souhlal, Andrew Morton,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Vivek Goyal

* Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org> [2010-03-04 11:40:15]:

> Apply the cgroup dirty pages accounting and limiting infrastructure
> to the relevant kernel functions.
> 
> Signed-off-by: Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
> ---
>  fs/fuse/file.c      |    5 +++
>  fs/nfs/write.c      |    4 ++
>  fs/nilfs2/segment.c |   11 +++++-
>  mm/filemap.c        |    1 +
>  mm/page-writeback.c |   91 ++++++++++++++++++++++++++++++++++-----------------
>  mm/rmap.c           |    4 +-
>  mm/truncate.c       |    2 +
>  7 files changed, 84 insertions(+), 34 deletions(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index a9f5e13..dbbdd53 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -11,6 +11,7 @@
>  #include <linux/pagemap.h>
>  #include <linux/slab.h>
>  #include <linux/kernel.h>
> +#include <linux/memcontrol.h>
>  #include <linux/sched.h>
>  #include <linux/module.h>
> 
> @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
> 
>  	list_del(&req->writepages_entry);
>  	dec_bdi_stat(bdi, BDI_WRITEBACK);
> +	mem_cgroup_update_stat(req->pages[0],
> +			MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
>  	dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
>  	bdi_writeout_inc(bdi);
>  	wake_up(&fi->page_waitq);
> @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
>  	req->inode = inode;
> 
>  	inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
> +	mem_cgroup_update_stat(tmp_page,
> +			MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
>  	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
>  	end_page_writeback(page);
> 
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index b753242..7316f7a 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
>  			req->wb_index,
>  			NFS_PAGE_TAG_COMMIT);
>  	spin_unlock(&inode->i_lock);
> +	mem_cgroup_update_stat(req->wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1);
>  	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>  	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
>  	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> @@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
>  	struct page *page = req->wb_page;
> 
>  	if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
>  		dec_zone_page_state(page, NR_UNSTABLE_NFS);
>  		dec_bdi_stat(page->mapping->backing_dev_info, BDI_UNSTABLE);
>  		return 1;
> @@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
>  		req = nfs_list_entry(head->next);
>  		nfs_list_remove_request(req);
>  		nfs_mark_request_commit(req);
> +		mem_cgroup_update_stat(req->wb_page,
> +				MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
>  		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>  		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
>  				BDI_UNSTABLE);
> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> index ada2f1b..27a01b1 100644
> --- a/fs/nilfs2/segment.c
> +++ b/fs/nilfs2/segment.c
> @@ -24,6 +24,7 @@
>  #include <linux/pagemap.h>
>  #include <linux/buffer_head.h>
>  #include <linux/writeback.h>
> +#include <linux/memcontrol.h>
>  #include <linux/bio.h>
>  #include <linux/completion.h>
>  #include <linux/blkdev.h>
> @@ -1660,8 +1661,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
>  	} while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
>  	kunmap_atomic(kaddr, KM_USER0);
> 
> -	if (!TestSetPageWriteback(clone_page))
> +	if (!TestSetPageWriteback(clone_page)) {
> +		mem_cgroup_update_stat(clone_page,
> +				MEM_CGROUP_STAT_WRITEBACK, 1);

I wonder if we should start implementing inc and dec to avoid passing
the +1 and -1 parameters. It should make the code easier to read.
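
For instance (just a sketch on top of the mem_cgroup_update_stat() from
patch 3/4, the wrapper names are placeholders):

	static inline void mem_cgroup_inc_page_stat(struct page *page,
					enum mem_cgroup_stat_index idx)
	{
		mem_cgroup_update_stat(page, idx, 1);
	}

	static inline void mem_cgroup_dec_page_stat(struct page *page,
					enum mem_cgroup_stat_index idx)
	{
		mem_cgroup_update_stat(page, idx, -1);
	}

so that call sites read e.g.
mem_cgroup_inc_page_stat(clone_page, MEM_CGROUP_STAT_WRITEBACK);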

>  		inc_zone_page_state(clone_page, NR_WRITEBACK);
> +	}
>  	unlock_page(clone_page);
> 
>  	return 0;
> @@ -1783,8 +1787,11 @@ static void __nilfs_end_page_io(struct page *page, int err)
>  	}
> 
>  	if (buffer_nilfs_allocated(page_buffers(page))) {
> -		if (TestClearPageWriteback(page))
> +		if (TestClearPageWriteback(page)) {
> +			mem_cgroup_update_stat(page,
> +					MEM_CGROUP_STAT_WRITEBACK, -1);
>  			dec_zone_page_state(page, NR_WRITEBACK);
> +		}
>  	} else
>  		end_page_writeback(page);
>  }
> diff --git a/mm/filemap.c b/mm/filemap.c
> index fe09e51..f85acae 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
>  	 * having removed the page entirely.
>  	 */
>  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
>  		dec_zone_page_state(page, NR_FILE_DIRTY);
>  		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
>  	}
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 5a0f8f3..c5d14ea 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -137,13 +137,16 @@ static struct prop_descriptor vm_dirties;
>   */
>  static int calc_period_shift(void)
>  {
> +	struct dirty_param dirty_param;

vm_dirty_param?

>  	unsigned long dirty_total;
> 
> -	if (vm_dirty_bytes)
> -		dirty_total = vm_dirty_bytes / PAGE_SIZE;
> +	get_dirty_param(&dirty_param);

get_vm_dirty_param() is a nicer name.

> +
> +	if (dirty_param.dirty_bytes)
> +		dirty_total = dirty_param.dirty_bytes / PAGE_SIZE;
>  	else
> -		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> -				100;
> +		dirty_total = (dirty_param.dirty_ratio *
> +				determine_dirtyable_memory()) / 100;
>  	return 2 + ilog2(dirty_total - 1);
>  }
> 
> @@ -408,41 +411,46 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
>   */
>  unsigned long determine_dirtyable_memory(void)
>  {
> -	unsigned long x;
> -
> -	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> +	unsigned long memory;
> +	s64 memcg_memory;
> 
> +	memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
>  	if (!vm_highmem_is_dirtyable)
> -		x -= highmem_dirtyable_memory(x);
> -
> -	return x + 1;	/* Ensure that we never return 0 */
> +		memory -= highmem_dirtyable_memory(memory);
> +	if (mem_cgroup_has_dirty_limit())
> +		return memory + 1;

Vivek already pointed out this issue, I suppose. The check should be *not* mem_cgroup_has_dirty_limit().

> +	memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);

Can memcg_memory be 0?
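
Putting the two remarks together, determine_dirtyable_memory() might end up
looking roughly like this (sketch; it assumes mem_cgroup_page_stat() returns
a non-positive value when it has nothing meaningful to report, which is
worth double-checking):

	unsigned long determine_dirtyable_memory(void)
	{
		unsigned long memory;
		s64 memcg_memory;

		memory = global_page_state(NR_FREE_PAGES) +
				global_reclaimable_pages();
		if (!vm_highmem_is_dirtyable)
			memory -= highmem_dirtyable_memory(memory);
		/* no per-cgroup dirty limit: use the global value only */
		if (!mem_cgroup_has_dirty_limit())
			return memory + 1;
		memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
		if (memcg_memory <= 0)
			return memory + 1;
		return min((unsigned long)memcg_memory, memory + 1);
	}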

> +	return min((unsigned long)memcg_memory, memory + 1);
>  }
> 
>  void
>  get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
>  		 unsigned long *pbdi_dirty, struct backing_dev_info *bdi)
>  {
> -	unsigned long background;
> -	unsigned long dirty;
> +	unsigned long dirty, background;
>  	unsigned long available_memory = determine_dirtyable_memory();
>  	struct task_struct *tsk;
> +	struct dirty_param dirty_param;
> +
> +	get_dirty_param(&dirty_param);
> 
> -	if (vm_dirty_bytes)
> -		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> +	if (dirty_param.dirty_bytes)
> +		dirty = DIV_ROUND_UP(dirty_param.dirty_bytes, PAGE_SIZE);
>  	else {
>  		int dirty_ratio;
> 
> -		dirty_ratio = vm_dirty_ratio;
> +		dirty_ratio = dirty_param.dirty_ratio;
>  		if (dirty_ratio < 5)
>  			dirty_ratio = 5;
>  		dirty = (dirty_ratio * available_memory) / 100;
>  	}
> 
> -	if (dirty_background_bytes)
> -		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> +	if (dirty_param.dirty_background_bytes)
> +		background = DIV_ROUND_UP(dirty_param.dirty_background_bytes,
> +						PAGE_SIZE);
>  	else
> -		background = (dirty_background_ratio * available_memory) / 100;
> -
> +		background = (dirty_param.dirty_background_ratio *
> +						available_memory) / 100;
>  	if (background >= dirty)
>  		background = dirty / 2;
>  	tsk = current;
> @@ -508,9 +516,15 @@ static void balance_dirty_pages(struct address_space *mapping,
>  		get_dirty_limits(&background_thresh, &dirty_thresh,
>  				&bdi_thresh, bdi);
> 
> -		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +		if (mem_cgroup_has_dirty_limit()) {
> +			nr_reclaimable =
> +				mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +			nr_writeback = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
> +		} else {
> +			nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
>  					global_page_state(NR_UNSTABLE_NFS);
> -		nr_writeback = global_page_state(NR_WRITEBACK);
> +			nr_writeback = global_page_state(NR_WRITEBACK);
> +		}
> 
>  		bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY);
>  		if (bdi_cap_account_unstable(bdi)) {
> @@ -611,10 +625,13 @@ static void balance_dirty_pages(struct address_space *mapping,
>  	 * In normal mode, we start background writeout at the lower
>  	 * background_thresh, to keep the amount of dirty memory low.
>  	 */
> +	if (mem_cgroup_has_dirty_limit())
> +		nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +	else
> +		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +				global_page_state(NR_UNSTABLE_NFS);
>  	if ((laptop_mode && pages_written) ||
> -	    (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
> -			       + global_page_state(NR_UNSTABLE_NFS))
> -					  > background_thresh)))
> +	    (!laptop_mode && (nr_reclaimable > background_thresh)))
>  		bdi_start_writeback(bdi, NULL, 0);
>  }
> 
> @@ -678,6 +695,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>  	unsigned long dirty_thresh;
> 
>          for ( ; ; ) {
> +		unsigned long dirty;
> +
>  		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
> 
>                  /*
> @@ -686,10 +705,15 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>                   */
>                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
> 
> -                if (global_page_state(NR_UNSTABLE_NFS) +
> -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> -                        	break;
> -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> +		if (mem_cgroup_has_dirty_limit())
> +			dirty = mem_cgroup_page_stat(
> +					MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> +		else
> +			dirty = global_page_state(NR_UNSTABLE_NFS) +
> +				global_page_state(NR_WRITEBACK);
> +		if (dirty <= dirty_thresh)
> +			break;
> +		congestion_wait(BLK_RW_ASYNC, HZ/10);
> 
>  		/*
>  		 * The caller might hold locks which can prevent IO completion
> @@ -1096,6 +1120,7 @@ int __set_page_dirty_no_writeback(struct page *page)
>  void account_page_dirtied(struct page *page, struct address_space *mapping)
>  {
>  	if (mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, 1);
>  		__inc_zone_page_state(page, NR_FILE_DIRTY);
>  		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
>  		task_dirty_inc(current);
> @@ -1297,6 +1322,8 @@ int clear_page_dirty_for_io(struct page *page)
>  		 * for more comments.
>  		 */
>  		if (TestClearPageDirty(page)) {
> +			mem_cgroup_update_stat(page,
> +					MEM_CGROUP_STAT_FILE_DIRTY, -1);
>  			dec_zone_page_state(page, NR_FILE_DIRTY);
>  			dec_bdi_stat(mapping->backing_dev_info,
>  					BDI_DIRTY);
> @@ -1332,8 +1359,10 @@ int test_clear_page_writeback(struct page *page)
>  	} else {
>  		ret = TestClearPageWriteback(page);
>  	}
> -	if (ret)
> +	if (ret) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_WRITEBACK, -1);
>  		dec_zone_page_state(page, NR_WRITEBACK);
> +	}
>  	return ret;
>  }
> 
> @@ -1363,8 +1392,10 @@ int test_set_page_writeback(struct page *page)
>  	} else {
>  		ret = TestSetPageWriteback(page);
>  	}
> -	if (!ret)
> +	if (!ret) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_WRITEBACK, 1);
>  		inc_zone_page_state(page, NR_WRITEBACK);
> +	}
>  	return ret;
> 
>  }
> diff --git a/mm/rmap.c b/mm/rmap.c
> index fcd593c..d47c257 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -829,7 +829,7 @@ void page_add_file_rmap(struct page *page)
>  {
>  	if (atomic_inc_and_test(&page->_mapcount)) {
>  		__inc_zone_page_state(page, NR_FILE_MAPPED);
> -		mem_cgroup_update_file_mapped(page, 1);
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, 1);
>  	}
>  }
> 
> @@ -861,7 +861,7 @@ void page_remove_rmap(struct page *page)
>  		__dec_zone_page_state(page, NR_ANON_PAGES);
>  	} else {
>  		__dec_zone_page_state(page, NR_FILE_MAPPED);
> -		mem_cgroup_update_file_mapped(page, -1);
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, -1);
>  	}
>  	/*
>  	 * It would be tidy to reset the PageAnon mapping here,
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 2466e0c..5f437e7 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -73,6 +73,8 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
>  	if (TestClearPageDirty(page)) {
>  		struct address_space *mapping = page->mapping;
>  		if (mapping && mapping_cap_account_dirty(mapping)) {
> +			mem_cgroup_update_stat(page,
> +					MEM_CGROUP_STAT_FILE_DIRTY, -1);
>  			dec_zone_page_state(page, NR_FILE_DIRTY);
>  			dec_bdi_stat(mapping->backing_dev_info,
>  					BDI_DIRTY);
> -- 
> 1.6.3.3
> 

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH -mmotm 4/4] memcg: dirty pages instrumentation
  2010-03-04 10:40   ` Andrea Righi
@ 2010-03-05  6:38     ` Balbir Singh
  -1 siblings, 0 replies; 68+ messages in thread
From: Balbir Singh @ 2010-03-05  6:38 UTC (permalink / raw)
  To: Andrea Righi
  Cc: KAMEZAWA Hiroyuki, Vivek Goyal, Peter Zijlstra, Trond Myklebust,
	Suleiman Souhlal, Greg Thelen, Daisuke Nishimura,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

* Andrea Righi <arighi@develer.com> [2010-03-04 11:40:15]:

> Apply the cgroup dirty pages accounting and limiting infrastructure
> to the relevant kernel functions.
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>
> ---
>  fs/fuse/file.c      |    5 +++
>  fs/nfs/write.c      |    4 ++
>  fs/nilfs2/segment.c |   11 +++++-
>  mm/filemap.c        |    1 +
>  mm/page-writeback.c |   91 ++++++++++++++++++++++++++++++++++-----------------
>  mm/rmap.c           |    4 +-
>  mm/truncate.c       |    2 +
>  7 files changed, 84 insertions(+), 34 deletions(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index a9f5e13..dbbdd53 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -11,6 +11,7 @@
>  #include <linux/pagemap.h>
>  #include <linux/slab.h>
>  #include <linux/kernel.h>
> +#include <linux/memcontrol.h>
>  #include <linux/sched.h>
>  #include <linux/module.h>
> 
> @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
> 
>  	list_del(&req->writepages_entry);
>  	dec_bdi_stat(bdi, BDI_WRITEBACK);
> +	mem_cgroup_update_stat(req->pages[0],
> +			MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
>  	dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
>  	bdi_writeout_inc(bdi);
>  	wake_up(&fi->page_waitq);
> @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
>  	req->inode = inode;
> 
>  	inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
> +	mem_cgroup_update_stat(tmp_page,
> +			MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
>  	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
>  	end_page_writeback(page);
> 
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index b753242..7316f7a 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
>  			req->wb_index,
>  			NFS_PAGE_TAG_COMMIT);
>  	spin_unlock(&inode->i_lock);
> +	mem_cgroup_update_stat(req->wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1);
>  	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>  	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
>  	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> @@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
>  	struct page *page = req->wb_page;
> 
>  	if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
>  		dec_zone_page_state(page, NR_UNSTABLE_NFS);
>  		dec_bdi_stat(page->mapping->backing_dev_info, BDI_UNSTABLE);
>  		return 1;
> @@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
>  		req = nfs_list_entry(head->next);
>  		nfs_list_remove_request(req);
>  		nfs_mark_request_commit(req);
> +		mem_cgroup_update_stat(req->wb_page,
> +				MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
>  		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>  		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
>  				BDI_UNSTABLE);
> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> index ada2f1b..27a01b1 100644
> --- a/fs/nilfs2/segment.c
> +++ b/fs/nilfs2/segment.c
> @@ -24,6 +24,7 @@
>  #include <linux/pagemap.h>
>  #include <linux/buffer_head.h>
>  #include <linux/writeback.h>
> +#include <linux/memcontrol.h>
>  #include <linux/bio.h>
>  #include <linux/completion.h>
>  #include <linux/blkdev.h>
> @@ -1660,8 +1661,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
>  	} while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
>  	kunmap_atomic(kaddr, KM_USER0);
> 
> -	if (!TestSetPageWriteback(clone_page))
> +	if (!TestSetPageWriteback(clone_page)) {
> +		mem_cgroup_update_stat(clone_page,
> +				MEM_CGROUP_STAT_WRITEBACK, 1);

I wonder if we should start implementing inc and dec to avoid passing
the +1 and -1 parameters. It should make the code easier to read.

>  		inc_zone_page_state(clone_page, NR_WRITEBACK);
> +	}
>  	unlock_page(clone_page);
> 
>  	return 0;
> @@ -1783,8 +1787,11 @@ static void __nilfs_end_page_io(struct page *page, int err)
>  	}
> 
>  	if (buffer_nilfs_allocated(page_buffers(page))) {
> -		if (TestClearPageWriteback(page))
> +		if (TestClearPageWriteback(page)) {
> +			mem_cgroup_update_stat(page,
> +					MEM_CGROUP_STAT_WRITEBACK, -1);
>  			dec_zone_page_state(page, NR_WRITEBACK);
> +		}
>  	} else
>  		end_page_writeback(page);
>  }
> diff --git a/mm/filemap.c b/mm/filemap.c
> index fe09e51..f85acae 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
>  	 * having removed the page entirely.
>  	 */
>  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
>  		dec_zone_page_state(page, NR_FILE_DIRTY);
>  		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
>  	}
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 5a0f8f3..c5d14ea 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -137,13 +137,16 @@ static struct prop_descriptor vm_dirties;
>   */
>  static int calc_period_shift(void)
>  {
> +	struct dirty_param dirty_param;

vm_dirty_param?

>  	unsigned long dirty_total;
> 
> -	if (vm_dirty_bytes)
> -		dirty_total = vm_dirty_bytes / PAGE_SIZE;
> +	get_dirty_param(&dirty_param);

get_vm_dirty_param() is a nicer name.

> +
> +	if (dirty_param.dirty_bytes)
> +		dirty_total = dirty_param.dirty_bytes / PAGE_SIZE;
>  	else
> -		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> -				100;
> +		dirty_total = (dirty_param.dirty_ratio *
> +				determine_dirtyable_memory()) / 100;
>  	return 2 + ilog2(dirty_total - 1);
>  }
> 
> @@ -408,41 +411,46 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
>   */
>  unsigned long determine_dirtyable_memory(void)
>  {
> -	unsigned long x;
> -
> -	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> +	unsigned long memory;
> +	s64 memcg_memory;
> 
> +	memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
>  	if (!vm_highmem_is_dirtyable)
> -		x -= highmem_dirtyable_memory(x);
> -
> -	return x + 1;	/* Ensure that we never return 0 */
> +		memory -= highmem_dirtyable_memory(memory);
> +	if (mem_cgroup_has_dirty_limit())
> +		return memory + 1;

Vivek already pointed out this issue, I suppose. The check should be *not* mem_cgroup_has_dirty_limit().

> +	memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);

Can memcg_memory be 0?

> +	return min((unsigned long)memcg_memory, memory + 1);
>  }
> 
>  void
>  get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
>  		 unsigned long *pbdi_dirty, struct backing_dev_info *bdi)
>  {
> -	unsigned long background;
> -	unsigned long dirty;
> +	unsigned long dirty, background;
>  	unsigned long available_memory = determine_dirtyable_memory();
>  	struct task_struct *tsk;
> +	struct dirty_param dirty_param;
> +
> +	get_dirty_param(&dirty_param);
> 
> -	if (vm_dirty_bytes)
> -		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> +	if (dirty_param.dirty_bytes)
> +		dirty = DIV_ROUND_UP(dirty_param.dirty_bytes, PAGE_SIZE);
>  	else {
>  		int dirty_ratio;
> 
> -		dirty_ratio = vm_dirty_ratio;
> +		dirty_ratio = dirty_param.dirty_ratio;
>  		if (dirty_ratio < 5)
>  			dirty_ratio = 5;
>  		dirty = (dirty_ratio * available_memory) / 100;
>  	}
> 
> -	if (dirty_background_bytes)
> -		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> +	if (dirty_param.dirty_background_bytes)
> +		background = DIV_ROUND_UP(dirty_param.dirty_background_bytes,
> +						PAGE_SIZE);
>  	else
> -		background = (dirty_background_ratio * available_memory) / 100;
> -
> +		background = (dirty_param.dirty_background_ratio *
> +						available_memory) / 100;
>  	if (background >= dirty)
>  		background = dirty / 2;
>  	tsk = current;
> @@ -508,9 +516,15 @@ static void balance_dirty_pages(struct address_space *mapping,
>  		get_dirty_limits(&background_thresh, &dirty_thresh,
>  				&bdi_thresh, bdi);
> 
> -		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +		if (mem_cgroup_has_dirty_limit()) {
> +			nr_reclaimable =
> +				mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +			nr_writeback = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
> +		} else {
> +			nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
>  					global_page_state(NR_UNSTABLE_NFS);
> -		nr_writeback = global_page_state(NR_WRITEBACK);
> +			nr_writeback = global_page_state(NR_WRITEBACK);
> +		}
> 
>  		bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY);
>  		if (bdi_cap_account_unstable(bdi)) {
> @@ -611,10 +625,13 @@ static void balance_dirty_pages(struct address_space *mapping,
>  	 * In normal mode, we start background writeout at the lower
>  	 * background_thresh, to keep the amount of dirty memory low.
>  	 */
> +	if (mem_cgroup_has_dirty_limit())
> +		nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +	else
> +		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +				global_page_state(NR_UNSTABLE_NFS);
>  	if ((laptop_mode && pages_written) ||
> -	    (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
> -			       + global_page_state(NR_UNSTABLE_NFS))
> -					  > background_thresh)))
> +	    (!laptop_mode && (nr_reclaimable > background_thresh)))
>  		bdi_start_writeback(bdi, NULL, 0);
>  }
> 
> @@ -678,6 +695,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>  	unsigned long dirty_thresh;
> 
>          for ( ; ; ) {
> +		unsigned long dirty;
> +
>  		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
> 
>                  /*
> @@ -686,10 +705,15 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>                   */
>                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
> 
> -                if (global_page_state(NR_UNSTABLE_NFS) +
> -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> -                        	break;
> -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> +		if (mem_cgroup_has_dirty_limit())
> +			dirty = mem_cgroup_page_stat(
> +					MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> +		else
> +			dirty = global_page_state(NR_UNSTABLE_NFS) +
> +				global_page_state(NR_WRITEBACK);
> +		if (dirty <= dirty_thresh)
> +			break;
> +		congestion_wait(BLK_RW_ASYNC, HZ/10);
> 
>  		/*
>  		 * The caller might hold locks which can prevent IO completion
> @@ -1096,6 +1120,7 @@ int __set_page_dirty_no_writeback(struct page *page)
>  void account_page_dirtied(struct page *page, struct address_space *mapping)
>  {
>  	if (mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, 1);
>  		__inc_zone_page_state(page, NR_FILE_DIRTY);
>  		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
>  		task_dirty_inc(current);
> @@ -1297,6 +1322,8 @@ int clear_page_dirty_for_io(struct page *page)
>  		 * for more comments.
>  		 */
>  		if (TestClearPageDirty(page)) {
> +			mem_cgroup_update_stat(page,
> +					MEM_CGROUP_STAT_FILE_DIRTY, -1);
>  			dec_zone_page_state(page, NR_FILE_DIRTY);
>  			dec_bdi_stat(mapping->backing_dev_info,
>  					BDI_DIRTY);
> @@ -1332,8 +1359,10 @@ int test_clear_page_writeback(struct page *page)
>  	} else {
>  		ret = TestClearPageWriteback(page);
>  	}
> -	if (ret)
> +	if (ret) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_WRITEBACK, -1);
>  		dec_zone_page_state(page, NR_WRITEBACK);
> +	}
>  	return ret;
>  }
> 
> @@ -1363,8 +1392,10 @@ int test_set_page_writeback(struct page *page)
>  	} else {
>  		ret = TestSetPageWriteback(page);
>  	}
> -	if (!ret)
> +	if (!ret) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_WRITEBACK, 1);
>  		inc_zone_page_state(page, NR_WRITEBACK);
> +	}
>  	return ret;
> 
>  }
> diff --git a/mm/rmap.c b/mm/rmap.c
> index fcd593c..d47c257 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -829,7 +829,7 @@ void page_add_file_rmap(struct page *page)
>  {
>  	if (atomic_inc_and_test(&page->_mapcount)) {
>  		__inc_zone_page_state(page, NR_FILE_MAPPED);
> -		mem_cgroup_update_file_mapped(page, 1);
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, 1);
>  	}
>  }
> 
> @@ -861,7 +861,7 @@ void page_remove_rmap(struct page *page)
>  		__dec_zone_page_state(page, NR_ANON_PAGES);
>  	} else {
>  		__dec_zone_page_state(page, NR_FILE_MAPPED);
> -		mem_cgroup_update_file_mapped(page, -1);
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, -1);
>  	}
>  	/*
>  	 * It would be tidy to reset the PageAnon mapping here,
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 2466e0c..5f437e7 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -73,6 +73,8 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
>  	if (TestClearPageDirty(page)) {
>  		struct address_space *mapping = page->mapping;
>  		if (mapping && mapping_cap_account_dirty(mapping)) {
> +			mem_cgroup_update_stat(page,
> +					MEM_CGROUP_STAT_FILE_DIRTY, -1);
>  			dec_zone_page_state(page, NR_FILE_DIRTY);
>  			dec_bdi_stat(mapping->backing_dev_info,
>  					BDI_DIRTY);
> -- 
> 1.6.3.3
> 

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH -mmotm 4/4] memcg: dirty pages instrumentation
@ 2010-03-05  6:38     ` Balbir Singh
  0 siblings, 0 replies; 68+ messages in thread
From: Balbir Singh @ 2010-03-05  6:38 UTC (permalink / raw)
  To: Andrea Righi
  Cc: KAMEZAWA Hiroyuki, Vivek Goyal, Peter Zijlstra, Trond Myklebust,
	Suleiman Souhlal, Greg Thelen, Daisuke Nishimura,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

* Andrea Righi <arighi@develer.com> [2010-03-04 11:40:15]:

> Apply the cgroup dirty pages accounting and limiting infrastructure
> to the relevant kernel functions.
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>
> ---
>  fs/fuse/file.c      |    5 +++
>  fs/nfs/write.c      |    4 ++
>  fs/nilfs2/segment.c |   11 +++++-
>  mm/filemap.c        |    1 +
>  mm/page-writeback.c |   91 ++++++++++++++++++++++++++++++++++-----------------
>  mm/rmap.c           |    4 +-
>  mm/truncate.c       |    2 +
>  7 files changed, 84 insertions(+), 34 deletions(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index a9f5e13..dbbdd53 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -11,6 +11,7 @@
>  #include <linux/pagemap.h>
>  #include <linux/slab.h>
>  #include <linux/kernel.h>
> +#include <linux/memcontrol.h>
>  #include <linux/sched.h>
>  #include <linux/module.h>
> 
> @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
> 
>  	list_del(&req->writepages_entry);
>  	dec_bdi_stat(bdi, BDI_WRITEBACK);
> +	mem_cgroup_update_stat(req->pages[0],
> +			MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
>  	dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
>  	bdi_writeout_inc(bdi);
>  	wake_up(&fi->page_waitq);
> @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
>  	req->inode = inode;
> 
>  	inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
> +	mem_cgroup_update_stat(tmp_page,
> +			MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
>  	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
>  	end_page_writeback(page);
> 
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index b753242..7316f7a 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
>  			req->wb_index,
>  			NFS_PAGE_TAG_COMMIT);
>  	spin_unlock(&inode->i_lock);
> +	mem_cgroup_update_stat(req->wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1);
>  	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>  	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
>  	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> @@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
>  	struct page *page = req->wb_page;
> 
>  	if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
>  		dec_zone_page_state(page, NR_UNSTABLE_NFS);
>  		dec_bdi_stat(page->mapping->backing_dev_info, BDI_UNSTABLE);
>  		return 1;
> @@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
>  		req = nfs_list_entry(head->next);
>  		nfs_list_remove_request(req);
>  		nfs_mark_request_commit(req);
> +		mem_cgroup_update_stat(req->wb_page,
> +				MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
>  		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>  		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
>  				BDI_UNSTABLE);
> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> index ada2f1b..27a01b1 100644
> --- a/fs/nilfs2/segment.c
> +++ b/fs/nilfs2/segment.c
> @@ -24,6 +24,7 @@
>  #include <linux/pagemap.h>
>  #include <linux/buffer_head.h>
>  #include <linux/writeback.h>
> +#include <linux/memcontrol.h>
>  #include <linux/bio.h>
>  #include <linux/completion.h>
>  #include <linux/blkdev.h>
> @@ -1660,8 +1661,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
>  	} while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
>  	kunmap_atomic(kaddr, KM_USER0);
> 
> -	if (!TestSetPageWriteback(clone_page))
> +	if (!TestSetPageWriteback(clone_page)) {
> +		mem_cgroup_update_stat(clone_page,
> +				MEM_CGROUP_STAT_WRITEBACK, 1);

I wonder if we should start implementing inc and dec to avoid passing
the +1 and -1 parameters. It should make the code easier to read.

>  		inc_zone_page_state(clone_page, NR_WRITEBACK);
> +	}
>  	unlock_page(clone_page);
> 
>  	return 0;
> @@ -1783,8 +1787,11 @@ static void __nilfs_end_page_io(struct page *page, int err)
>  	}
> 
>  	if (buffer_nilfs_allocated(page_buffers(page))) {
> -		if (TestClearPageWriteback(page))
> +		if (TestClearPageWriteback(page)) {
> +			mem_cgroup_update_stat(page,
> +					MEM_CGROUP_STAT_WRITEBACK, -1);
>  			dec_zone_page_state(page, NR_WRITEBACK);
> +		}
>  	} else
>  		end_page_writeback(page);
>  }
> diff --git a/mm/filemap.c b/mm/filemap.c
> index fe09e51..f85acae 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
>  	 * having removed the page entirely.
>  	 */
>  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
>  		dec_zone_page_state(page, NR_FILE_DIRTY);
>  		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
>  	}
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 5a0f8f3..c5d14ea 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -137,13 +137,16 @@ static struct prop_descriptor vm_dirties;
>   */
>  static int calc_period_shift(void)
>  {
> +	struct dirty_param dirty_param;

vm_dirty_param?

>  	unsigned long dirty_total;
> 
> -	if (vm_dirty_bytes)
> -		dirty_total = vm_dirty_bytes / PAGE_SIZE;
> +	get_dirty_param(&dirty_param);

get_vm_dirty_param() is a nicer name.

> +
> +	if (dirty_param.dirty_bytes)
> +		dirty_total = dirty_param.dirty_bytes / PAGE_SIZE;
>  	else
> -		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> -				100;
> +		dirty_total = (dirty_param.dirty_ratio *
> +				determine_dirtyable_memory()) / 100;
>  	return 2 + ilog2(dirty_total - 1);
>  }
> 
> @@ -408,41 +411,46 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
>   */
>  unsigned long determine_dirtyable_memory(void)
>  {
> -	unsigned long x;
> -
> -	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> +	unsigned long memory;
> +	s64 memcg_memory;
> 
> +	memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
>  	if (!vm_highmem_is_dirtyable)
> -		x -= highmem_dirtyable_memory(x);
> -
> -	return x + 1;	/* Ensure that we never return 0 */
> +		memory -= highmem_dirtyable_memory(memory);
> +	if (mem_cgroup_has_dirty_limit())
> +		return memory + 1;

Vivek already pointed out this issue, I suppose. The check should be *not* mem_cgroup_has_dirty_limit().

> +	memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);

Can memcg_memory be 0?

> +	return min((unsigned long)memcg_memory, memory + 1);
>  }
> 
>  void
>  get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
>  		 unsigned long *pbdi_dirty, struct backing_dev_info *bdi)
>  {
> -	unsigned long background;
> -	unsigned long dirty;
> +	unsigned long dirty, background;
>  	unsigned long available_memory = determine_dirtyable_memory();
>  	struct task_struct *tsk;
> +	struct dirty_param dirty_param;
> +
> +	get_dirty_param(&dirty_param);
> 
> -	if (vm_dirty_bytes)
> -		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> +	if (dirty_param.dirty_bytes)
> +		dirty = DIV_ROUND_UP(dirty_param.dirty_bytes, PAGE_SIZE);
>  	else {
>  		int dirty_ratio;
> 
> -		dirty_ratio = vm_dirty_ratio;
> +		dirty_ratio = dirty_param.dirty_ratio;
>  		if (dirty_ratio < 5)
>  			dirty_ratio = 5;
>  		dirty = (dirty_ratio * available_memory) / 100;
>  	}
> 
> -	if (dirty_background_bytes)
> -		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> +	if (dirty_param.dirty_background_bytes)
> +		background = DIV_ROUND_UP(dirty_param.dirty_background_bytes,
> +						PAGE_SIZE);
>  	else
> -		background = (dirty_background_ratio * available_memory) / 100;
> -
> +		background = (dirty_param.dirty_background_ratio *
> +						available_memory) / 100;
>  	if (background >= dirty)
>  		background = dirty / 2;
>  	tsk = current;
> @@ -508,9 +516,15 @@ static void balance_dirty_pages(struct address_space *mapping,
>  		get_dirty_limits(&background_thresh, &dirty_thresh,
>  				&bdi_thresh, bdi);
> 
> -		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +		if (mem_cgroup_has_dirty_limit()) {
> +			nr_reclaimable =
> +				mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +			nr_writeback = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
> +		} else {
> +			nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
>  					global_page_state(NR_UNSTABLE_NFS);
> -		nr_writeback = global_page_state(NR_WRITEBACK);
> +			nr_writeback = global_page_state(NR_WRITEBACK);
> +		}
> 
>  		bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY);
>  		if (bdi_cap_account_unstable(bdi)) {
> @@ -611,10 +625,13 @@ static void balance_dirty_pages(struct address_space *mapping,
>  	 * In normal mode, we start background writeout at the lower
>  	 * background_thresh, to keep the amount of dirty memory low.
>  	 */
> +	if (mem_cgroup_has_dirty_limit())
> +		nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +	else
> +		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> +				global_page_state(NR_UNSTABLE_NFS);
>  	if ((laptop_mode && pages_written) ||
> -	    (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
> -			       + global_page_state(NR_UNSTABLE_NFS))
> -					  > background_thresh)))
> +	    (!laptop_mode && (nr_reclaimable > background_thresh)))
>  		bdi_start_writeback(bdi, NULL, 0);
>  }
> 
> @@ -678,6 +695,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>  	unsigned long dirty_thresh;
> 
>          for ( ; ; ) {
> +		unsigned long dirty;
> +
>  		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
> 
>                  /*
> @@ -686,10 +705,15 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>                   */
>                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
> 
> -                if (global_page_state(NR_UNSTABLE_NFS) +
> -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> -                        	break;
> -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> +		if (mem_cgroup_has_dirty_limit())
> +			dirty = mem_cgroup_page_stat(
> +					MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> +		else
> +			dirty = global_page_state(NR_UNSTABLE_NFS) +
> +				global_page_state(NR_WRITEBACK);
> +		if (dirty <= dirty_thresh)
> +			break;
> +		congestion_wait(BLK_RW_ASYNC, HZ/10);
> 
>  		/*
>  		 * The caller might hold locks which can prevent IO completion
> @@ -1096,6 +1120,7 @@ int __set_page_dirty_no_writeback(struct page *page)
>  void account_page_dirtied(struct page *page, struct address_space *mapping)
>  {
>  	if (mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, 1);
>  		__inc_zone_page_state(page, NR_FILE_DIRTY);
>  		__inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
>  		task_dirty_inc(current);
> @@ -1297,6 +1322,8 @@ int clear_page_dirty_for_io(struct page *page)
>  		 * for more comments.
>  		 */
>  		if (TestClearPageDirty(page)) {
> +			mem_cgroup_update_stat(page,
> +					MEM_CGROUP_STAT_FILE_DIRTY, -1);
>  			dec_zone_page_state(page, NR_FILE_DIRTY);
>  			dec_bdi_stat(mapping->backing_dev_info,
>  					BDI_DIRTY);
> @@ -1332,8 +1359,10 @@ int test_clear_page_writeback(struct page *page)
>  	} else {
>  		ret = TestClearPageWriteback(page);
>  	}
> -	if (ret)
> +	if (ret) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_WRITEBACK, -1);
>  		dec_zone_page_state(page, NR_WRITEBACK);
> +	}
>  	return ret;
>  }
> 
> @@ -1363,8 +1392,10 @@ int test_set_page_writeback(struct page *page)
>  	} else {
>  		ret = TestSetPageWriteback(page);
>  	}
> -	if (!ret)
> +	if (!ret) {
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_WRITEBACK, 1);
>  		inc_zone_page_state(page, NR_WRITEBACK);
> +	}
>  	return ret;
> 
>  }
> diff --git a/mm/rmap.c b/mm/rmap.c
> index fcd593c..d47c257 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -829,7 +829,7 @@ void page_add_file_rmap(struct page *page)
>  {
>  	if (atomic_inc_and_test(&page->_mapcount)) {
>  		__inc_zone_page_state(page, NR_FILE_MAPPED);
> -		mem_cgroup_update_file_mapped(page, 1);
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, 1);
>  	}
>  }
> 
> @@ -861,7 +861,7 @@ void page_remove_rmap(struct page *page)
>  		__dec_zone_page_state(page, NR_ANON_PAGES);
>  	} else {
>  		__dec_zone_page_state(page, NR_FILE_MAPPED);
> -		mem_cgroup_update_file_mapped(page, -1);
> +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, -1);
>  	}
>  	/*
>  	 * It would be tidy to reset the PageAnon mapping here,
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 2466e0c..5f437e7 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -73,6 +73,8 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
>  	if (TestClearPageDirty(page)) {
>  		struct address_space *mapping = page->mapping;
>  		if (mapping && mapping_cap_account_dirty(mapping)) {
> +			mem_cgroup_update_stat(page,
> +					MEM_CGROUP_STAT_FILE_DIRTY, -1);
>  			dec_zone_page_state(page, NR_FILE_DIRTY);
>  			dec_bdi_stat(mapping->backing_dev_info,
>  					BDI_DIRTY);
> -- 
> 1.6.3.3
> 

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure
@ 2010-03-05  7:01           ` Balbir Singh
  0 siblings, 0 replies; 68+ messages in thread
From: Balbir Singh @ 2010-03-05  7:01 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, Andrea Righi, Vivek Goyal, Peter Zijlstra,
	Trond Myklebust, Suleiman Souhlal, Greg Thelen,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-03-05 10:58:55]:

> On Fri, 5 Mar 2010 10:12:34 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > On Thu,  4 Mar 2010 11:40:14 +0100, Andrea Righi <arighi@develer.com> wrote:
> > > Infrastructure to account dirty pages per cgroup and add dirty limit
> > >  static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data)
> > >  {
> > >  	int *val = data;
> > > @@ -1275,34 +1423,70 @@ static void record_last_oom(struct mem_cgroup *mem)
> > >  }
> > >  
> > >  /*
> > > - * Currently used to update mapped file statistics, but the routine can be
> > > - * generalized to update other statistics as well.
> > > + * Generalized routine to update file cache's status for memcg.
> > > + *
> > > + * Before calling this, mapping->tree_lock should be held and preemption is
> > > + * disabled.  Then, it's guaranteed that the page is not uncharged while we
> > > + * access page_cgroup. We can make use of that.
> > >   */
> > IIUC, mapping->tree_lock is held with irq disabled, so I think "mapping->tree_lock
> > should be held with irq disabled" would be enough.
> > And, as far as I can see, callers of this function have not ensured this yet in [4/4].
> > 
> > how about:
> > 
> > 	void mem_cgroup_update_stat_locked(...)
> > 	{
> > 		...
> > 	}
> > 
> > 	void mem_cgroup_update_stat_unlocked(mapping, ...)
> > 	{
> > 		spin_lock_irqsave(mapping->tree_lock, ...);
> > 		mem_cgroup_update_stat_locked();
> > 		spin_unlock_irqrestore(...);
> > 	}
> >
> Rather than tree_lock, lock_page_cgroup() can be used if tree_lock is not held.
> 
> 		lock_page_cgroup();
> 		mem_cgroup_update_stat_locked();
> 		unlock_page_cgroup();
> 
> Andrea-san, FILE_MAPPED is updated without the tree_lock, at least. You can't
> depend on the migration lock for FILE_MAPPED.
>

FILE_MAPPED is updated under pte lock in the rmap context and
page_cgroup lock within update_file_mapped.
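For reference, the rmap-side hunk in 4/4 looks like this (the pte lock is
already held by the caller):

	/* page_add_file_rmap() */
	if (atomic_inc_and_test(&page->_mapcount)) {
		__inc_zone_page_state(page, NR_FILE_MAPPED);
		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, 1);
	}

and mem_cgroup_update_stat() then takes the page_cgroup lock
(lock_page_cgroup_migrate() in this version) before touching the counter.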
 

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure
  2010-03-05  1:12     ` Daisuke Nishimura
@ 2010-03-05 22:14       ` Andrea Righi
  -1 siblings, 0 replies; 68+ messages in thread
From: Andrea Righi @ 2010-03-05 22:14 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Vivek Goyal, Peter Zijlstra,
	Trond Myklebust, Suleiman Souhlal, Greg Thelen,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

On Fri, Mar 05, 2010 at 10:12:34AM +0900, Daisuke Nishimura wrote:
> On Thu,  4 Mar 2010 11:40:14 +0100, Andrea Righi <arighi@develer.com> wrote:
> > Infrastructure to account dirty pages per cgroup and add dirty limit
> > interfaces in the cgroupfs:
> > 
> >  - Direct write-out: memory.dirty_ratio, memory.dirty_bytes
> > 
> >  - Background write-out: memory.dirty_background_ratio, memory.dirty_background_bytes
> > 
> > Signed-off-by: Andrea Righi <arighi@develer.com>
> > ---
> >  include/linux/memcontrol.h |   80 ++++++++-
> >  mm/memcontrol.c            |  420 +++++++++++++++++++++++++++++++++++++++-----
> >  2 files changed, 450 insertions(+), 50 deletions(-)
> > 
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 1f9b119..cc3421b 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -19,12 +19,66 @@
> >  
> >  #ifndef _LINUX_MEMCONTROL_H
> >  #define _LINUX_MEMCONTROL_H
> > +
> > +#include <linux/writeback.h>
> >  #include <linux/cgroup.h>
> > +
> >  struct mem_cgroup;
> >  struct page_cgroup;
> >  struct page;
> >  struct mm_struct;
> >  
> > +/* Cgroup memory statistics items exported to the kernel */
> > +enum mem_cgroup_page_stat_item {
> > +	MEMCG_NR_DIRTYABLE_PAGES,
> > +	MEMCG_NR_RECLAIM_PAGES,
> > +	MEMCG_NR_WRITEBACK,
> > +	MEMCG_NR_DIRTY_WRITEBACK_PAGES,
> > +};
> > +
> > +/* Dirty memory parameters */
> > +struct dirty_param {
> > +	int dirty_ratio;
> > +	unsigned long dirty_bytes;
> > +	int dirty_background_ratio;
> > +	unsigned long dirty_background_bytes;
> > +};
> > +
> > +/*
> > + * Statistics for memory cgroup.
> > + */
> > +enum mem_cgroup_stat_index {
> > +	/*
> > +	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> > +	 */
> > +	MEM_CGROUP_STAT_CACHE,	   /* # of pages charged as cache */
> > +	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
> > +	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> > +	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
> > +	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
> > +	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> > +	MEM_CGROUP_EVENTS,	/* incremented at every  pagein/pageout */
> > +	MEM_CGROUP_STAT_FILE_DIRTY,   /* # of dirty pages in page cache */
> > +	MEM_CGROUP_STAT_WRITEBACK,   /* # of pages under writeback */
> > +	MEM_CGROUP_STAT_WRITEBACK_TEMP,   /* # of pages under writeback using
> > +						temporary buffers */
> > +	MEM_CGROUP_STAT_UNSTABLE_NFS,   /* # of NFS unstable pages */
> > +
> > +	MEM_CGROUP_STAT_NSTATS,
> > +};
> > +
> I must have said it earlier, but I don't think exporting all of these flags
> is a good idea.
> Can you export only mem_cgroup_page_stat_item (of course, we need to add MEMCG_NR_FILE_MAPPED)?
> We can translate mem_cgroup_page_stat_item to mem_cgroup_stat_index by simple arithmetic
> if you define MEM_CGROUP_STAT_FILE_MAPPED..MEM_CGROUP_STAT_UNSTABLE_NFS sequentially.

Agreed.
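For reference, with that layout the translation could be as simple as the
following (MEMCG_NR_FILE_MAPPED and the matching ordering are assumptions
here, they still need to be added):

	static inline enum mem_cgroup_stat_index
	memcg_page_stat_to_stat_index(enum mem_cgroup_page_stat_item item)
	{
		/* relies on both enums listing the per-page file stats in
		 * the same order, starting at *_FILE_MAPPED */
		return MEM_CGROUP_STAT_FILE_MAPPED +
				(item - MEMCG_NR_FILE_MAPPED);
	}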

> 
> > +/*
> > + * TODO: provide a validation check routine. And retry if validation
> > + * fails.
> > + */
> > +static inline void get_global_dirty_param(struct dirty_param *param)
> > +{
> > +	param->dirty_ratio = vm_dirty_ratio;
> > +	param->dirty_bytes = vm_dirty_bytes;
> > +	param->dirty_background_ratio = dirty_background_ratio;
> > +	param->dirty_background_bytes = dirty_background_bytes;
> > +}
> > +
> >  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> >  /*
> >   * All "charge" functions with gfp_mask should use GFP_KERNEL or
> > @@ -117,6 +171,10 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
> >  extern int do_swap_account;
> >  #endif
> >  
> > +extern bool mem_cgroup_has_dirty_limit(void);
> > +extern void get_dirty_param(struct dirty_param *param);
> > +extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item);
> > +
> >  static inline bool mem_cgroup_disabled(void)
> >  {
> >  	if (mem_cgroup_subsys.disabled)
> > @@ -125,7 +183,8 @@ static inline bool mem_cgroup_disabled(void)
> >  }
> >  
> >  extern bool mem_cgroup_oom_called(struct task_struct *task);
> > -void mem_cgroup_update_file_mapped(struct page *page, int val);
> > +void mem_cgroup_update_stat(struct page *page,
> > +			enum mem_cgroup_stat_index idx, int val);
> >  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> >  						gfp_t gfp_mask, int nid,
> >  						int zid);
> > @@ -300,8 +359,8 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
> >  {
> >  }
> >  
> > -static inline void mem_cgroup_update_file_mapped(struct page *page,
> > -							int val)
> > +static inline void mem_cgroup_update_stat(struct page *page,
> > +			enum mem_cgroup_stat_index idx, int val)
> >  {
> >  }
> >  
> > @@ -312,6 +371,21 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> >  	return 0;
> >  }
> >  
> > +static inline bool mem_cgroup_has_dirty_limit(void)
> > +{
> > +	return false;
> > +}
> > +
> > +static inline void get_dirty_param(struct dirty_param *param)
> > +{
> > +	get_global_dirty_param(param);
> > +}
> > +
> > +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> > +{
> > +	return -ENOSYS;
> > +}
> > +
> >  #endif /* CONFIG_CGROUP_MEM_CONT */
> >  
> >  #endif /* _LINUX_MEMCONTROL_H */
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 497b6f7..9842e7b 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -73,28 +73,23 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
> >  #define THRESHOLDS_EVENTS_THRESH (7) /* once in 128 */
> >  #define SOFTLIMIT_EVENTS_THRESH (10) /* once in 1024 */
> >  
> > -/*
> > - * Statistics for memory cgroup.
> > - */
> > -enum mem_cgroup_stat_index {
> > -	/*
> > -	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> > -	 */
> > -	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
> > -	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
> > -	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> > -	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
> > -	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
> > -	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> > -	MEM_CGROUP_EVENTS,	/* incremented at every  pagein/pageout */
> > -
> > -	MEM_CGROUP_STAT_NSTATS,
> > -};
> > -
> >  struct mem_cgroup_stat_cpu {
> >  	s64 count[MEM_CGROUP_STAT_NSTATS];
> >  };
> >  
> > +/* Per cgroup page statistics */
> > +struct mem_cgroup_page_stat {
> > +	enum mem_cgroup_page_stat_item item;
> > +	s64 value;
> > +};
> > +
> > +enum {
> > +	MEM_CGROUP_DIRTY_RATIO,
> > +	MEM_CGROUP_DIRTY_BYTES,
> > +	MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> > +	MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> > +};
> > +
> >  /*
> >   * per-zone information in memory controller.
> >   */
> > @@ -208,6 +203,9 @@ struct mem_cgroup {
> >  
> >  	unsigned int	swappiness;
> >  
> > +	/* control memory cgroup dirty pages */
> > +	struct dirty_param dirty_param;
> > +
> >  	/* set when res.limit == memsw.limit */
> >  	bool		memsw_is_minimum;
> >  
> > @@ -1033,6 +1031,156 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
> >  	return swappiness;
> >  }
> >  
> > +static bool dirty_param_is_valid(struct dirty_param *param)
> > +{
> > +	if (param->dirty_ratio && param->dirty_bytes)
> > +		return false;
> > +	if (param->dirty_background_ratio && param->dirty_background_bytes)
> > +		return false;
> > +	return true;
> > +}
> > +
> > +static void
> > +__mem_cgroup_get_dirty_param(struct dirty_param *param, struct mem_cgroup *mem)
> > +{
> > +	param->dirty_ratio = mem->dirty_param.dirty_ratio;
> > +	param->dirty_bytes = mem->dirty_param.dirty_bytes;
> > +	param->dirty_background_ratio = mem->dirty_param.dirty_background_ratio;
> > +	param->dirty_background_bytes = mem->dirty_param.dirty_background_bytes;
> > +}
> > +
> > +/*
> > + * get_dirty_param() - get dirty memory parameters of the current memcg
> > + * @param:	a structure is filled with the dirty memory settings
> > + *
> > + * The function fills @param with the current memcg dirty memory settings. If
> > + * memory cgroup is disabled or in case of error the structure is filled with
> > + * the global dirty memory settings.
> > + */
> > +void get_dirty_param(struct dirty_param *param)
> > +{
> > +	struct mem_cgroup *memcg;
> > +
> > +	if (mem_cgroup_disabled()) {
> > +		get_global_dirty_param(param);
> > +		return;
> > +	}
> > +	/*
> > +	 * It's possible that "current" may be moved to other cgroup while we
> > +	 * access cgroup. But precise check is meaningless because the task can
> > +	 * be moved after our access and writeback tends to take long time.
> > +	 * At least, "memcg" will not be freed under rcu_read_lock().
> > +	 */
> > +	while (1) {
> > +		rcu_read_lock();
> > +		memcg = mem_cgroup_from_task(current);
> > +		if (likely(memcg))
> > +			__mem_cgroup_get_dirty_param(param, memcg);
> > +		else
> > +			get_global_dirty_param(param);
> > +		rcu_read_unlock();
> > +		/*
> > +		 * Since global and memcg dirty_param are not protected we try
> > +		 * to speculatively read them and retry if we get inconsistent
> > +		 * values.
> > +		 */
> > +		if (likely(dirty_param_is_valid(param)))
> > +			break;
> > +	}
> > +}
> > +
> > +static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
> > +{
> > +	if (!do_swap_account)
> > +		return nr_swap_pages > 0;
> > +	return !memcg->memsw_is_minimum &&
> > +		(res_counter_read_u64(&memcg->memsw, RES_LIMIT) > 0);
> > +}
> > +
> > +static s64 mem_cgroup_get_local_page_stat(struct mem_cgroup *memcg,
> > +				enum mem_cgroup_page_stat_item item)
> > +{
> > +	s64 ret;
> > +
> > +	switch (item) {
> > +	case MEMCG_NR_DIRTYABLE_PAGES:
> > +		ret = res_counter_read_u64(&memcg->res, RES_LIMIT) -
> > +			res_counter_read_u64(&memcg->res, RES_USAGE);
> > +		/* Translate free memory in pages */
> > +		ret >>= PAGE_SHIFT;
> > +		ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_FILE) +
> > +			mem_cgroup_read_stat(memcg, LRU_INACTIVE_FILE);
> > +		if (mem_cgroup_can_swap(memcg))
> > +			ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_ANON) +
> > +				mem_cgroup_read_stat(memcg, LRU_INACTIVE_ANON);
> > +		break;
> > +	case MEMCG_NR_RECLAIM_PAGES:
> > +		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_FILE_DIRTY) +
> > +			mem_cgroup_read_stat(memcg,
> > +					MEM_CGROUP_STAT_UNSTABLE_NFS);
> > +		break;
> > +	case MEMCG_NR_WRITEBACK:
> > +		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
> > +		break;
> > +	case MEMCG_NR_DIRTY_WRITEBACK_PAGES:
> > +		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK) +
> > +			mem_cgroup_read_stat(memcg,
> > +				MEM_CGROUP_STAT_UNSTABLE_NFS);
> > +		break;
> > +	default:
> > +		BUG_ON(1);
> > +	}
> > +	return ret;
> > +}
> > +
> > +static int mem_cgroup_page_stat_cb(struct mem_cgroup *mem, void *data)
> > +{
> > +	struct mem_cgroup_page_stat *stat = (struct mem_cgroup_page_stat *)data;
> > +
> > +	stat->value += mem_cgroup_get_local_page_stat(mem, stat->item);
> > +	return 0;
> > +}
> > +
> > +/*
> > + * mem_cgroup_has_dirty_limit() - check if current memcg has local dirty limits
> > + *
> > + * Return true if the current memory cgroup has local dirty memory settings,
> > + * false otherwise.
> > + */
> > +bool mem_cgroup_has_dirty_limit(void)
> > +{
> > +	if (mem_cgroup_disabled())
> > +		return false;
> > +	return mem_cgroup_from_task(current) != NULL;
> > +}
> > +
> > +/*
> > + * mem_cgroup_page_stat() - get memory cgroup file cache statistics
> > + * @item:	memory statistic item exported to the kernel
> > + *
> > + * Return the accounted statistic value, or a negative value in case of error.
> > + */
> > +s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> > +{
> > +	struct mem_cgroup_page_stat stat = {};
> > +	struct mem_cgroup *memcg;
> > +
> > +	rcu_read_lock();
> > +	memcg = mem_cgroup_from_task(current);
> > +	if (memcg) {
> > +		/*
> > +		 * Recursively evaluate page statistics against all cgroups
> > +		 * in the hierarchy tree
> > +		 */
> > +		stat.item = item;
> > +		mem_cgroup_walk_tree(memcg, &stat, mem_cgroup_page_stat_cb);
> > +	} else
> > +		stat.value = -EINVAL;
> > +	rcu_read_unlock();
> > +
> > +	return stat.value;
> > +}
> > +
> >  static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data)
> >  {
> >  	int *val = data;
> > @@ -1275,34 +1423,70 @@ static void record_last_oom(struct mem_cgroup *mem)
> >  }
> >  
> >  /*
> > - * Currently used to update mapped file statistics, but the routine can be
> > - * generalized to update other statistics as well.
> > + * Generalized routine to update file cache's status for memcg.
> > + *
> > + * Before calling this, mapping->tree_lock should be held and preemption is
> > + * disabled.  Then, it's guaranteed that the page is not uncharged while we
> > + * access page_cgroup. We can make use of that.
> >   */
> IIUC, mapping->tree_lock is held with irq disabled, so I think "mapping->tree_lock
> should be held with irq disabled" would be enough.
> And, as far as I can see, callers of this function have not ensured this yet in [4/4].
> 
> how about:
> 
> 	void mem_cgroup_update_stat_locked(...)
> 	{
> 		...
> 	}
> 
> 	void mem_cgroup_update_stat_unlocked(mapping, ...)
> 	{
> 		spin_lock_irqsave(mapping->tree_lock, ...);
> 		mem_cgroup_update_stat_locked();
> 		spin_unlock_irqrestore(...);
> 	}

So, basically, lock_page_cgroup_migrate() should disable irqs and
unlock_page_cgroup_migrate() should re-enable them, except for updating
MEM_CGROUP_STAT_FILE_MAPPED, where just a lock/unlock_page_cgroup() is
needed. Right?

> 
> > -void mem_cgroup_update_file_mapped(struct page *page, int val)
> > +void mem_cgroup_update_stat(struct page *page,
> > +			enum mem_cgroup_stat_index idx, int val)
> >  {
> I prefer "void mem_cgroup_update_page_stat(struct page *, enum mem_cgroup_page_stat_item, ..)"
> as I said above.
> 
> >  	struct mem_cgroup *mem;
> >  	struct page_cgroup *pc;
> >  
> > +	if (mem_cgroup_disabled())
> > +		return;
> >  	pc = lookup_page_cgroup(page);
> > -	if (unlikely(!pc))
> > +	if (unlikely(!pc) || !PageCgroupUsed(pc))
> >  		return;
> >  
> > -	lock_page_cgroup(pc);
> > -	mem = pc->mem_cgroup;
> > -	if (!mem)
> > -		goto done;
> > -
> > -	if (!PageCgroupUsed(pc))
> > -		goto done;
> > -
> > +	lock_page_cgroup_migrate(pc);
> >  	/*
> > -	 * Preemption is already disabled. We can use __this_cpu_xxx
> > -	 */
> > -	__this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], val);
> > -
> > -done:
> > -	unlock_page_cgroup(pc);
> > +	* It's guaranteed that this page is never uncharged.
> > +	* The only racy problem is moving account among memcgs.
> > +	*/
> > +	switch (idx) {
> > +	case MEM_CGROUP_STAT_FILE_MAPPED:
> > +		if (val > 0)
> > +			SetPageCgroupFileMapped(pc);
> > +		else
> > +			ClearPageCgroupFileMapped(pc);
> > +		break;
> > +	case MEM_CGROUP_STAT_FILE_DIRTY:
> > +		if (val > 0)
> > +			SetPageCgroupDirty(pc);
> > +		else
> > +			ClearPageCgroupDirty(pc);
> > +		break;
> > +	case MEM_CGROUP_STAT_WRITEBACK:
> > +		if (val > 0)
> > +			SetPageCgroupWriteback(pc);
> > +		else
> > +			ClearPageCgroupWriteback(pc);
> > +		break;
> > +	case MEM_CGROUP_STAT_WRITEBACK_TEMP:
> > +		if (val > 0)
> > +			SetPageCgroupWritebackTemp(pc);
> > +		else
> > +			ClearPageCgroupWritebackTemp(pc);
> > +		break;
> > +	case MEM_CGROUP_STAT_UNSTABLE_NFS:
> > +		if (val > 0)
> > +			SetPageCgroupUnstableNFS(pc);
> > +		else
> > +			ClearPageCgroupUnstableNFS(pc);
> > +		break;
> > +	default:
> > +		BUG();
> > +		break;
> > +	}
> > +	mem = pc->mem_cgroup;
> > +	if (likely(mem))
> > +		__this_cpu_add(mem->stat->count[idx], val);
> > +	unlock_page_cgroup_migrate(pc);
> >  }
> > +EXPORT_SYMBOL_GPL(mem_cgroup_update_stat);
> >  
> >  /*
> >   * size of first charge trial. "32" comes from vmscan.c's magic value.
> > @@ -1701,6 +1885,45 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
> >  	memcg_check_events(mem, pc->page);
> >  }
> >  
> > +/*
> > + * Update file cache accounted statistics on task migration.
> > + *
> > + * TODO: We don't move charges of file (including shmem/tmpfs) pages for now.
> > + * So, at the moment this function simply returns without updating accounted
> > + * statistics, because we deal only with anonymous pages here.
> > + */
> This function is not unique to task migration. It's called from rmdir() too.
> So this comment isn't needed.

Agreed.

> 
> > +static void __mem_cgroup_update_file_stat(struct page_cgroup *pc,
> > +	struct mem_cgroup *from, struct mem_cgroup *to)
> > +{
> > +	struct page *page = pc->page;
> > +
> > +	if (!page_mapped(page) || PageAnon(page))
> > +		return;
> > +
> > +	if (PageCgroupFileMapped(pc)) {
> > +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > +	}
> > +	if (PageCgroupDirty(pc)) {
> > +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_DIRTY]);
> > +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_DIRTY]);
> > +	}
> > +	if (PageCgroupWriteback(pc)) {
> > +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_WRITEBACK]);
> > +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WRITEBACK]);
> > +	}
> > +	if (PageCgroupWritebackTemp(pc)) {
> > +		__this_cpu_dec(
> > +			from->stat->count[MEM_CGROUP_STAT_WRITEBACK_TEMP]);
> > +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WRITEBACK_TEMP]);
> > +	}
> > +	if (PageCgroupUnstableNFS(pc)) {
> > +		__this_cpu_dec(
> > +			from->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
> > +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
> > +	}
> > +}
> > +
> >  /**
> >   * __mem_cgroup_move_account - move account of the page
> >   * @pc:	page_cgroup of the page.
> > @@ -1721,22 +1944,16 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
> >  static void __mem_cgroup_move_account(struct page_cgroup *pc,
> >  	struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
> >  {
> > -	struct page *page;
> > -
> >  	VM_BUG_ON(from == to);
> >  	VM_BUG_ON(PageLRU(pc->page));
> >  	VM_BUG_ON(!PageCgroupLocked(pc));
> >  	VM_BUG_ON(!PageCgroupUsed(pc));
> >  	VM_BUG_ON(pc->mem_cgroup != from);
> >  
> > -	page = pc->page;
> > -	if (page_mapped(page) && !PageAnon(page)) {
> > -		/* Update mapped_file data for mem_cgroup */
> > -		preempt_disable();
> > -		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > -		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > -		preempt_enable();
> > -	}
> > +	preempt_disable();
> > +	lock_page_cgroup_migrate(pc);
> > +	__mem_cgroup_update_file_stat(pc, from, to);
> > +
> >  	mem_cgroup_charge_statistics(from, pc, false);
> >  	if (uncharge)
> >  		/* This is not "cancel", but cancel_charge does all we need. */
> > @@ -1745,6 +1962,8 @@ static void __mem_cgroup_move_account(struct page_cgroup *pc,
> >  	/* caller should have done css_get */
> >  	pc->mem_cgroup = to;
> >  	mem_cgroup_charge_statistics(to, pc, true);
> > +	unlock_page_cgroup_migrate(pc);
> > +	preempt_enable();
> Glad to see this cleanup :)
> But, hmm, I don't think preempt_disable/enable() is enough (and bit_spin_lock/unlock()
> does it anyway). lock/unlock_page_cgroup_migrate() can be called under irq context
> (e.g. end_page_writeback()), so I think we must local_irq_disable()/enable() here.

You're right. So in this case too, irqs must be disabled and re-enabled by
lock/unlock_page_cgroup_migrate(). And again, FILE_MAPPED just needs
lock/unlock_page_cgroup().
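A rough sketch of that locking rule, just to make it concrete (the flags
argument and the PCG_MIGRATE_LOCK bit name are assumptions here, not
necessarily what the next version will use):

	static inline void lock_page_cgroup_migrate(struct page_cgroup *pc,
					unsigned long *irq_flags)
	{
		local_irq_save(*irq_flags);
		bit_spin_lock(PCG_MIGRATE_LOCK, &pc->flags);
	}

	static inline void unlock_page_cgroup_migrate(struct page_cgroup *pc,
					unsigned long *irq_flags)
	{
		bit_spin_unlock(PCG_MIGRATE_LOCK, &pc->flags);
		local_irq_restore(*irq_flags);
	}

	/* FILE_MAPPED updates keep using plain lock/unlock_page_cgroup() */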

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure
@ 2010-03-05 22:14       ` Andrea Righi
  0 siblings, 0 replies; 68+ messages in thread
From: Andrea Righi @ 2010-03-05 22:14 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: KAMEZAWA Hiroyuki, Balbir Singh, Vivek Goyal, Peter Zijlstra,
	Trond Myklebust, Suleiman Souhlal, Greg Thelen,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

On Fri, Mar 05, 2010 at 10:12:34AM +0900, Daisuke Nishimura wrote:
> On Thu,  4 Mar 2010 11:40:14 +0100, Andrea Righi <arighi@develer.com> wrote:
> > Infrastructure to account dirty pages per cgroup and add dirty limit
> > interfaces in the cgroupfs:
> > 
> >  - Direct write-out: memory.dirty_ratio, memory.dirty_bytes
> > 
> >  - Background write-out: memory.dirty_background_ratio, memory.dirty_background_bytes
> > 
> > Signed-off-by: Andrea Righi <arighi@develer.com>
> > ---
> >  include/linux/memcontrol.h |   80 ++++++++-
> >  mm/memcontrol.c            |  420 +++++++++++++++++++++++++++++++++++++++-----
> >  2 files changed, 450 insertions(+), 50 deletions(-)
> > 
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 1f9b119..cc3421b 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -19,12 +19,66 @@
> >  
> >  #ifndef _LINUX_MEMCONTROL_H
> >  #define _LINUX_MEMCONTROL_H
> > +
> > +#include <linux/writeback.h>
> >  #include <linux/cgroup.h>
> > +
> >  struct mem_cgroup;
> >  struct page_cgroup;
> >  struct page;
> >  struct mm_struct;
> >  
> > +/* Cgroup memory statistics items exported to the kernel */
> > +enum mem_cgroup_page_stat_item {
> > +	MEMCG_NR_DIRTYABLE_PAGES,
> > +	MEMCG_NR_RECLAIM_PAGES,
> > +	MEMCG_NR_WRITEBACK,
> > +	MEMCG_NR_DIRTY_WRITEBACK_PAGES,
> > +};
> > +
> > +/* Dirty memory parameters */
> > +struct dirty_param {
> > +	int dirty_ratio;
> > +	unsigned long dirty_bytes;
> > +	int dirty_background_ratio;
> > +	unsigned long dirty_background_bytes;
> > +};
> > +
> > +/*
> > + * Statistics for memory cgroup.
> > + */
> > +enum mem_cgroup_stat_index {
> > +	/*
> > +	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> > +	 */
> > +	MEM_CGROUP_STAT_CACHE,	   /* # of pages charged as cache */
> > +	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
> > +	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> > +	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
> > +	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
> > +	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> > +	MEM_CGROUP_EVENTS,	/* incremented at every  pagein/pageout */
> > +	MEM_CGROUP_STAT_FILE_DIRTY,   /* # of dirty pages in page cache */
> > +	MEM_CGROUP_STAT_WRITEBACK,   /* # of pages under writeback */
> > +	MEM_CGROUP_STAT_WRITEBACK_TEMP,   /* # of pages under writeback using
> > +						temporary buffers */
> > +	MEM_CGROUP_STAT_UNSTABLE_NFS,   /* # of NFS unstable pages */
> > +
> > +	MEM_CGROUP_STAT_NSTATS,
> > +};
> > +
> I must have said it earlier, but I don't think exporting all of these flags
> is a good idea.
> Can you export only mem_cgroup_page_stat_item (of course, we need to add MEMCG_NR_FILE_MAPPED)?
> We can translate mem_cgroup_page_stat_item to mem_cgroup_stat_index by simple arithmetic
> if you define MEM_CGROUP_STAT_FILE_MAPPED..MEM_CGROUP_STAT_UNSTABLE_NFS sequentially.

Agreed.
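
Something like the following, I guess (just a sketch, and it assumes
MEMCG_NR_FILE_MAPPED is added and that the exported items mirror the
MEM_CGROUP_STAT_FILE_MAPPED..MEM_CGROUP_STAT_UNSTABLE_NFS indexes, kept
contiguous and in the same order):

	/* translate an exported page stat item into the internal index */
	static inline enum mem_cgroup_stat_index
	mem_cgroup_to_stat_index(enum mem_cgroup_page_stat_item item)
	{
		return MEM_CGROUP_STAT_FILE_MAPPED +
				(item - MEMCG_NR_FILE_MAPPED);
	}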

> 
> > +/*
> > + * TODO: provide a validation check routine. And retry if validation
> > + * fails.
> > + */
> > +static inline void get_global_dirty_param(struct dirty_param *param)
> > +{
> > +	param->dirty_ratio = vm_dirty_ratio;
> > +	param->dirty_bytes = vm_dirty_bytes;
> > +	param->dirty_background_ratio = dirty_background_ratio;
> > +	param->dirty_background_bytes = dirty_background_bytes;
> > +}
> > +
> >  #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> >  /*
> >   * All "charge" functions with gfp_mask should use GFP_KERNEL or
> > @@ -117,6 +171,10 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
> >  extern int do_swap_account;
> >  #endif
> >  
> > +extern bool mem_cgroup_has_dirty_limit(void);
> > +extern void get_dirty_param(struct dirty_param *param);
> > +extern s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item);
> > +
> >  static inline bool mem_cgroup_disabled(void)
> >  {
> >  	if (mem_cgroup_subsys.disabled)
> > @@ -125,7 +183,8 @@ static inline bool mem_cgroup_disabled(void)
> >  }
> >  
> >  extern bool mem_cgroup_oom_called(struct task_struct *task);
> > -void mem_cgroup_update_file_mapped(struct page *page, int val);
> > +void mem_cgroup_update_stat(struct page *page,
> > +			enum mem_cgroup_stat_index idx, int val);
> >  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> >  						gfp_t gfp_mask, int nid,
> >  						int zid);
> > @@ -300,8 +359,8 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
> >  {
> >  }
> >  
> > -static inline void mem_cgroup_update_file_mapped(struct page *page,
> > -							int val)
> > +static inline void mem_cgroup_update_stat(struct page *page,
> > +			enum mem_cgroup_stat_index idx, int val)
> >  {
> >  }
> >  
> > @@ -312,6 +371,21 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> >  	return 0;
> >  }
> >  
> > +static inline bool mem_cgroup_has_dirty_limit(void)
> > +{
> > +	return false;
> > +}
> > +
> > +static inline void get_dirty_param(struct dirty_param *param)
> > +{
> > +	get_global_dirty_param(param);
> > +}
> > +
> > +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> > +{
> > +	return -ENOSYS;
> > +}
> > +
> >  #endif /* CONFIG_CGROUP_MEM_CONT */
> >  
> >  #endif /* _LINUX_MEMCONTROL_H */
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 497b6f7..9842e7b 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -73,28 +73,23 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/
> >  #define THRESHOLDS_EVENTS_THRESH (7) /* once in 128 */
> >  #define SOFTLIMIT_EVENTS_THRESH (10) /* once in 1024 */
> >  
> > -/*
> > - * Statistics for memory cgroup.
> > - */
> > -enum mem_cgroup_stat_index {
> > -	/*
> > -	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
> > -	 */
> > -	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
> > -	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
> > -	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> > -	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
> > -	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
> > -	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> > -	MEM_CGROUP_EVENTS,	/* incremented at every  pagein/pageout */
> > -
> > -	MEM_CGROUP_STAT_NSTATS,
> > -};
> > -
> >  struct mem_cgroup_stat_cpu {
> >  	s64 count[MEM_CGROUP_STAT_NSTATS];
> >  };
> >  
> > +/* Per cgroup page statistics */
> > +struct mem_cgroup_page_stat {
> > +	enum mem_cgroup_page_stat_item item;
> > +	s64 value;
> > +};
> > +
> > +enum {
> > +	MEM_CGROUP_DIRTY_RATIO,
> > +	MEM_CGROUP_DIRTY_BYTES,
> > +	MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> > +	MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> > +};
> > +
> >  /*
> >   * per-zone information in memory controller.
> >   */
> > @@ -208,6 +203,9 @@ struct mem_cgroup {
> >  
> >  	unsigned int	swappiness;
> >  
> > +	/* control memory cgroup dirty pages */
> > +	struct dirty_param dirty_param;
> > +
> >  	/* set when res.limit == memsw.limit */
> >  	bool		memsw_is_minimum;
> >  
> > @@ -1033,6 +1031,156 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
> >  	return swappiness;
> >  }
> >  
> > +static bool dirty_param_is_valid(struct dirty_param *param)
> > +{
> > +	if (param->dirty_ratio && param->dirty_bytes)
> > +		return false;
> > +	if (param->dirty_background_ratio && param->dirty_background_bytes)
> > +		return false;
> > +	return true;
> > +}
> > +
> > +static void
> > +__mem_cgroup_get_dirty_param(struct dirty_param *param, struct mem_cgroup *mem)
> > +{
> > +	param->dirty_ratio = mem->dirty_param.dirty_ratio;
> > +	param->dirty_bytes = mem->dirty_param.dirty_bytes;
> > +	param->dirty_background_ratio = mem->dirty_param.dirty_background_ratio;
> > +	param->dirty_background_bytes = mem->dirty_param.dirty_background_bytes;
> > +}
> > +
> > +/*
> > + * get_dirty_param() - get dirty memory parameters of the current memcg
> > + * @param:	a structure is filled with the dirty memory settings
> > + *
> > + * The function fills @param with the current memcg dirty memory settings. If
> > + * memory cgroup is disabled or in case of error the structure is filled with
> > + * the global dirty memory settings.
> > + */
> > +void get_dirty_param(struct dirty_param *param)
> > +{
> > +	struct mem_cgroup *memcg;
> > +
> > +	if (mem_cgroup_disabled()) {
> > +		get_global_dirty_param(param);
> > +		return;
> > +	}
> > +	/*
> > +	 * It's possible that "current" may be moved to other cgroup while we
> > +	 * access cgroup. But precise check is meaningless because the task can
> > +	 * be moved after our access and writeback tends to take long time.
> > +	 * At least, "memcg" will not be freed under rcu_read_lock().
> > +	 */
> > +	while (1) {
> > +		rcu_read_lock();
> > +		memcg = mem_cgroup_from_task(current);
> > +		if (likely(memcg))
> > +			__mem_cgroup_get_dirty_param(param, memcg);
> > +		else
> > +			get_global_dirty_param(param);
> > +		rcu_read_unlock();
> > +		/*
> > +		 * Since global and memcg dirty_param are not protected we try
> > +		 * to speculatively read them and retry if we get inconsistent
> > +		 * values.
> > +		 */
> > +		if (likely(dirty_param_is_valid(param)))
> > +			break;
> > +	}
> > +}
> > +
> > +static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
> > +{
> > +	if (!do_swap_account)
> > +		return nr_swap_pages > 0;
> > +	return !memcg->memsw_is_minimum &&
> > +		(res_counter_read_u64(&memcg->memsw, RES_LIMIT) > 0);
> > +}
> > +
> > +static s64 mem_cgroup_get_local_page_stat(struct mem_cgroup *memcg,
> > +				enum mem_cgroup_page_stat_item item)
> > +{
> > +	s64 ret;
> > +
> > +	switch (item) {
> > +	case MEMCG_NR_DIRTYABLE_PAGES:
> > +		ret = res_counter_read_u64(&memcg->res, RES_LIMIT) -
> > +			res_counter_read_u64(&memcg->res, RES_USAGE);
> > +		/* Translate free memory in pages */
> > +		ret >>= PAGE_SHIFT;
> > +		ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_FILE) +
> > +			mem_cgroup_read_stat(memcg, LRU_INACTIVE_FILE);
> > +		if (mem_cgroup_can_swap(memcg))
> > +			ret += mem_cgroup_read_stat(memcg, LRU_ACTIVE_ANON) +
> > +				mem_cgroup_read_stat(memcg, LRU_INACTIVE_ANON);
> > +		break;
> > +	case MEMCG_NR_RECLAIM_PAGES:
> > +		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_FILE_DIRTY) +
> > +			mem_cgroup_read_stat(memcg,
> > +					MEM_CGROUP_STAT_UNSTABLE_NFS);
> > +		break;
> > +	case MEMCG_NR_WRITEBACK:
> > +		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
> > +		break;
> > +	case MEMCG_NR_DIRTY_WRITEBACK_PAGES:
> > +		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK) +
> > +			mem_cgroup_read_stat(memcg,
> > +				MEM_CGROUP_STAT_UNSTABLE_NFS);
> > +		break;
> > +	default:
> > +		BUG_ON(1);
> > +	}
> > +	return ret;
> > +}
> > +
> > +static int mem_cgroup_page_stat_cb(struct mem_cgroup *mem, void *data)
> > +{
> > +	struct mem_cgroup_page_stat *stat = (struct mem_cgroup_page_stat *)data;
> > +
> > +	stat->value += mem_cgroup_get_local_page_stat(mem, stat->item);
> > +	return 0;
> > +}
> > +
> > +/*
> > + * mem_cgroup_has_dirty_limit() - check if current memcg has local dirty limits
> > + *
> > + * Return true if the current memory cgroup has local dirty memory settings,
> > + * false otherwise.
> > + */
> > +bool mem_cgroup_has_dirty_limit(void)
> > +{
> > +	if (mem_cgroup_disabled())
> > +		return false;
> > +	return mem_cgroup_from_task(current) != NULL;
> > +}
> > +
> > +/*
> > + * mem_cgroup_page_stat() - get memory cgroup file cache statistics
> > + * @item:	memory statistic item exported to the kernel
> > + *
> > + * Return the accounted statistic value, or a negative value in case of error.
> > + */
> > +s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item)
> > +{
> > +	struct mem_cgroup_page_stat stat = {};
> > +	struct mem_cgroup *memcg;
> > +
> > +	rcu_read_lock();
> > +	memcg = mem_cgroup_from_task(current);
> > +	if (memcg) {
> > +		/*
> > +		 * Recursively evaluate page statistics against all cgroups
> > +		 * under the hierarchy tree
> > +		 */
> > +		stat.item = item;
> > +		mem_cgroup_walk_tree(memcg, &stat, mem_cgroup_page_stat_cb);
> > +	} else
> > +		stat.value = -EINVAL;
> > +	rcu_read_unlock();
> > +
> > +	return stat.value;
> > +}
> > +
> >  static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data)
> >  {
> >  	int *val = data;
> > @@ -1275,34 +1423,70 @@ static void record_last_oom(struct mem_cgroup *mem)
> >  }
> >  
> >  /*
> > - * Currently used to update mapped file statistics, but the routine can be
> > - * generalized to update other statistics as well.
> > + * Generalized routine to update file cache's status for memcg.
> > + *
> > + * Before calling this, mapping->tree_lock should be held and preemption is
> > + * disabled.  Then, it's guaranteed that the page is not uncharged while we
> > + * access page_cgroup. We can make use of that.
> >   */
> IIUC, mapping->tree_lock is held with irq disabled, so I think "mapping->tree_lock
> should be held with irq disabled" would be enough.
> And, as far as I can see, callers of this function have not ensured this yet in [4/4].
> 
> how about:
> 
> 	void mem_cgroup_update_stat_locked(...)
> 	{
> 		...
> 	}
> 
> 	void mem_cgroup_update_stat_unlocked(mapping, ...)
> 	{
> 		spin_lock_irqsave(mapping->tree_lock, ...);
> 		mem_cgroup_update_stat_locked();
> 		spin_unlock_irqrestore(...);
> 	}

So, basically, lock_page_cgroup_migrate() should disable irqs and
unlock_page_cgroup_migrate() should re-enable them, except for updating
MEM_CGROUP_STAT_FILE_MAPPED, where just a lock/unlock_page_cgroup() is
needed. Right?

> 
> > -void mem_cgroup_update_file_mapped(struct page *page, int val)
> > +void mem_cgroup_update_stat(struct page *page,
> > +			enum mem_cgroup_stat_index idx, int val)
> >  {
> I prefer "void mem_cgroup_update_page_stat(struct page *, enum mem_cgroup_page_stat_item, ..)"
> as I said above.
> 
> >  	struct mem_cgroup *mem;
> >  	struct page_cgroup *pc;
> >  
> > +	if (mem_cgroup_disabled())
> > +		return;
> >  	pc = lookup_page_cgroup(page);
> > -	if (unlikely(!pc))
> > +	if (unlikely(!pc) || !PageCgroupUsed(pc))
> >  		return;
> >  
> > -	lock_page_cgroup(pc);
> > -	mem = pc->mem_cgroup;
> > -	if (!mem)
> > -		goto done;
> > -
> > -	if (!PageCgroupUsed(pc))
> > -		goto done;
> > -
> > +	lock_page_cgroup_migrate(pc);
> >  	/*
> > -	 * Preemption is already disabled. We can use __this_cpu_xxx
> > -	 */
> > -	__this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], val);
> > -
> > -done:
> > -	unlock_page_cgroup(pc);
> > +	* It's guaranteed that this page is never uncharged.
> > +	* The only racy problem is moving account among memcgs.
> > +	*/
> > +	switch (idx) {
> > +	case MEM_CGROUP_STAT_FILE_MAPPED:
> > +		if (val > 0)
> > +			SetPageCgroupFileMapped(pc);
> > +		else
> > +			ClearPageCgroupFileMapped(pc);
> > +		break;
> > +	case MEM_CGROUP_STAT_FILE_DIRTY:
> > +		if (val > 0)
> > +			SetPageCgroupDirty(pc);
> > +		else
> > +			ClearPageCgroupDirty(pc);
> > +		break;
> > +	case MEM_CGROUP_STAT_WRITEBACK:
> > +		if (val > 0)
> > +			SetPageCgroupWriteback(pc);
> > +		else
> > +			ClearPageCgroupWriteback(pc);
> > +		break;
> > +	case MEM_CGROUP_STAT_WRITEBACK_TEMP:
> > +		if (val > 0)
> > +			SetPageCgroupWritebackTemp(pc);
> > +		else
> > +			ClearPageCgroupWritebackTemp(pc);
> > +		break;
> > +	case MEM_CGROUP_STAT_UNSTABLE_NFS:
> > +		if (val > 0)
> > +			SetPageCgroupUnstableNFS(pc);
> > +		else
> > +			ClearPageCgroupUnstableNFS(pc);
> > +		break;
> > +	default:
> > +		BUG();
> > +		break;
> > +	}
> > +	mem = pc->mem_cgroup;
> > +	if (likely(mem))
> > +		__this_cpu_add(mem->stat->count[idx], val);
> > +	unlock_page_cgroup_migrate(pc);
> >  }
> > +EXPORT_SYMBOL_GPL(mem_cgroup_update_stat);
> >  
> >  /*
> >   * size of first charge trial. "32" comes from vmscan.c's magic value.
> > @@ -1701,6 +1885,45 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
> >  	memcg_check_events(mem, pc->page);
> >  }
> >  
> > +/*
> > + * Update file cache accounted statistics on task migration.
> > + *
> > + * TODO: We don't move charges of file (including shmem/tmpfs) pages for now.
> > + * So, at the moment this function simply returns without updating accounted
> > + * statistics, because we deal only with anonymous pages here.
> > + */
> This function is not unique to task migration. It's called from rmdir() too.
> So this comment isn't needed.

Agreed.

> 
> > +static void __mem_cgroup_update_file_stat(struct page_cgroup *pc,
> > +	struct mem_cgroup *from, struct mem_cgroup *to)
> > +{
> > +	struct page *page = pc->page;
> > +
> > +	if (!page_mapped(page) || PageAnon(page))
> > +		return;
> > +
> > +	if (PageCgroupFileMapped(pc)) {
> > +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > +	}
> > +	if (PageCgroupDirty(pc)) {
> > +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_DIRTY]);
> > +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_DIRTY]);
> > +	}
> > +	if (PageCgroupWriteback(pc)) {
> > +		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_WRITEBACK]);
> > +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WRITEBACK]);
> > +	}
> > +	if (PageCgroupWritebackTemp(pc)) {
> > +		__this_cpu_dec(
> > +			from->stat->count[MEM_CGROUP_STAT_WRITEBACK_TEMP]);
> > +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_WRITEBACK_TEMP]);
> > +	}
> > +	if (PageCgroupUnstableNFS(pc)) {
> > +		__this_cpu_dec(
> > +			from->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
> > +		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_UNSTABLE_NFS]);
> > +	}
> > +}
> > +
> >  /**
> >   * __mem_cgroup_move_account - move account of the page
> >   * @pc:	page_cgroup of the page.
> > @@ -1721,22 +1944,16 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
> >  static void __mem_cgroup_move_account(struct page_cgroup *pc,
> >  	struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
> >  {
> > -	struct page *page;
> > -
> >  	VM_BUG_ON(from == to);
> >  	VM_BUG_ON(PageLRU(pc->page));
> >  	VM_BUG_ON(!PageCgroupLocked(pc));
> >  	VM_BUG_ON(!PageCgroupUsed(pc));
> >  	VM_BUG_ON(pc->mem_cgroup != from);
> >  
> > -	page = pc->page;
> > -	if (page_mapped(page) && !PageAnon(page)) {
> > -		/* Update mapped_file data for mem_cgroup */
> > -		preempt_disable();
> > -		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > -		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> > -		preempt_enable();
> > -	}
> > +	preempt_disable();
> > +	lock_page_cgroup_migrate(pc);
> > +	__mem_cgroup_update_file_stat(pc, from, to);
> > +
> >  	mem_cgroup_charge_statistics(from, pc, false);
> >  	if (uncharge)
> >  		/* This is not "cancel", but cancel_charge does all we need. */
> > @@ -1745,6 +1962,8 @@ static void __mem_cgroup_move_account(struct page_cgroup *pc,
> >  	/* caller should have done css_get */
> >  	pc->mem_cgroup = to;
> >  	mem_cgroup_charge_statistics(to, pc, true);
> > +	unlock_page_cgroup_migrate(pc);
> > +	preempt_enable();
> Glad to see this cleanup :)
> But, hmm, I don't think preempt_disable/enable() is enough (and bit_spin_lock/unlock()
> does it anyway). lock/unlock_page_cgroup_migrate() can be called in irq context
> (e.g. end_page_writeback()), so I think we must use local_irq_disable()/enable() here.

You're right. So, in this case too irqs must be disabled/enabled by
lock/unlock_page_cgroup_migrate(). And again, FILE_MAPPED just needs
lock/unlock_page_cgroup().
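
For example, something like this (only a sketch on top of the existing
PCG_MIGRATE_LOCK bit; it changes the helpers to carry the saved irq
flags, which is not what the current patch does):

	static inline void lock_page_cgroup_migrate(struct page_cgroup *pc,
						unsigned long *flags)
	{
		/* keep irq context users (e.g. end_page_writeback()) out */
		local_irq_save(*flags);
		bit_spin_lock(PCG_MIGRATE_LOCK, &pc->flags);
	}

	static inline void unlock_page_cgroup_migrate(struct page_cgroup *pc,
						unsigned long flags)
	{
		bit_spin_unlock(PCG_MIGRATE_LOCK, &pc->flags);
		local_irq_restore(flags);
	}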

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure
  2010-03-05  1:58       ` KAMEZAWA Hiroyuki
@ 2010-03-05 22:14         ` Andrea Righi
  -1 siblings, 0 replies; 68+ messages in thread
From: Andrea Righi @ 2010-03-05 22:14 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, Balbir Singh, Vivek Goyal, Peter Zijlstra,
	Trond Myklebust, Suleiman Souhlal, Greg Thelen,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

On Fri, Mar 05, 2010 at 10:58:55AM +0900, KAMEZAWA Hiroyuki wrote:
> On Fri, 5 Mar 2010 10:12:34 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > On Thu,  4 Mar 2010 11:40:14 +0100, Andrea Righi <arighi@develer.com> wrote:
> > > Infrastructure to account dirty pages per cgroup and add dirty limit
> > >  static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data)
> > >  {
> > >  	int *val = data;
> > > @@ -1275,34 +1423,70 @@ static void record_last_oom(struct mem_cgroup *mem)
> > >  }
> > >  
> > >  /*
> > > - * Currently used to update mapped file statistics, but the routine can be
> > > - * generalized to update other statistics as well.
> > > + * Generalized routine to update file cache's status for memcg.
> > > + *
> > > + * Before calling this, mapping->tree_lock should be held and preemption is
> > > + * disabled.  Then, it's guaranteed that the page is not uncharged while we
> > > + * access page_cgroup. We can make use of that.
> > >   */
> > IIUC, mapping->tree_lock is held with irq disabled, so I think "mapping->tree_lock
> > should be held with irq disabled" would be enough.
> > And, as far as I can see, callers of this function have not ensured this yet in [4/4].
> > 
> > how about:
> > 
> > 	void mem_cgroup_update_stat_locked(...)
> > 	{
> > 		...
> > 	}
> > 
> > 	void mem_cgroup_update_stat_unlocked(mapping, ...)
> > 	{
> > 		spin_lock_irqsave(mapping->tree_lock, ...);
> > 		mem_cgroup_update_stat_locked();
> > 		spin_unlock_irqrestore(...);
> > 	}
> >
> Rather than tree_lock, lock_page_cgroup() can be used if tree_lock is not held.
> 
> 		lock_page_cgroup();
> 		mem_cgroup_update_stat_locked();
> 		unlock_page_cgroup();
> 
> Andrea-san, FILE_MAPPED is updated without treelock, at least. You can't depend
> on migration_lock about FILE_MAPPED.

Right. I'll consider this in the next version of the patch.
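
i.e. for the paths where tree_lock is not held (FILE_MAPPED above all)
the unlocked variant could simply be (a sketch using Nishimura-san's
proposed naming, with mem_cgroup_update_stat_locked() being the part
that does the actual counting):

	void mem_cgroup_update_stat_unlocked(struct page *page,
				enum mem_cgroup_stat_index idx, int val)
	{
		struct page_cgroup *pc;

		if (mem_cgroup_disabled())
			return;
		pc = lookup_page_cgroup(page);
		if (unlikely(!pc) || !PageCgroupUsed(pc))
			return;
		lock_page_cgroup(pc);
		mem_cgroup_update_stat_locked(pc, idx, val);
		unlock_page_cgroup(pc);
	}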

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH -mmotm 2/4] page_cgroup: introduce file cache flags
  2010-03-05  6:32     ` Balbir Singh
@ 2010-03-05 22:35       ` Andrea Righi
  -1 siblings, 0 replies; 68+ messages in thread
From: Andrea Righi @ 2010-03-05 22:35 UTC (permalink / raw)
  To: Balbir Singh
  Cc: KAMEZAWA Hiroyuki, Vivek Goyal, Peter Zijlstra, Trond Myklebust,
	Suleiman Souhlal, Greg Thelen, Daisuke Nishimura,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

On Fri, Mar 05, 2010 at 12:02:49PM +0530, Balbir Singh wrote:
> * Andrea Righi <arighi@develer.com> [2010-03-04 11:40:13]:
> 
> > Introduce page_cgroup flags to keep track of file cache pages.
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Signed-off-by: Andrea Righi <arighi@develer.com>
> > ---
> 
> Looks good
> 
> 
> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
>  
> 
> >  include/linux/page_cgroup.h |   49 +++++++++++++++++++++++++++++++++++++++++++
> >  1 files changed, 49 insertions(+), 0 deletions(-)
> > 
> > diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> > index 30b0813..1b79ded 100644
> > --- a/include/linux/page_cgroup.h
> > +++ b/include/linux/page_cgroup.h
> > @@ -39,6 +39,12 @@ enum {
> >  	PCG_CACHE, /* charged as cache */
> >  	PCG_USED, /* this object is in use. */
> >  	PCG_ACCT_LRU, /* page has been accounted for */
> > +	PCG_MIGRATE_LOCK, /* used for mutual exclusion of account migration */
> > +	PCG_ACCT_FILE_MAPPED, /* page is accounted as file rss */
> > +	PCG_ACCT_DIRTY, /* page is dirty */
> > +	PCG_ACCT_WRITEBACK, /* page is being written back to disk */
> > +	PCG_ACCT_WRITEBACK_TEMP, /* page is used as temporary buffer for FUSE */
> > +	PCG_ACCT_UNSTABLE_NFS, /* NFS page not yet committed to the server */
> >  };
> > 
> >  #define TESTPCGFLAG(uname, lname)			\
> > @@ -73,6 +79,27 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
> >  TESTPCGFLAG(AcctLRU, ACCT_LRU)
> >  TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)
> > 
> > +/* File cache and dirty memory flags */
> > +TESTPCGFLAG(FileMapped, ACCT_FILE_MAPPED)
> > +SETPCGFLAG(FileMapped, ACCT_FILE_MAPPED)
> > +CLEARPCGFLAG(FileMapped, ACCT_FILE_MAPPED)
> > +
> > +TESTPCGFLAG(Dirty, ACCT_DIRTY)
> > +SETPCGFLAG(Dirty, ACCT_DIRTY)
> > +CLEARPCGFLAG(Dirty, ACCT_DIRTY)
> > +
> > +TESTPCGFLAG(Writeback, ACCT_WRITEBACK)
> > +SETPCGFLAG(Writeback, ACCT_WRITEBACK)
> > +CLEARPCGFLAG(Writeback, ACCT_WRITEBACK)
> > +
> > +TESTPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP)
> > +SETPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP)
> > +CLEARPCGFLAG(WritebackTemp, ACCT_WRITEBACK_TEMP)
> > +
> > +TESTPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS)
> > +SETPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS)
> > +CLEARPCGFLAG(UnstableNFS, ACCT_UNSTABLE_NFS)
> > +
> >  static inline int page_cgroup_nid(struct page_cgroup *pc)
> >  {
> >  	return page_to_nid(pc->page);
> > @@ -83,6 +110,9 @@ static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
> >  	return page_zonenum(pc->page);
> >  }
> > 
> > +/*
> > + * lock_page_cgroup() should not be held under mapping->tree_lock
> > + */
> 
> Maybe a DEBUG WARN_ON would be appropriate here?

Sounds good. WARN_ON_ONCE()?
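
e.g. something like this (only a sketch on top of the existing PCG_LOCK
bit; checking irqs_disabled() is just a rough approximation of "called
under mapping->tree_lock", since tree_lock is taken with irqs disabled):

	static inline void lock_page_cgroup(struct page_cgroup *pc)
	{
		/*
		 * Catch (most) callers that try to take this lock while
		 * holding mapping->tree_lock, which disables irqs.
		 */
		WARN_ON_ONCE(irqs_disabled());
		bit_spin_lock(PCG_LOCK, &pc->flags);
	}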

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH -mmotm 4/4] memcg: dirty pages instrumentation
  2010-03-05  6:38     ` Balbir Singh
@ 2010-03-05 22:55       ` Andrea Righi
  -1 siblings, 0 replies; 68+ messages in thread
From: Andrea Righi @ 2010-03-05 22:55 UTC (permalink / raw)
  To: Balbir Singh
  Cc: KAMEZAWA Hiroyuki, Vivek Goyal, Peter Zijlstra, Trond Myklebust,
	Suleiman Souhlal, Greg Thelen, Daisuke Nishimura,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

On Fri, Mar 05, 2010 at 12:08:43PM +0530, Balbir Singh wrote:
> * Andrea Righi <arighi@develer.com> [2010-03-04 11:40:15]:
> 
> > Apply the cgroup dirty pages accounting and limiting infrastructure
> > to the appropriate kernel functions.
> > 
> > Signed-off-by: Andrea Righi <arighi@develer.com>
> > ---
> >  fs/fuse/file.c      |    5 +++
> >  fs/nfs/write.c      |    4 ++
> >  fs/nilfs2/segment.c |   11 +++++-
> >  mm/filemap.c        |    1 +
> >  mm/page-writeback.c |   91 ++++++++++++++++++++++++++++++++++-----------------
> >  mm/rmap.c           |    4 +-
> >  mm/truncate.c       |    2 +
> >  7 files changed, 84 insertions(+), 34 deletions(-)
> > 
> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > index a9f5e13..dbbdd53 100644
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -11,6 +11,7 @@
> >  #include <linux/pagemap.h>
> >  #include <linux/slab.h>
> >  #include <linux/kernel.h>
> > +#include <linux/memcontrol.h>
> >  #include <linux/sched.h>
> >  #include <linux/module.h>
> > 
> > @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
> > 
> >  	list_del(&req->writepages_entry);
> >  	dec_bdi_stat(bdi, BDI_WRITEBACK);
> > +	mem_cgroup_update_stat(req->pages[0],
> > +			MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
> >  	dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
> >  	bdi_writeout_inc(bdi);
> >  	wake_up(&fi->page_waitq);
> > @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
> >  	req->inode = inode;
> > 
> >  	inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
> > +	mem_cgroup_update_stat(tmp_page,
> > +			MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
> >  	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
> >  	end_page_writeback(page);
> > 
> > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > index b753242..7316f7a 100644
> > --- a/fs/nfs/write.c
> > +++ b/fs/nfs/write.c
> > @@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
> >  			req->wb_index,
> >  			NFS_PAGE_TAG_COMMIT);
> >  	spin_unlock(&inode->i_lock);
> > +	mem_cgroup_update_stat(req->wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1);
> >  	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
> >  	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
> >  	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> > @@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
> >  	struct page *page = req->wb_page;
> > 
> >  	if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
> > +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
> >  		dec_zone_page_state(page, NR_UNSTABLE_NFS);
> >  		dec_bdi_stat(page->mapping->backing_dev_info, BDI_UNSTABLE);
> >  		return 1;
> > @@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
> >  		req = nfs_list_entry(head->next);
> >  		nfs_list_remove_request(req);
> >  		nfs_mark_request_commit(req);
> > +		mem_cgroup_update_stat(req->wb_page,
> > +				MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
> >  		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
> >  		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
> >  				BDI_UNSTABLE);
> > diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> > index ada2f1b..27a01b1 100644
> > --- a/fs/nilfs2/segment.c
> > +++ b/fs/nilfs2/segment.c
> > @@ -24,6 +24,7 @@
> >  #include <linux/pagemap.h>
> >  #include <linux/buffer_head.h>
> >  #include <linux/writeback.h>
> > +#include <linux/memcontrol.h>
> >  #include <linux/bio.h>
> >  #include <linux/completion.h>
> >  #include <linux/blkdev.h>
> > @@ -1660,8 +1661,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
> >  	} while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
> >  	kunmap_atomic(kaddr, KM_USER0);
> > 
> > -	if (!TestSetPageWriteback(clone_page))
> > +	if (!TestSetPageWriteback(clone_page)) {
> > +		mem_cgroup_update_stat(clone_page,
> > +				MEM_CGROUP_STAT_WRITEBACK, 1);
> 
> I wonder if we should start implementing inc and dec to avoid passing
> the +1 and -1 parameters. It should make the code easier to read.

OK, it's always +1/-1, and I don't see any case where we would need
different values. So, it's better to move to the inc/dec naming.
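
Something like this, for example (just a sketch of the wrappers, keeping
the current mem_cgroup_update_stat() as the backend):

	static inline void mem_cgroup_inc_page_stat(struct page *page,
				enum mem_cgroup_stat_index idx)
	{
		mem_cgroup_update_stat(page, idx, 1);
	}

	static inline void mem_cgroup_dec_page_stat(struct page *page,
				enum mem_cgroup_stat_index idx)
	{
		mem_cgroup_update_stat(page, idx, -1);
	}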

> 
> >  		inc_zone_page_state(clone_page, NR_WRITEBACK);
> > +	}
> >  	unlock_page(clone_page);
> > 
> >  	return 0;
> > @@ -1783,8 +1787,11 @@ static void __nilfs_end_page_io(struct page *page, int err)
> >  	}
> > 
> >  	if (buffer_nilfs_allocated(page_buffers(page))) {
> > -		if (TestClearPageWriteback(page))
> > +		if (TestClearPageWriteback(page)) {
> > +			mem_cgroup_update_stat(page,
> > +					MEM_CGROUP_STAT_WRITEBACK, -1);
> >  			dec_zone_page_state(page, NR_WRITEBACK);
> > +		}
> >  	} else
> >  		end_page_writeback(page);
> >  }
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index fe09e51..f85acae 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
> >  	 * having removed the page entirely.
> >  	 */
> >  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> > +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
> >  		dec_zone_page_state(page, NR_FILE_DIRTY);
> >  		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
> >  	}
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index 5a0f8f3..c5d14ea 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -137,13 +137,16 @@ static struct prop_descriptor vm_dirties;
> >   */
> >  static int calc_period_shift(void)
> >  {
> > +	struct dirty_param dirty_param;
> 
> vm_dirty_param?

Agreed.

> 
> >  	unsigned long dirty_total;
> > 
> > -	if (vm_dirty_bytes)
> > -		dirty_total = vm_dirty_bytes / PAGE_SIZE;
> > +	get_dirty_param(&dirty_param);
> 
> get_vm_dirty_param() is a nicer name.

Agreed.

> 
> > +
> > +	if (dirty_param.dirty_bytes)
> > +		dirty_total = dirty_param.dirty_bytes / PAGE_SIZE;
> >  	else
> > -		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> > -				100;
> > +		dirty_total = (dirty_param.dirty_ratio *
> > +				determine_dirtyable_memory()) / 100;
> >  	return 2 + ilog2(dirty_total - 1);
> >  }
> > 
> > @@ -408,41 +411,46 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
> >   */
> >  unsigned long determine_dirtyable_memory(void)
> >  {
> > -	unsigned long x;
> > -
> > -	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> > +	unsigned long memory;
> > +	s64 memcg_memory;
> > 
> > +	memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> >  	if (!vm_highmem_is_dirtyable)
> > -		x -= highmem_dirtyable_memory(x);
> > -
> > -	return x + 1;	/* Ensure that we never return 0 */
> > +		memory -= highmem_dirtyable_memory(memory);
> > +	if (mem_cgroup_has_dirty_limit())
> > +		return memory + 1;
> 
> Vivek already pointed out this issue I suppose. Should be *not*

Right. Will be fixed in the next version of the patch.
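
i.e. the check should become something like this (only indicative: the
rest of the function is not visible in the quote above, so the memcg
branch below is just a sketch):

	if (!mem_cgroup_has_dirty_limit())
		return memory + 1;	/* no memcg dirty limits: global value */
	memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
	if (memcg_memory < 0)
		return memory + 1;
	return min((unsigned long)memcg_memory, memory) + 1;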

> 
> > +	memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
> 
> Can memcg_memory be 0?

No LRU file pages, no swappable pages, and RES_USAGE == RES_LIMIT? That
would trigger an OOM before memcg_memory == 0 can happen, I think.

Thanks,
-Andrea

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH -mmotm 4/4] memcg: dirty pages instrumentation
@ 2010-03-05 22:55       ` Andrea Righi
  0 siblings, 0 replies; 68+ messages in thread
From: Andrea Righi @ 2010-03-05 22:55 UTC (permalink / raw)
  To: Balbir Singh
  Cc: KAMEZAWA Hiroyuki, Vivek Goyal, Peter Zijlstra, Trond Myklebust,
	Suleiman Souhlal, Greg Thelen, Daisuke Nishimura,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

On Fri, Mar 05, 2010 at 12:08:43PM +0530, Balbir Singh wrote:
> * Andrea Righi <arighi@develer.com> [2010-03-04 11:40:15]:
> 
> > Apply the cgroup dirty pages accounting and limiting infrastructure
> > to the opportune kernel functions.
> > 
> > Signed-off-by: Andrea Righi <arighi@develer.com>
> > ---
> >  fs/fuse/file.c      |    5 +++
> >  fs/nfs/write.c      |    4 ++
> >  fs/nilfs2/segment.c |   11 +++++-
> >  mm/filemap.c        |    1 +
> >  mm/page-writeback.c |   91 ++++++++++++++++++++++++++++++++++-----------------
> >  mm/rmap.c           |    4 +-
> >  mm/truncate.c       |    2 +
> >  7 files changed, 84 insertions(+), 34 deletions(-)
> > 
> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > index a9f5e13..dbbdd53 100644
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -11,6 +11,7 @@
> >  #include <linux/pagemap.h>
> >  #include <linux/slab.h>
> >  #include <linux/kernel.h>
> > +#include <linux/memcontrol.h>
> >  #include <linux/sched.h>
> >  #include <linux/module.h>
> > 
> > @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
> > 
> >  	list_del(&req->writepages_entry);
> >  	dec_bdi_stat(bdi, BDI_WRITEBACK);
> > +	mem_cgroup_update_stat(req->pages[0],
> > +			MEM_CGROUP_STAT_WRITEBACK_TEMP, -1);
> >  	dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
> >  	bdi_writeout_inc(bdi);
> >  	wake_up(&fi->page_waitq);
> > @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
> >  	req->inode = inode;
> > 
> >  	inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
> > +	mem_cgroup_update_stat(tmp_page,
> > +			MEM_CGROUP_STAT_WRITEBACK_TEMP, 1);
> >  	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
> >  	end_page_writeback(page);
> > 
> > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > index b753242..7316f7a 100644
> > --- a/fs/nfs/write.c
> > +++ b/fs/nfs/write.c
> > @@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req)
> >  			req->wb_index,
> >  			NFS_PAGE_TAG_COMMIT);
> >  	spin_unlock(&inode->i_lock);
> > +	mem_cgroup_update_stat(req->wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1);
> >  	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
> >  	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_UNSTABLE);
> >  	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> > @@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req)
> >  	struct page *page = req->wb_page;
> > 
> >  	if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
> > +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
> >  		dec_zone_page_state(page, NR_UNSTABLE_NFS);
> >  		dec_bdi_stat(page->mapping->backing_dev_info, BDI_UNSTABLE);
> >  		return 1;
> > @@ -1273,6 +1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
> >  		req = nfs_list_entry(head->next);
> >  		nfs_list_remove_request(req);
> >  		nfs_mark_request_commit(req);
> > +		mem_cgroup_update_stat(req->wb_page,
> > +				MEM_CGROUP_STAT_UNSTABLE_NFS, -1);
> >  		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
> >  		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
> >  				BDI_UNSTABLE);
> > diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> > index ada2f1b..27a01b1 100644
> > --- a/fs/nilfs2/segment.c
> > +++ b/fs/nilfs2/segment.c
> > @@ -24,6 +24,7 @@
> >  #include <linux/pagemap.h>
> >  #include <linux/buffer_head.h>
> >  #include <linux/writeback.h>
> > +#include <linux/memcontrol.h>
> >  #include <linux/bio.h>
> >  #include <linux/completion.h>
> >  #include <linux/blkdev.h>
> > @@ -1660,8 +1661,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
> >  	} while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
> >  	kunmap_atomic(kaddr, KM_USER0);
> > 
> > -	if (!TestSetPageWriteback(clone_page))
> > +	if (!TestSetPageWriteback(clone_page)) {
> > +		mem_cgroup_update_stat(clone_page,
> > +				MEM_CGROUP_STAT_WRITEBACK, 1);
> 
> I wonder if we should start implementing inc and dec to avoid passing
> the +1 and -1 parameters. It should make the code easier to read.

OK, it's always +1/-1, and I don't see any case where we should use
different numbers. So, better to move to the inc/dec naming.

> 
> >  		inc_zone_page_state(clone_page, NR_WRITEBACK);
> > +	}
> >  	unlock_page(clone_page);
> > 
> >  	return 0;
> > @@ -1783,8 +1787,11 @@ static void __nilfs_end_page_io(struct page *page, int err)
> >  	}
> > 
> >  	if (buffer_nilfs_allocated(page_buffers(page))) {
> > -		if (TestClearPageWriteback(page))
> > +		if (TestClearPageWriteback(page)) {
> > +			mem_cgroup_update_stat(page,
> > +					MEM_CGROUP_STAT_WRITEBACK, -1);
> >  			dec_zone_page_state(page, NR_WRITEBACK);
> > +		}
> >  	} else
> >  		end_page_writeback(page);
> >  }
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index fe09e51..f85acae 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
> >  	 * having removed the page entirely.
> >  	 */
> >  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> > +		mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1);
> >  		dec_zone_page_state(page, NR_FILE_DIRTY);
> >  		dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTY);
> >  	}
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index 5a0f8f3..c5d14ea 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -137,13 +137,16 @@ static struct prop_descriptor vm_dirties;
> >   */
> >  static int calc_period_shift(void)
> >  {
> > +	struct dirty_param dirty_param;
> 
> vm_dirty_param?

Agreed.

> 
> >  	unsigned long dirty_total;
> > 
> > -	if (vm_dirty_bytes)
> > -		dirty_total = vm_dirty_bytes / PAGE_SIZE;
> > +	get_dirty_param(&dirty_param);
> 
> get_vm_dirty_param() is a nicer name.

Agreed.
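
FWIW, the parameters would be grouped like this (just a sketch; the field
types simply mirror the global vm_dirty_* / dirty_background_* knobs):

	struct vm_dirty_param {
		int dirty_ratio;
		int dirty_background_ratio;
		unsigned long dirty_bytes;
		unsigned long dirty_background_bytes;
	};

	/* fill *p with the current memcg's parameters, or the global ones */
	void get_vm_dirty_param(struct vm_dirty_param *p);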

> 
> > +
> > +	if (dirty_param.dirty_bytes)
> > +		dirty_total = dirty_param.dirty_bytes / PAGE_SIZE;
> >  	else
> > -		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> > -				100;
> > +		dirty_total = (dirty_param.dirty_ratio *
> > +				determine_dirtyable_memory()) / 100;
> >  	return 2 + ilog2(dirty_total - 1);
> >  }
> > 
> > @@ -408,41 +411,46 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
> >   */
> >  unsigned long determine_dirtyable_memory(void)
> >  {
> > -	unsigned long x;
> > -
> > -	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> > +	unsigned long memory;
> > +	s64 memcg_memory;
> > 
> > +	memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> >  	if (!vm_highmem_is_dirtyable)
> > -		x -= highmem_dirtyable_memory(x);
> > -
> > -	return x + 1;	/* Ensure that we never return 0 */
> > +		memory -= highmem_dirtyable_memory(memory);
> > +	if (mem_cgroup_has_dirty_limit())
> > +		return memory + 1;
> 
> Vivek already pointed out this issue, I suppose. The check should be
> *not* mem_cgroup_has_dirty_limit().

Right. Will be fixed in the next version of the patch.
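
IOW, the test just needs to be inverted; something like this (sketch):

	if (!mem_cgroup_has_dirty_limit())
		return memory + 1;
	memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);

	return min((unsigned long)memcg_memory, memory) + 1;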

> 
> > +	memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
> 
> Can memcg_memory be 0?

No LRU file pages, no swappable pages, and RES_USAGE == RES_LIMIT? That
would trigger an OOM before memcg_memory == 0 can happen, I think.

Thanks,
-Andrea


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH -mmotm 4/4] memcg: dirty pages instrumentation
  2010-03-07 20:57   ` Andrea Righi
@ 2010-03-08  2:31     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 68+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-08  2:31 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Balbir Singh, Daisuke Nishimura, Vivek Goyal, Peter Zijlstra,
	Trond Myklebust, Suleiman Souhlal, Greg Thelen,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

On Sun,  7 Mar 2010 21:57:54 +0100
Andrea Righi <arighi@develer.com> wrote:

> Apply the cgroup dirty pages accounting and limiting infrastructure to
> the opportune kernel functions.
> 
> As a bonus, make determine_dirtyable_memory() static again: this
> function isn't used anymore outside page writeback.
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>

I'm sorry if I misunderstand... almost all of this kind of accounting is
done under lock_page()... then...


> ---
>  fs/fuse/file.c            |    5 +
>  fs/nfs/write.c            |    6 +
>  fs/nilfs2/segment.c       |   11 ++-
>  include/linux/writeback.h |    2 -
>  mm/filemap.c              |    1 +
>  mm/page-writeback.c       |  224 ++++++++++++++++++++++++++++-----------------
>  mm/rmap.c                 |    4 +-
>  mm/truncate.c             |    2 +
>  8 files changed, 165 insertions(+), 90 deletions(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index a9f5e13..9a542e5 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -11,6 +11,7 @@
>  #include <linux/pagemap.h>
>  #include <linux/slab.h>
>  #include <linux/kernel.h>
> +#include <linux/memcontrol.h>
>  #include <linux/sched.h>
>  #include <linux/module.h>
>  
> @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
>  
>  	list_del(&req->writepages_entry);
>  	dec_bdi_stat(bdi, BDI_WRITEBACK);
> +	mem_cgroup_dec_page_stat_unlocked(req->pages[0],
> +			MEMCG_NR_FILE_WRITEBACK_TEMP);
>  	dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);

Hmm. IIUC, this req->pages[0] is "tmp_page", which works as a bounce
buffer for FUSE.  Then, this req->pages[] is not under any memcg.
So, this accounting never works.


>  	bdi_writeout_inc(bdi);
>  	wake_up(&fi->page_waitq);
> @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
>  	req->inode = inode;
>  
>  	inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
> +	mem_cgroup_inc_page_stat_unlocked(tmp_page,
> +			MEMCG_NR_FILE_WRITEBACK_TEMP);
>  	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
>  	end_page_writeback(page);
ditto.


>  
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index 53ff70e..a35e3c0 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -440,6 +440,8 @@ nfs_mark_request_commit(struct nfs_page *req)
>  			NFS_PAGE_TAG_COMMIT);
>  	nfsi->ncommit++;
>  	spin_unlock(&inode->i_lock);
> +	mem_cgroup_inc_page_stat_unlocked(req->wb_page,
> +			MEMCG_NR_FILE_UNSTABLE_NFS);
>  	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);

Here, if the page is locked (by lock_page()), it will never be uncharged.
Then, the _locked() version of the stat accounting can be used.
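
To make the distinction concrete, the two variants could simply be thin
wrappers around one internal helper, roughly like this (the helper name,
the enum type name and the bool argument are only illustrative):

	/* caller holds lock_page(): the page cannot be uncharged under us */
	static inline void mem_cgroup_inc_page_stat_locked(struct page *page,
					enum memcg_page_stat_item idx)
	{
		mem_cgroup_update_page_stat(page, idx, 1, false);
	}

	/* no guarantee from the caller: the helper has to take
	 * lock_page_cgroup() to keep the page->memcg binding stable */
	static inline void mem_cgroup_inc_page_stat_unlocked(struct page *page,
					enum memcg_page_stat_item idx)
	{
		mem_cgroup_update_page_stat(page, idx, 1, true);
	}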


>  	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
>  	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> @@ -451,6 +453,8 @@ nfs_clear_request_commit(struct nfs_page *req)
>  	struct page *page = req->wb_page;
>  
>  	if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
> +		mem_cgroup_dec_page_stat_unlocked(page,
> +				MEMCG_NR_FILE_UNSTABLE_NFS);
ditto.


>  		dec_zone_page_state(page, NR_UNSTABLE_NFS);
>  		dec_bdi_stat(page->mapping->backing_dev_info, BDI_RECLAIMABLE);
>  		return 1;
> @@ -1277,6 +1281,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
>  		req = nfs_list_entry(head->next);
>  		nfs_list_remove_request(req);
>  		nfs_mark_request_commit(req);
> +		mem_cgroup_dec_page_stat_unlocked(req->wb_page,
> +				MEMCG_NR_FILE_UNSTABLE_NFS);

ditto.

>  		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>  		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
>  				BDI_RECLAIMABLE);
> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> index ada2f1b..fb79558 100644
> --- a/fs/nilfs2/segment.c
> +++ b/fs/nilfs2/segment.c
> @@ -24,6 +24,7 @@
>  #include <linux/pagemap.h>
>  #include <linux/buffer_head.h>
>  #include <linux/writeback.h>
> +#include <linux/memcontrol.h>
>  #include <linux/bio.h>
>  #include <linux/completion.h>
>  #include <linux/blkdev.h>
> @@ -1660,8 +1661,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
>  	} while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
>  	kunmap_atomic(kaddr, KM_USER0);
>  
> -	if (!TestSetPageWriteback(clone_page))
> +	if (!TestSetPageWriteback(clone_page)) {
> +		mem_cgroup_inc_page_stat_unlocked(clone_page,
> +				MEMCG_NR_FILE_WRITEBACK);
>  		inc_zone_page_state(clone_page, NR_WRITEBACK);
> +	}
>  	unlock_page(clone_page);
>  
IIUC, this clone_page is not under any memcg either. Then, it can't be
handled (for now).




>  	return 0;
> @@ -1783,8 +1787,11 @@ static void __nilfs_end_page_io(struct page *page, int err)
>  	}
>  
>  	if (buffer_nilfs_allocated(page_buffers(page))) {
> -		if (TestClearPageWriteback(page))
> +		if (TestClearPageWriteback(page)) {
> +			mem_cgroup_dec_page_stat_unlocked(page,
> +					MEMCG_NR_FILE_WRITEBACK);
>  			dec_zone_page_state(page, NR_WRITEBACK);
> +		}

Hmm... isn't this the clone_page from above? If so, this should be avoided.

IMHO, in the 1st version, the NILFS and FUSE bounce pages should be skipped.
If we want to limit them, we have to charge against the bounce page as well.
I'm not sure whether that is difficult or not... but...
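
FWIW, even if the calls stay in those paths, the stat helper can just bail
out early on pages that were never charged; a rough sketch of the check:

	struct page_cgroup *pc = lookup_page_cgroup(page);

	/* bounce pages (FUSE tmp_page, NILFS clone_page) are not charged */
	if (!pc || !PageCgroupUsed(pc))
		return;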




>  	} else
>  		end_page_writeback(page);
>  }
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index dd9512d..39e4cb2 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -117,8 +117,6 @@ extern int vm_highmem_is_dirtyable;
>  extern int block_dump;
>  extern int laptop_mode;
>  
> -extern unsigned long determine_dirtyable_memory(void);
> -
>  extern int dirty_background_ratio_handler(struct ctl_table *table, int write,
>  		void __user *buffer, size_t *lenp,
>  		loff_t *ppos);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 62cbac0..37f89d1 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
>  	 * having removed the page entirely.
>  	 */
>  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_dec_page_stat_locked(page, MEMCG_NR_FILE_DIRTY);
>  		dec_zone_page_state(page, NR_FILE_DIRTY);
>  		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
>  	}
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index ab84693..9d4503a 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -131,6 +131,111 @@ static struct prop_descriptor vm_completions;
>  static struct prop_descriptor vm_dirties;
>  
>  /*
> + * Work out the current dirty-memory clamping and background writeout
> + * thresholds.
> + *
> + * The main aim here is to lower them aggressively if there is a lot of mapped
> + * memory around.  To avoid stressing page reclaim with lots of unreclaimable
> + * pages.  It is better to clamp down on writers than to start swapping, and
> + * performing lots of scanning.
> + *
> + * We only allow 1/2 of the currently-unmapped memory to be dirtied.
> + *
> + * We don't permit the clamping level to fall below 5% - that is getting rather
> + * excessive.
> + *
> + * We make sure that the background writeout level is below the adjusted
> + * clamping level.
> + */
> +
> +static unsigned long highmem_dirtyable_memory(unsigned long total)
> +{
> +#ifdef CONFIG_HIGHMEM
> +	int node;
> +	unsigned long x = 0;
> +
> +	for_each_node_state(node, N_HIGH_MEMORY) {
> +		struct zone *z =
> +			&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
> +
> +		x += zone_page_state(z, NR_FREE_PAGES) +
> +		     zone_reclaimable_pages(z);
> +	}
> +	/*
> +	 * Make sure that the number of highmem pages is never larger
> +	 * than the number of the total dirtyable memory. This can only
> +	 * occur in very strange VM situations but we want to make sure
> +	 * that this does not occur.
> +	 */
> +	return min(x, total);
> +#else
> +	return 0;
> +#endif
> +}
> +
> +static unsigned long get_global_dirtyable_memory(void)
> +{
> +	unsigned long memory;
> +
> +	memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> +	if (!vm_highmem_is_dirtyable)
> +		memory -= highmem_dirtyable_memory(memory);
> +	return memory + 1;
> +}
> +
> +static unsigned long get_dirtyable_memory(void)
> +{
> +	unsigned long memory;
> +	s64 memcg_memory;
> +
> +	memory = get_global_dirtyable_memory();
> +	if (!mem_cgroup_has_dirty_limit())
> +		return memory;
> +	memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
> +	BUG_ON(memcg_memory < 0);
> +
> +	return min((unsigned long)memcg_memory, memory);
> +}
> +
> +static long get_reclaimable_pages(void)
> +{
> +	s64 ret;
> +
> +	if (!mem_cgroup_has_dirty_limit())
> +		return global_page_state(NR_FILE_DIRTY) +
> +			global_page_state(NR_UNSTABLE_NFS);
> +	ret = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +	BUG_ON(ret < 0);
> +
> +	return ret;
> +}
> +
> +static long get_writeback_pages(void)
> +{
> +	s64 ret;
> +
> +	if (!mem_cgroup_has_dirty_limit())
> +		return global_page_state(NR_WRITEBACK);
> +	ret = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
> +	BUG_ON(ret < 0);
> +
> +	return ret;
> +}
> +
> +static unsigned long get_dirty_writeback_pages(void)
> +{
> +	s64 ret;
> +
> +	if (!mem_cgroup_has_dirty_limit())
> +		return global_page_state(NR_UNSTABLE_NFS) +
> +			global_page_state(NR_WRITEBACK);
> +	ret = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> +	BUG_ON(ret < 0);
> +
> +	return ret;
> +}
> +
> +/*
>   * couple the period to the dirty_ratio:
>   *
>   *   period/2 ~ roundup_pow_of_two(dirty limit)
> @@ -142,7 +247,7 @@ static int calc_period_shift(void)
>  	if (vm_dirty_bytes)
>  		dirty_total = vm_dirty_bytes / PAGE_SIZE;
>  	else
> -		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> +		dirty_total = (vm_dirty_ratio * get_global_dirtyable_memory()) /
>  				100;
>  	return 2 + ilog2(dirty_total - 1);
>  }
> @@ -355,92 +460,34 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned max_ratio)
>  }
>  EXPORT_SYMBOL(bdi_set_max_ratio);
>  
> -/*
> - * Work out the current dirty-memory clamping and background writeout
> - * thresholds.
> - *
> - * The main aim here is to lower them aggressively if there is a lot of mapped
> - * memory around.  To avoid stressing page reclaim with lots of unreclaimable
> - * pages.  It is better to clamp down on writers than to start swapping, and
> - * performing lots of scanning.
> - *
> - * We only allow 1/2 of the currently-unmapped memory to be dirtied.
> - *
> - * We don't permit the clamping level to fall below 5% - that is getting rather
> - * excessive.
> - *
> - * We make sure that the background writeout level is below the adjusted
> - * clamping level.
> - */
> -
> -static unsigned long highmem_dirtyable_memory(unsigned long total)
> -{
> -#ifdef CONFIG_HIGHMEM
> -	int node;
> -	unsigned long x = 0;
> -
> -	for_each_node_state(node, N_HIGH_MEMORY) {
> -		struct zone *z =
> -			&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
> -
> -		x += zone_page_state(z, NR_FREE_PAGES) +
> -		     zone_reclaimable_pages(z);
> -	}
> -	/*
> -	 * Make sure that the number of highmem pages is never larger
> -	 * than the number of the total dirtyable memory. This can only
> -	 * occur in very strange VM situations but we want to make sure
> -	 * that this does not occur.
> -	 */
> -	return min(x, total);
> -#else
> -	return 0;
> -#endif
> -}
> -
> -/**
> - * determine_dirtyable_memory - amount of memory that may be used
> - *
> - * Returns the numebr of pages that can currently be freed and used
> - * by the kernel for direct mappings.
> - */
> -unsigned long determine_dirtyable_memory(void)
> -{
> -	unsigned long x;
> -
> -	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> -
> -	if (!vm_highmem_is_dirtyable)
> -		x -= highmem_dirtyable_memory(x);
> -
> -	return x + 1;	/* Ensure that we never return 0 */
> -}
> -
>  void
>  get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
>  		 unsigned long *pbdi_dirty, struct backing_dev_info *bdi)
>  {
> -	unsigned long background;
> -	unsigned long dirty;
> -	unsigned long available_memory = determine_dirtyable_memory();
> +	unsigned long dirty, background;
> +	unsigned long available_memory = get_dirtyable_memory();
>  	struct task_struct *tsk;
> +	struct vm_dirty_param dirty_param;
>  
> -	if (vm_dirty_bytes)
> -		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> +	get_vm_dirty_param(&dirty_param);
> +
> +	if (dirty_param.dirty_bytes)
> +		dirty = DIV_ROUND_UP(dirty_param.dirty_bytes, PAGE_SIZE);
>  	else {
>  		int dirty_ratio;
>  
> -		dirty_ratio = vm_dirty_ratio;
> +		dirty_ratio = dirty_param.dirty_ratio;
>  		if (dirty_ratio < 5)
>  			dirty_ratio = 5;
>  		dirty = (dirty_ratio * available_memory) / 100;
>  	}
>  
> -	if (dirty_background_bytes)
> -		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> +	if (dirty_param.dirty_background_bytes)
> +		background = DIV_ROUND_UP(dirty_param.dirty_background_bytes,
> +						PAGE_SIZE);
>  	else
> -		background = (dirty_background_ratio * available_memory) / 100;
> -
> +		background = (dirty_param.dirty_background_ratio *
> +						available_memory) / 100;
>  	if (background >= dirty)
>  		background = dirty / 2;
>  	tsk = current;
> @@ -505,9 +552,8 @@ static void balance_dirty_pages(struct address_space *mapping,
>  		get_dirty_limits(&background_thresh, &dirty_thresh,
>  				&bdi_thresh, bdi);
>  
> -		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> -					global_page_state(NR_UNSTABLE_NFS);
> -		nr_writeback = global_page_state(NR_WRITEBACK);
> +		nr_reclaimable = get_reclaimable_pages();
> +		nr_writeback = get_writeback_pages();
>  
>  		bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
>  		bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> @@ -593,10 +639,9 @@ static void balance_dirty_pages(struct address_space *mapping,
>  	 * In normal mode, we start background writeout at the lower
>  	 * background_thresh, to keep the amount of dirty memory low.
>  	 */
> +	nr_reclaimable = get_reclaimable_pages();
>  	if ((laptop_mode && pages_written) ||
> -	    (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
> -			       + global_page_state(NR_UNSTABLE_NFS))
> -					  > background_thresh)))
> +	    (!laptop_mode && (nr_reclaimable > background_thresh)))
>  		bdi_start_writeback(bdi, NULL, 0);
>  }
>  
> @@ -660,6 +705,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>  	unsigned long dirty_thresh;
>  
>          for ( ; ; ) {
> +		unsigned long dirty;
> +
>  		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
>  
>                  /*
> @@ -668,10 +715,10 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>                   */
>                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
>  
> -                if (global_page_state(NR_UNSTABLE_NFS) +
> -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> -                        	break;
> -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> +		dirty = get_dirty_writeback_pages();
> +		if (dirty <= dirty_thresh)
> +			break;
> +		congestion_wait(BLK_RW_ASYNC, HZ/10);
>  
>  		/*
>  		 * The caller might hold locks which can prevent IO completion
> @@ -1078,6 +1125,7 @@ int __set_page_dirty_no_writeback(struct page *page)
>  void account_page_dirtied(struct page *page, struct address_space *mapping)
>  {
>  	if (mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_inc_page_stat_locked(page, MEMCG_NR_FILE_DIRTY);
>  		__inc_zone_page_state(page, NR_FILE_DIRTY);
>  		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
>  		task_dirty_inc(current);
> @@ -1279,6 +1327,8 @@ int clear_page_dirty_for_io(struct page *page)
>  		 * for more comments.
>  		 */
>  		if (TestClearPageDirty(page)) {
> +			mem_cgroup_dec_page_stat_unlocked(page,
> +					MEMCG_NR_FILE_DIRTY);
>  			dec_zone_page_state(page, NR_FILE_DIRTY);
>  			dec_bdi_stat(mapping->backing_dev_info,
>  					BDI_RECLAIMABLE);

This is called under lock_page(). Then, the page is stable under us, and
the locked version can be used.


> @@ -1314,8 +1364,11 @@ int test_clear_page_writeback(struct page *page)
>  	} else {
>  		ret = TestClearPageWriteback(page);
>  	}
> -	if (ret)
> +	if (ret) {
> +		mem_cgroup_dec_page_stat_unlocked(page,
> +				MEMCG_NR_FILE_WRITEBACK);
>  		dec_zone_page_state(page, NR_WRITEBACK);
> +	}
Can this be moved up to under the tree_lock?


>  	return ret;
>  }
>  
> @@ -1345,8 +1398,11 @@ int test_set_page_writeback(struct page *page)
>  	} else {
>  		ret = TestSetPageWriteback(page);
>  	}
> -	if (!ret)
> +	if (!ret) {
> +		mem_cgroup_inc_page_stat_unlocked(page,
> +				MEMCG_NR_FILE_WRITEBACK);
>  		inc_zone_page_state(page, NR_WRITEBACK);
> +	}
>  	return ret;
>  
Maybe moving this to under the tree_lock and using the unlocked version is
better.



>  }
> diff --git a/mm/rmap.c b/mm/rmap.c
> index fcd593c..61f07cc 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -828,8 +828,8 @@ void page_add_new_anon_rmap(struct page *page,
>  void page_add_file_rmap(struct page *page)
>  {
>  	if (atomic_inc_and_test(&page->_mapcount)) {
> +		mem_cgroup_inc_page_stat_unlocked(page, MEMCG_NR_FILE_MAPPED);
>  		__inc_zone_page_state(page, NR_FILE_MAPPED);
> -		mem_cgroup_update_file_mapped(page, 1);
>  	}
>  }
>  
> @@ -860,8 +860,8 @@ void page_remove_rmap(struct page *page)
>  		mem_cgroup_uncharge_page(page);
>  		__dec_zone_page_state(page, NR_ANON_PAGES);
>  	} else {
> +		mem_cgroup_dec_page_stat_unlocked(page, MEMCG_NR_FILE_MAPPED);
>  		__dec_zone_page_state(page, NR_FILE_MAPPED);
> -		mem_cgroup_update_file_mapped(page, -1);
>  	}
>  	/*
>  	 * It would be tidy to reset the PageAnon mapping here,
> diff --git a/mm/truncate.c b/mm/truncate.c
> index e87e372..1613632 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -73,6 +73,8 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
>  	if (TestClearPageDirty(page)) {
>  		struct address_space *mapping = page->mapping;
>  		if (mapping && mapping_cap_account_dirty(mapping)) {
> +			mem_cgroup_dec_page_stat_unlocked(page,
> +					MEMCG_NR_FILE_DIRTY);
>  			dec_zone_page_state(page, NR_FILE_DIRTY);
>  			dec_bdi_stat(mapping->backing_dev_info,
>  					BDI_RECLAIMABLE);

cancel_dirty_page() is called after do_invalidatepage() but before
remove_from_page_cache(), and it's all done under lock_page().

Then, we can use the "locked" accounting here.

If you feel the locked/unlocked accounting is too complex, simply adding
irq disable/enable around lock_page_cgroup() is a choice.
But please measure the performance impact before doing that.
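
I mean something along these lines inside the update helper (the inner
function name is made up, it's just to show the idea):

	unsigned long flags;
	struct page_cgroup *pc = lookup_page_cgroup(page);

	/* keep lock_page_cgroup() safe against the accounting done from
	 * contexts that run with irqs disabled (e.g. under tree_lock) */
	local_irq_save(flags);
	lock_page_cgroup(pc);
	if (PageCgroupUsed(pc))
		__mem_cgroup_update_page_stat(pc, idx, val);
	unlock_page_cgroup(pc);
	local_irq_restore(flags);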


Thanks,
-Kame



^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH -mmotm 4/4] memcg: dirty pages instrumentation
@ 2010-03-08  2:31     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 68+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-03-08  2:31 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Balbir Singh, Daisuke Nishimura, Vivek Goyal, Peter Zijlstra,
	Trond Myklebust, Suleiman Souhlal, Greg Thelen,
	Kirill A. Shutemov, Andrew Morton, containers, linux-kernel,
	linux-mm

On Sun,  7 Mar 2010 21:57:54 +0100
Andrea Righi <arighi@develer.com> wrote:

> Apply the cgroup dirty pages accounting and limiting infrastructure to
> the opportune kernel functions.
> 
> As a bonus, make determine_dirtyable_memory() static again: this
> function isn't used anymore outside page writeback.
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>

I'm sorry if I misunderstand..almost all this kind of accounting is done
under lock_page()...then...


> ---
>  fs/fuse/file.c            |    5 +
>  fs/nfs/write.c            |    6 +
>  fs/nilfs2/segment.c       |   11 ++-
>  include/linux/writeback.h |    2 -
>  mm/filemap.c              |    1 +
>  mm/page-writeback.c       |  224 ++++++++++++++++++++++++++++-----------------
>  mm/rmap.c                 |    4 +-
>  mm/truncate.c             |    2 +
>  8 files changed, 165 insertions(+), 90 deletions(-)
> 
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index a9f5e13..9a542e5 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -11,6 +11,7 @@
>  #include <linux/pagemap.h>
>  #include <linux/slab.h>
>  #include <linux/kernel.h>
> +#include <linux/memcontrol.h>
>  #include <linux/sched.h>
>  #include <linux/module.h>
>  
> @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
>  
>  	list_del(&req->writepages_entry);
>  	dec_bdi_stat(bdi, BDI_WRITEBACK);
> +	mem_cgroup_dec_page_stat_unlocked(req->pages[0],
> +			MEMCG_NR_FILE_WRITEBACK_TEMP);
>  	dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);

Hmm. IIUC, this req->pages[0] is "tmp_page", which works as bounce_buffer for FUSE.
Then, this req->pages[] is not under any memcg.
So, this accounting never work.


>  	bdi_writeout_inc(bdi);
>  	wake_up(&fi->page_waitq);
> @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
>  	req->inode = inode;
>  
>  	inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
> +	mem_cgroup_inc_page_stat_unlocked(tmp_page,
> +			MEMCG_NR_FILE_WRITEBACK_TEMP);
>  	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
>  	end_page_writeback(page);
ditto.


>  
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index 53ff70e..a35e3c0 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -440,6 +440,8 @@ nfs_mark_request_commit(struct nfs_page *req)
>  			NFS_PAGE_TAG_COMMIT);
>  	nfsi->ncommit++;
>  	spin_unlock(&inode->i_lock);
> +	mem_cgroup_inc_page_stat_unlocked(req->wb_page,
> +			MEMCG_NR_FILE_UNSTABLE_NFS);
>  	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);

Here, if the page is locked (by lock_page()), it will never be uncharged.
Then, _locked() version stat accounting can be used.


>  	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
>  	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> @@ -451,6 +453,8 @@ nfs_clear_request_commit(struct nfs_page *req)
>  	struct page *page = req->wb_page;
>  
>  	if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
> +		mem_cgroup_dec_page_stat_unlocked(page,
> +				MEMCG_NR_FILE_UNSTABLE_NFS);
ditto.


>  		dec_zone_page_state(page, NR_UNSTABLE_NFS);
>  		dec_bdi_stat(page->mapping->backing_dev_info, BDI_RECLAIMABLE);
>  		return 1;
> @@ -1277,6 +1281,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
>  		req = nfs_list_entry(head->next);
>  		nfs_list_remove_request(req);
>  		nfs_mark_request_commit(req);
> +		mem_cgroup_dec_page_stat_unlocked(req->wb_page,
> +				MEMCG_NR_FILE_UNSTABLE_NFS);

ditto.

>  		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>  		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
>  				BDI_RECLAIMABLE);
> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> index ada2f1b..fb79558 100644
> --- a/fs/nilfs2/segment.c
> +++ b/fs/nilfs2/segment.c
> @@ -24,6 +24,7 @@
>  #include <linux/pagemap.h>
>  #include <linux/buffer_head.h>
>  #include <linux/writeback.h>
> +#include <linux/memcontrol.h>
>  #include <linux/bio.h>
>  #include <linux/completion.h>
>  #include <linux/blkdev.h>
> @@ -1660,8 +1661,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
>  	} while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
>  	kunmap_atomic(kaddr, KM_USER0);
>  
> -	if (!TestSetPageWriteback(clone_page))
> +	if (!TestSetPageWriteback(clone_page)) {
> +		mem_cgroup_inc_page_stat_unlocked(clone_page,
> +				MEMCG_NR_FILE_WRITEBACK);
>  		inc_zone_page_state(clone_page, NR_WRITEBACK);
> +	}
>  	unlock_page(clone_page);
>  
IIUC, this clone_page is not under memcg, too. Then, it can't be handled. (now)




>  	return 0;
> @@ -1783,8 +1787,11 @@ static void __nilfs_end_page_io(struct page *page, int err)
>  	}
>  
>  	if (buffer_nilfs_allocated(page_buffers(page))) {
> -		if (TestClearPageWriteback(page))
> +		if (TestClearPageWriteback(page)) {
> +			mem_cgroup_dec_page_stat_unlocked(page,
> +					MEMCG_NR_FILE_WRITEBACK);
>  			dec_zone_page_state(page, NR_WRITEBACK);
> +		}

Hmm...isn't this a clone_page in above ? If so, this should be avoided.

IMHO, at 1st version, NILFS and FUSE's bounce page should be skipped.
If we want to limit this, we have to charge against bounce page.
I'm not sure it's difficult or not...but...




>  	} else
>  		end_page_writeback(page);
>  }
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index dd9512d..39e4cb2 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -117,8 +117,6 @@ extern int vm_highmem_is_dirtyable;
>  extern int block_dump;
>  extern int laptop_mode;
>  
> -extern unsigned long determine_dirtyable_memory(void);
> -
>  extern int dirty_background_ratio_handler(struct ctl_table *table, int write,
>  		void __user *buffer, size_t *lenp,
>  		loff_t *ppos);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 62cbac0..37f89d1 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
>  	 * having removed the page entirely.
>  	 */
>  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_dec_page_stat_locked(page, MEMCG_NR_FILE_DIRTY);
>  		dec_zone_page_state(page, NR_FILE_DIRTY);
>  		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
>  	}
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index ab84693..9d4503a 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -131,6 +131,111 @@ static struct prop_descriptor vm_completions;
>  static struct prop_descriptor vm_dirties;
>  
>  /*
> + * Work out the current dirty-memory clamping and background writeout
> + * thresholds.
> + *
> + * The main aim here is to lower them aggressively if there is a lot of mapped
> + * memory around.  To avoid stressing page reclaim with lots of unreclaimable
> + * pages.  It is better to clamp down on writers than to start swapping, and
> + * performing lots of scanning.
> + *
> + * We only allow 1/2 of the currently-unmapped memory to be dirtied.
> + *
> + * We don't permit the clamping level to fall below 5% - that is getting rather
> + * excessive.
> + *
> + * We make sure that the background writeout level is below the adjusted
> + * clamping level.
> + */
> +
> +static unsigned long highmem_dirtyable_memory(unsigned long total)
> +{
> +#ifdef CONFIG_HIGHMEM
> +	int node;
> +	unsigned long x = 0;
> +
> +	for_each_node_state(node, N_HIGH_MEMORY) {
> +		struct zone *z =
> +			&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
> +
> +		x += zone_page_state(z, NR_FREE_PAGES) +
> +		     zone_reclaimable_pages(z);
> +	}
> +	/*
> +	 * Make sure that the number of highmem pages is never larger
> +	 * than the number of the total dirtyable memory. This can only
> +	 * occur in very strange VM situations but we want to make sure
> +	 * that this does not occur.
> +	 */
> +	return min(x, total);
> +#else
> +	return 0;
> +#endif
> +}
> +
> +static unsigned long get_global_dirtyable_memory(void)
> +{
> +	unsigned long memory;
> +
> +	memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> +	if (!vm_highmem_is_dirtyable)
> +		memory -= highmem_dirtyable_memory(memory);
> +	return memory + 1;
> +}
> +
> +static unsigned long get_dirtyable_memory(void)
> +{
> +	unsigned long memory;
> +	s64 memcg_memory;
> +
> +	memory = get_global_dirtyable_memory();
> +	if (!mem_cgroup_has_dirty_limit())
> +		return memory;
> +	memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
> +	BUG_ON(memcg_memory < 0);
> +
> +	return min((unsigned long)memcg_memory, memory);
> +}
> +
> +static long get_reclaimable_pages(void)
> +{
> +	s64 ret;
> +
> +	if (!mem_cgroup_has_dirty_limit())
> +		return global_page_state(NR_FILE_DIRTY) +
> +			global_page_state(NR_UNSTABLE_NFS);
> +	ret = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +	BUG_ON(ret < 0);
> +
> +	return ret;
> +}
> +
> +static long get_writeback_pages(void)
> +{
> +	s64 ret;
> +
> +	if (!mem_cgroup_has_dirty_limit())
> +		return global_page_state(NR_WRITEBACK);
> +	ret = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
> +	BUG_ON(ret < 0);
> +
> +	return ret;
> +}
> +
> +static unsigned long get_dirty_writeback_pages(void)
> +{
> +	s64 ret;
> +
> +	if (!mem_cgroup_has_dirty_limit())
> +		return global_page_state(NR_UNSTABLE_NFS) +
> +			global_page_state(NR_WRITEBACK);
> +	ret = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
> +	BUG_ON(ret < 0);
> +
> +	return ret;
> +}
> +
> +/*
>   * couple the period to the dirty_ratio:
>   *
>   *   period/2 ~ roundup_pow_of_two(dirty limit)
> @@ -142,7 +247,7 @@ static int calc_period_shift(void)
>  	if (vm_dirty_bytes)
>  		dirty_total = vm_dirty_bytes / PAGE_SIZE;
>  	else
> -		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
> +		dirty_total = (vm_dirty_ratio * get_global_dirtyable_memory()) /
>  				100;
>  	return 2 + ilog2(dirty_total - 1);
>  }
> @@ -355,92 +460,34 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned max_ratio)
>  }
>  EXPORT_SYMBOL(bdi_set_max_ratio);
>  
> -/*
> - * Work out the current dirty-memory clamping and background writeout
> - * thresholds.
> - *
> - * The main aim here is to lower them aggressively if there is a lot of mapped
> - * memory around.  To avoid stressing page reclaim with lots of unreclaimable
> - * pages.  It is better to clamp down on writers than to start swapping, and
> - * performing lots of scanning.
> - *
> - * We only allow 1/2 of the currently-unmapped memory to be dirtied.
> - *
> - * We don't permit the clamping level to fall below 5% - that is getting rather
> - * excessive.
> - *
> - * We make sure that the background writeout level is below the adjusted
> - * clamping level.
> - */
> -
> -static unsigned long highmem_dirtyable_memory(unsigned long total)
> -{
> -#ifdef CONFIG_HIGHMEM
> -	int node;
> -	unsigned long x = 0;
> -
> -	for_each_node_state(node, N_HIGH_MEMORY) {
> -		struct zone *z =
> -			&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
> -
> -		x += zone_page_state(z, NR_FREE_PAGES) +
> -		     zone_reclaimable_pages(z);
> -	}
> -	/*
> -	 * Make sure that the number of highmem pages is never larger
> -	 * than the number of the total dirtyable memory. This can only
> -	 * occur in very strange VM situations but we want to make sure
> -	 * that this does not occur.
> -	 */
> -	return min(x, total);
> -#else
> -	return 0;
> -#endif
> -}
> -
> -/**
> - * determine_dirtyable_memory - amount of memory that may be used
> - *
> - * Returns the numebr of pages that can currently be freed and used
> - * by the kernel for direct mappings.
> - */
> -unsigned long determine_dirtyable_memory(void)
> -{
> -	unsigned long x;
> -
> -	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
> -
> -	if (!vm_highmem_is_dirtyable)
> -		x -= highmem_dirtyable_memory(x);
> -
> -	return x + 1;	/* Ensure that we never return 0 */
> -}
> -
>  void
>  get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
>  		 unsigned long *pbdi_dirty, struct backing_dev_info *bdi)
>  {
> -	unsigned long background;
> -	unsigned long dirty;
> -	unsigned long available_memory = determine_dirtyable_memory();
> +	unsigned long dirty, background;
> +	unsigned long available_memory = get_dirtyable_memory();
>  	struct task_struct *tsk;
> +	struct vm_dirty_param dirty_param;
>  
> -	if (vm_dirty_bytes)
> -		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> +	get_vm_dirty_param(&dirty_param);
> +
> +	if (dirty_param.dirty_bytes)
> +		dirty = DIV_ROUND_UP(dirty_param.dirty_bytes, PAGE_SIZE);
>  	else {
>  		int dirty_ratio;
>  
> -		dirty_ratio = vm_dirty_ratio;
> +		dirty_ratio = dirty_param.dirty_ratio;
>  		if (dirty_ratio < 5)
>  			dirty_ratio = 5;
>  		dirty = (dirty_ratio * available_memory) / 100;
>  	}
>  
> -	if (dirty_background_bytes)
> -		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> +	if (dirty_param.dirty_background_bytes)
> +		background = DIV_ROUND_UP(dirty_param.dirty_background_bytes,
> +						PAGE_SIZE);
>  	else
> -		background = (dirty_background_ratio * available_memory) / 100;
> -
> +		background = (dirty_param.dirty_background_ratio *
> +						available_memory) / 100;
>  	if (background >= dirty)
>  		background = dirty / 2;
>  	tsk = current;
> @@ -505,9 +552,8 @@ static void balance_dirty_pages(struct address_space *mapping,
>  		get_dirty_limits(&background_thresh, &dirty_thresh,
>  				&bdi_thresh, bdi);
>  
> -		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> -					global_page_state(NR_UNSTABLE_NFS);
> -		nr_writeback = global_page_state(NR_WRITEBACK);
> +		nr_reclaimable = get_reclaimable_pages();
> +		nr_writeback = get_writeback_pages();
>  
>  		bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
>  		bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> @@ -593,10 +639,9 @@ static void balance_dirty_pages(struct address_space *mapping,
>  	 * In normal mode, we start background writeout at the lower
>  	 * background_thresh, to keep the amount of dirty memory low.
>  	 */
> +	nr_reclaimable = get_reclaimable_pages();
>  	if ((laptop_mode && pages_written) ||
> -	    (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
> -			       + global_page_state(NR_UNSTABLE_NFS))
> -					  > background_thresh)))
> +	    (!laptop_mode && (nr_reclaimable > background_thresh)))
>  		bdi_start_writeback(bdi, NULL, 0);
>  }
>  
> @@ -660,6 +705,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>  	unsigned long dirty_thresh;
>  
>          for ( ; ; ) {
> +		unsigned long dirty;
> +
>  		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
>  
>                  /*
> @@ -668,10 +715,10 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>                   */
>                  dirty_thresh += dirty_thresh / 10;      /* wheeee... */
>  
> -                if (global_page_state(NR_UNSTABLE_NFS) +
> -			global_page_state(NR_WRITEBACK) <= dirty_thresh)
> -                        	break;
> -                congestion_wait(BLK_RW_ASYNC, HZ/10);
> +		dirty = get_dirty_writeback_pages();
> +		if (dirty <= dirty_thresh)
> +			break;
> +		congestion_wait(BLK_RW_ASYNC, HZ/10);
>  
>  		/*
>  		 * The caller might hold locks which can prevent IO completion
> @@ -1078,6 +1125,7 @@ int __set_page_dirty_no_writeback(struct page *page)
>  void account_page_dirtied(struct page *page, struct address_space *mapping)
>  {
>  	if (mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_inc_page_stat_locked(page, MEMCG_NR_FILE_DIRTY);
>  		__inc_zone_page_state(page, NR_FILE_DIRTY);
>  		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
>  		task_dirty_inc(current);
> @@ -1279,6 +1327,8 @@ int clear_page_dirty_for_io(struct page *page)
>  		 * for more comments.
>  		 */
>  		if (TestClearPageDirty(page)) {
> +			mem_cgroup_dec_page_stat_unlocked(page,
> +					MEMCG_NR_FILE_DIRTY);
>  			dec_zone_page_state(page, NR_FILE_DIRTY);
>  			dec_bdi_stat(mapping->backing_dev_info,
>  					BDI_RECLAIMABLE);

This is called under lock_page(). Then, the page is stable under us.
locked version can be used.


> @@ -1314,8 +1364,11 @@ int test_clear_page_writeback(struct page *page)
>  	} else {
>  		ret = TestClearPageWriteback(page);
>  	}
> -	if (ret)
> +	if (ret) {
> +		mem_cgroup_dec_page_stat_unlocked(page,
> +				MEMCG_NR_FILE_WRITEBACK);
>  		dec_zone_page_state(page, NR_WRITEBACK);
> +	}
Can this be moved up to under tree_lock ?


>  	return ret;
>  }
>  
> @@ -1345,8 +1398,11 @@ int test_set_page_writeback(struct page *page)
>  	} else {
>  		ret = TestSetPageWriteback(page);
>  	}
> -	if (!ret)
> +	if (!ret) {
> +		mem_cgroup_inc_page_stat_unlocked(page,
> +				MEMCG_NR_FILE_WRITEBACK);
>  		inc_zone_page_state(page, NR_WRITEBACK);
> +	}
>  	return ret;
>  
Maybe moving this up under tree_lock and using the unlocked version is better.
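
On the set side that would be the mirror image of the sketch above
(again only an illustration of placement, with the same caveats):

	spin_lock_irqsave(&mapping->tree_lock, flags);
	ret = TestSetPageWriteback(page);
	if (!ret) {
		/* moved inside the tree_lock-protected section */
		mem_cgroup_inc_page_stat_unlocked(page,
				MEMCG_NR_FILE_WRITEBACK);
		/* radix-tree tag updates as before */
	}
	spin_unlock_irqrestore(&mapping->tree_lock, flags);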



>  }
> diff --git a/mm/rmap.c b/mm/rmap.c
> index fcd593c..61f07cc 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -828,8 +828,8 @@ void page_add_new_anon_rmap(struct page *page,
>  void page_add_file_rmap(struct page *page)
>  {
>  	if (atomic_inc_and_test(&page->_mapcount)) {
> +		mem_cgroup_inc_page_stat_unlocked(page, MEMCG_NR_FILE_MAPPED);
>  		__inc_zone_page_state(page, NR_FILE_MAPPED);
> -		mem_cgroup_update_file_mapped(page, 1);
>  	}
>  }
>  
> @@ -860,8 +860,8 @@ void page_remove_rmap(struct page *page)
>  		mem_cgroup_uncharge_page(page);
>  		__dec_zone_page_state(page, NR_ANON_PAGES);
>  	} else {
> +		mem_cgroup_dec_page_stat_unlocked(page, MEMCG_NR_FILE_MAPPED);
>  		__dec_zone_page_state(page, NR_FILE_MAPPED);
> -		mem_cgroup_update_file_mapped(page, -1);
>  	}
>  	/*
>  	 * It would be tidy to reset the PageAnon mapping here,
> diff --git a/mm/truncate.c b/mm/truncate.c
> index e87e372..1613632 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -73,6 +73,8 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
>  	if (TestClearPageDirty(page)) {
>  		struct address_space *mapping = page->mapping;
>  		if (mapping && mapping_cap_account_dirty(mapping)) {
> +			mem_cgroup_dec_page_stat_unlocked(page,
> +					MEMCG_NR_FILE_DIRTY);
>  			dec_zone_page_state(page, NR_FILE_DIRTY);
>  			dec_bdi_stat(mapping->backing_dev_info,
>  					BDI_RECLAIMABLE);

cancel_dirty_page() is called after do_invalidatepage() but before
remove_from_page_cache(), and all of it is done under lock_page().

So we can use the "locked" accounting here as well.

If you feel the locked/unlocked accounting split is too complex, simply
disabling/enabling interrupts around lock_page_cgroup() is an option.
But please measure the performance impact before doing that.
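
That alternative would be confined to the accounting helpers in patch
3/4 and look roughly like this (a sketch only; lookup_page_cgroup() and
lock_page_cgroup() are the existing page_cgroup primitives, and the
middle comment stands for whatever counter update the helper already
performs):

	struct page_cgroup *pc = lookup_page_cgroup(page);
	unsigned long flags;

	local_irq_save(flags);
	lock_page_cgroup(pc);
	/* update the per-memcg file statistic as the helper does today */
	unlock_page_cgroup(pc);
	local_irq_restore(flags);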


Thanks,
-Kame



^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH -mmotm 4/4] memcg: dirty pages instrumentation
       [not found] ` <1267995474-9117-1-git-send-email-arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
@ 2010-03-07 20:57   ` Andrea Righi
  0 siblings, 0 replies; 68+ messages in thread
From: Andrea Righi @ 2010-03-07 20:57 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura
  Cc: Andrea Righi,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Trond Myklebust,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Suleiman Souhlal, Andrew Morton,
	Vivek Goyal

Apply the cgroup dirty pages accounting and limiting infrastructure to
the relevant kernel functions.

As a bonus, make determine_dirtyable_memory() static again: this
function isn't used anymore outside page writeback.

Signed-off-by: Andrea Righi <arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
---
 fs/fuse/file.c            |    5 +
 fs/nfs/write.c            |    6 +
 fs/nilfs2/segment.c       |   11 ++-
 include/linux/writeback.h |    2 -
 mm/filemap.c              |    1 +
 mm/page-writeback.c       |  224 ++++++++++++++++++++++++++++-----------------
 mm/rmap.c                 |    4 +-
 mm/truncate.c             |    2 +
 8 files changed, 165 insertions(+), 90 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index a9f5e13..9a542e5 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -11,6 +11,7 @@
 #include <linux/pagemap.h>
 #include <linux/slab.h>
 #include <linux/kernel.h>
+#include <linux/memcontrol.h>
 #include <linux/sched.h>
 #include <linux/module.h>
 
@@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
 
 	list_del(&req->writepages_entry);
 	dec_bdi_stat(bdi, BDI_WRITEBACK);
+	mem_cgroup_dec_page_stat_unlocked(req->pages[0],
+			MEMCG_NR_FILE_WRITEBACK_TEMP);
 	dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
 	bdi_writeout_inc(bdi);
 	wake_up(&fi->page_waitq);
@@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
 	req->inode = inode;
 
 	inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
+	mem_cgroup_inc_page_stat_unlocked(tmp_page,
+			MEMCG_NR_FILE_WRITEBACK_TEMP);
 	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
 	end_page_writeback(page);
 
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 53ff70e..a35e3c0 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -440,6 +440,8 @@ nfs_mark_request_commit(struct nfs_page *req)
 			NFS_PAGE_TAG_COMMIT);
 	nfsi->ncommit++;
 	spin_unlock(&inode->i_lock);
+	mem_cgroup_inc_page_stat_unlocked(req->wb_page,
+			MEMCG_NR_FILE_UNSTABLE_NFS);
 	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
 	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
 	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
@@ -451,6 +453,8 @@ nfs_clear_request_commit(struct nfs_page *req)
 	struct page *page = req->wb_page;
 
 	if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
+		mem_cgroup_dec_page_stat_unlocked(page,
+				MEMCG_NR_FILE_UNSTABLE_NFS);
 		dec_zone_page_state(page, NR_UNSTABLE_NFS);
 		dec_bdi_stat(page->mapping->backing_dev_info, BDI_RECLAIMABLE);
 		return 1;
@@ -1277,6 +1281,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
 		req = nfs_list_entry(head->next);
 		nfs_list_remove_request(req);
 		nfs_mark_request_commit(req);
+		mem_cgroup_dec_page_stat_unlocked(req->wb_page,
+				MEMCG_NR_FILE_UNSTABLE_NFS);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
 		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
 				BDI_RECLAIMABLE);
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index ada2f1b..fb79558 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -24,6 +24,7 @@
 #include <linux/pagemap.h>
 #include <linux/buffer_head.h>
 #include <linux/writeback.h>
+#include <linux/memcontrol.h>
 #include <linux/bio.h>
 #include <linux/completion.h>
 #include <linux/blkdev.h>
@@ -1660,8 +1661,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
 	} while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
 	kunmap_atomic(kaddr, KM_USER0);
 
-	if (!TestSetPageWriteback(clone_page))
+	if (!TestSetPageWriteback(clone_page)) {
+		mem_cgroup_inc_page_stat_unlocked(clone_page,
+				MEMCG_NR_FILE_WRITEBACK);
 		inc_zone_page_state(clone_page, NR_WRITEBACK);
+	}
 	unlock_page(clone_page);
 
 	return 0;
@@ -1783,8 +1787,11 @@ static void __nilfs_end_page_io(struct page *page, int err)
 	}
 
 	if (buffer_nilfs_allocated(page_buffers(page))) {
-		if (TestClearPageWriteback(page))
+		if (TestClearPageWriteback(page)) {
+			mem_cgroup_dec_page_stat_unlocked(page,
+					MEMCG_NR_FILE_WRITEBACK);
 			dec_zone_page_state(page, NR_WRITEBACK);
+		}
 	} else
 		end_page_writeback(page);
 }
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index dd9512d..39e4cb2 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -117,8 +117,6 @@ extern int vm_highmem_is_dirtyable;
 extern int block_dump;
 extern int laptop_mode;
 
-extern unsigned long determine_dirtyable_memory(void);
-
 extern int dirty_background_ratio_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp,
 		loff_t *ppos);
diff --git a/mm/filemap.c b/mm/filemap.c
index 62cbac0..37f89d1 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
 	 * having removed the page entirely.
 	 */
 	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_dec_page_stat_locked(page, MEMCG_NR_FILE_DIRTY);
 		dec_zone_page_state(page, NR_FILE_DIRTY);
 		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
 	}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index ab84693..9d4503a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -131,6 +131,111 @@ static struct prop_descriptor vm_completions;
 static struct prop_descriptor vm_dirties;
 
 /*
+ * Work out the current dirty-memory clamping and background writeout
+ * thresholds.
+ *
+ * The main aim here is to lower them aggressively if there is a lot of mapped
+ * memory around.  To avoid stressing page reclaim with lots of unreclaimable
+ * pages.  It is better to clamp down on writers than to start swapping, and
+ * performing lots of scanning.
+ *
+ * We only allow 1/2 of the currently-unmapped memory to be dirtied.
+ *
+ * We don't permit the clamping level to fall below 5% - that is getting rather
+ * excessive.
+ *
+ * We make sure that the background writeout level is below the adjusted
+ * clamping level.
+ */
+
+static unsigned long highmem_dirtyable_memory(unsigned long total)
+{
+#ifdef CONFIG_HIGHMEM
+	int node;
+	unsigned long x = 0;
+
+	for_each_node_state(node, N_HIGH_MEMORY) {
+		struct zone *z =
+			&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
+
+		x += zone_page_state(z, NR_FREE_PAGES) +
+		     zone_reclaimable_pages(z);
+	}
+	/*
+	 * Make sure that the number of highmem pages is never larger
+	 * than the number of the total dirtyable memory. This can only
+	 * occur in very strange VM situations but we want to make sure
+	 * that this does not occur.
+	 */
+	return min(x, total);
+#else
+	return 0;
+#endif
+}
+
+static unsigned long get_global_dirtyable_memory(void)
+{
+	unsigned long memory;
+
+	memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
+	if (!vm_highmem_is_dirtyable)
+		memory -= highmem_dirtyable_memory(memory);
+	return memory + 1;
+}
+
+static unsigned long get_dirtyable_memory(void)
+{
+	unsigned long memory;
+	s64 memcg_memory;
+
+	memory = get_global_dirtyable_memory();
+	if (!mem_cgroup_has_dirty_limit())
+		return memory;
+	memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
+	BUG_ON(memcg_memory < 0);
+
+	return min((unsigned long)memcg_memory, memory);
+}
+
+static long get_reclaimable_pages(void)
+{
+	s64 ret;
+
+	if (!mem_cgroup_has_dirty_limit())
+		return global_page_state(NR_FILE_DIRTY) +
+			global_page_state(NR_UNSTABLE_NFS);
+	ret = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
+	BUG_ON(ret < 0);
+
+	return ret;
+}
+
+static long get_writeback_pages(void)
+{
+	s64 ret;
+
+	if (!mem_cgroup_has_dirty_limit())
+		return global_page_state(NR_WRITEBACK);
+	ret = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
+	BUG_ON(ret < 0);
+
+	return ret;
+}
+
+static unsigned long get_dirty_writeback_pages(void)
+{
+	s64 ret;
+
+	if (!mem_cgroup_has_dirty_limit())
+		return global_page_state(NR_UNSTABLE_NFS) +
+			global_page_state(NR_WRITEBACK);
+	ret = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
+	BUG_ON(ret < 0);
+
+	return ret;
+}
+
+/*
  * couple the period to the dirty_ratio:
  *
  *   period/2 ~ roundup_pow_of_two(dirty limit)
@@ -142,7 +247,7 @@ static int calc_period_shift(void)
 	if (vm_dirty_bytes)
 		dirty_total = vm_dirty_bytes / PAGE_SIZE;
 	else
-		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
+		dirty_total = (vm_dirty_ratio * get_global_dirtyable_memory()) /
 				100;
 	return 2 + ilog2(dirty_total - 1);
 }
@@ -355,92 +460,34 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned max_ratio)
 }
 EXPORT_SYMBOL(bdi_set_max_ratio);
 
-/*
- * Work out the current dirty-memory clamping and background writeout
- * thresholds.
- *
- * The main aim here is to lower them aggressively if there is a lot of mapped
- * memory around.  To avoid stressing page reclaim with lots of unreclaimable
- * pages.  It is better to clamp down on writers than to start swapping, and
- * performing lots of scanning.
- *
- * We only allow 1/2 of the currently-unmapped memory to be dirtied.
- *
- * We don't permit the clamping level to fall below 5% - that is getting rather
- * excessive.
- *
- * We make sure that the background writeout level is below the adjusted
- * clamping level.
- */
-
-static unsigned long highmem_dirtyable_memory(unsigned long total)
-{
-#ifdef CONFIG_HIGHMEM
-	int node;
-	unsigned long x = 0;
-
-	for_each_node_state(node, N_HIGH_MEMORY) {
-		struct zone *z =
-			&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
-
-		x += zone_page_state(z, NR_FREE_PAGES) +
-		     zone_reclaimable_pages(z);
-	}
-	/*
-	 * Make sure that the number of highmem pages is never larger
-	 * than the number of the total dirtyable memory. This can only
-	 * occur in very strange VM situations but we want to make sure
-	 * that this does not occur.
-	 */
-	return min(x, total);
-#else
-	return 0;
-#endif
-}
-
-/**
- * determine_dirtyable_memory - amount of memory that may be used
- *
- * Returns the numebr of pages that can currently be freed and used
- * by the kernel for direct mappings.
- */
-unsigned long determine_dirtyable_memory(void)
-{
-	unsigned long x;
-
-	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
-
-	if (!vm_highmem_is_dirtyable)
-		x -= highmem_dirtyable_memory(x);
-
-	return x + 1;	/* Ensure that we never return 0 */
-}
-
 void
 get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
 		 unsigned long *pbdi_dirty, struct backing_dev_info *bdi)
 {
-	unsigned long background;
-	unsigned long dirty;
-	unsigned long available_memory = determine_dirtyable_memory();
+	unsigned long dirty, background;
+	unsigned long available_memory = get_dirtyable_memory();
 	struct task_struct *tsk;
+	struct vm_dirty_param dirty_param;
 
-	if (vm_dirty_bytes)
-		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
+	get_vm_dirty_param(&dirty_param);
+
+	if (dirty_param.dirty_bytes)
+		dirty = DIV_ROUND_UP(dirty_param.dirty_bytes, PAGE_SIZE);
 	else {
 		int dirty_ratio;
 
-		dirty_ratio = vm_dirty_ratio;
+		dirty_ratio = dirty_param.dirty_ratio;
 		if (dirty_ratio < 5)
 			dirty_ratio = 5;
 		dirty = (dirty_ratio * available_memory) / 100;
 	}
 
-	if (dirty_background_bytes)
-		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
+	if (dirty_param.dirty_background_bytes)
+		background = DIV_ROUND_UP(dirty_param.dirty_background_bytes,
+						PAGE_SIZE);
 	else
-		background = (dirty_background_ratio * available_memory) / 100;
-
+		background = (dirty_param.dirty_background_ratio *
+						available_memory) / 100;
 	if (background >= dirty)
 		background = dirty / 2;
 	tsk = current;
@@ -505,9 +552,8 @@ static void balance_dirty_pages(struct address_space *mapping,
 		get_dirty_limits(&background_thresh, &dirty_thresh,
 				&bdi_thresh, bdi);
 
-		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
-					global_page_state(NR_UNSTABLE_NFS);
-		nr_writeback = global_page_state(NR_WRITEBACK);
+		nr_reclaimable = get_reclaimable_pages();
+		nr_writeback = get_writeback_pages();
 
 		bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
 		bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
@@ -593,10 +639,9 @@ static void balance_dirty_pages(struct address_space *mapping,
 	 * In normal mode, we start background writeout at the lower
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
+	nr_reclaimable = get_reclaimable_pages();
 	if ((laptop_mode && pages_written) ||
-	    (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
-			       + global_page_state(NR_UNSTABLE_NFS))
-					  > background_thresh)))
+	    (!laptop_mode && (nr_reclaimable > background_thresh)))
 		bdi_start_writeback(bdi, NULL, 0);
 }
 
@@ -660,6 +705,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
 	unsigned long dirty_thresh;
 
         for ( ; ; ) {
+		unsigned long dirty;
+
 		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
 
                 /*
@@ -668,10 +715,10 @@ void throttle_vm_writeout(gfp_t gfp_mask)
                  */
                 dirty_thresh += dirty_thresh / 10;      /* wheeee... */
 
-                if (global_page_state(NR_UNSTABLE_NFS) +
-			global_page_state(NR_WRITEBACK) <= dirty_thresh)
-                        	break;
-                congestion_wait(BLK_RW_ASYNC, HZ/10);
+		dirty = get_dirty_writeback_pages();
+		if (dirty <= dirty_thresh)
+			break;
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/*
 		 * The caller might hold locks which can prevent IO completion
@@ -1078,6 +1125,7 @@ int __set_page_dirty_no_writeback(struct page *page)
 void account_page_dirtied(struct page *page, struct address_space *mapping)
 {
 	if (mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_inc_page_stat_locked(page, MEMCG_NR_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_FILE_DIRTY);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
 		task_dirty_inc(current);
@@ -1279,6 +1327,8 @@ int clear_page_dirty_for_io(struct page *page)
 		 * for more comments.
 		 */
 		if (TestClearPageDirty(page)) {
+			mem_cgroup_dec_page_stat_unlocked(page,
+					MEMCG_NR_FILE_DIRTY);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
@@ -1314,8 +1364,11 @@ int test_clear_page_writeback(struct page *page)
 	} else {
 		ret = TestClearPageWriteback(page);
 	}
-	if (ret)
+	if (ret) {
+		mem_cgroup_dec_page_stat_unlocked(page,
+				MEMCG_NR_FILE_WRITEBACK);
 		dec_zone_page_state(page, NR_WRITEBACK);
+	}
 	return ret;
 }
 
@@ -1345,8 +1398,11 @@ int test_set_page_writeback(struct page *page)
 	} else {
 		ret = TestSetPageWriteback(page);
 	}
-	if (!ret)
+	if (!ret) {
+		mem_cgroup_inc_page_stat_unlocked(page,
+				MEMCG_NR_FILE_WRITEBACK);
 		inc_zone_page_state(page, NR_WRITEBACK);
+	}
 	return ret;
 
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index fcd593c..61f07cc 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -828,8 +828,8 @@ void page_add_new_anon_rmap(struct page *page,
 void page_add_file_rmap(struct page *page)
 {
 	if (atomic_inc_and_test(&page->_mapcount)) {
+		mem_cgroup_inc_page_stat_unlocked(page, MEMCG_NR_FILE_MAPPED);
 		__inc_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_update_file_mapped(page, 1);
 	}
 }
 
@@ -860,8 +860,8 @@ void page_remove_rmap(struct page *page)
 		mem_cgroup_uncharge_page(page);
 		__dec_zone_page_state(page, NR_ANON_PAGES);
 	} else {
+		mem_cgroup_dec_page_stat_unlocked(page, MEMCG_NR_FILE_MAPPED);
 		__dec_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_update_file_mapped(page, -1);
 	}
 	/*
 	 * It would be tidy to reset the PageAnon mapping here,
diff --git a/mm/truncate.c b/mm/truncate.c
index e87e372..1613632 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -73,6 +73,8 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
 	if (TestClearPageDirty(page)) {
 		struct address_space *mapping = page->mapping;
 		if (mapping && mapping_cap_account_dirty(mapping)) {
+			mem_cgroup_dec_page_stat_unlocked(page,
+					MEMCG_NR_FILE_DIRTY);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
-- 
1.6.3.3

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH -mmotm 4/4] memcg: dirty pages instrumentation
  2010-03-07 20:57 [PATCH -mmotm 0/4] memcg: per cgroup dirty limit (v5) Andrea Righi
@ 2010-03-07 20:57   ` Andrea Righi
  2010-03-07 20:57   ` Andrea Righi
  1 sibling, 0 replies; 68+ messages in thread
From: Andrea Righi @ 2010-03-07 20:57 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, Balbir Singh, Daisuke Nishimura
  Cc: Vivek Goyal, Peter Zijlstra, Trond Myklebust, Suleiman Souhlal,
	Greg Thelen, Kirill A. Shutemov, Andrew Morton, containers,
	linux-kernel, linux-mm, Andrea Righi

Apply the cgroup dirty pages accounting and limiting infrastructure to
the relevant kernel functions.

As a bonus, make determine_dirtyable_memory() static again: this
function isn't used anymore outside page writeback.

Signed-off-by: Andrea Righi <arighi@develer.com>
---
 fs/fuse/file.c            |    5 +
 fs/nfs/write.c            |    6 +
 fs/nilfs2/segment.c       |   11 ++-
 include/linux/writeback.h |    2 -
 mm/filemap.c              |    1 +
 mm/page-writeback.c       |  224 ++++++++++++++++++++++++++++-----------------
 mm/rmap.c                 |    4 +-
 mm/truncate.c             |    2 +
 8 files changed, 165 insertions(+), 90 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index a9f5e13..9a542e5 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -11,6 +11,7 @@
 #include <linux/pagemap.h>
 #include <linux/slab.h>
 #include <linux/kernel.h>
+#include <linux/memcontrol.h>
 #include <linux/sched.h>
 #include <linux/module.h>
 
@@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
 
 	list_del(&req->writepages_entry);
 	dec_bdi_stat(bdi, BDI_WRITEBACK);
+	mem_cgroup_dec_page_stat_unlocked(req->pages[0],
+			MEMCG_NR_FILE_WRITEBACK_TEMP);
 	dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
 	bdi_writeout_inc(bdi);
 	wake_up(&fi->page_waitq);
@@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page)
 	req->inode = inode;
 
 	inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
+	mem_cgroup_inc_page_stat_unlocked(tmp_page,
+			MEMCG_NR_FILE_WRITEBACK_TEMP);
 	inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
 	end_page_writeback(page);
 
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 53ff70e..a35e3c0 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -440,6 +440,8 @@ nfs_mark_request_commit(struct nfs_page *req)
 			NFS_PAGE_TAG_COMMIT);
 	nfsi->ncommit++;
 	spin_unlock(&inode->i_lock);
+	mem_cgroup_inc_page_stat_unlocked(req->wb_page,
+			MEMCG_NR_FILE_UNSTABLE_NFS);
 	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
 	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
 	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
@@ -451,6 +453,8 @@ nfs_clear_request_commit(struct nfs_page *req)
 	struct page *page = req->wb_page;
 
 	if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
+		mem_cgroup_dec_page_stat_unlocked(page,
+				MEMCG_NR_FILE_UNSTABLE_NFS);
 		dec_zone_page_state(page, NR_UNSTABLE_NFS);
 		dec_bdi_stat(page->mapping->backing_dev_info, BDI_RECLAIMABLE);
 		return 1;
@@ -1277,6 +1281,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
 		req = nfs_list_entry(head->next);
 		nfs_list_remove_request(req);
 		nfs_mark_request_commit(req);
+		mem_cgroup_dec_page_stat_unlocked(req->wb_page,
+				MEMCG_NR_FILE_UNSTABLE_NFS);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
 		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
 				BDI_RECLAIMABLE);
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index ada2f1b..fb79558 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -24,6 +24,7 @@
 #include <linux/pagemap.h>
 #include <linux/buffer_head.h>
 #include <linux/writeback.h>
+#include <linux/memcontrol.h>
 #include <linux/bio.h>
 #include <linux/completion.h>
 #include <linux/blkdev.h>
@@ -1660,8 +1661,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out)
 	} while (bh = bh->b_this_page, bh2 = bh2->b_this_page, bh != head);
 	kunmap_atomic(kaddr, KM_USER0);
 
-	if (!TestSetPageWriteback(clone_page))
+	if (!TestSetPageWriteback(clone_page)) {
+		mem_cgroup_inc_page_stat_unlocked(clone_page,
+				MEMCG_NR_FILE_WRITEBACK);
 		inc_zone_page_state(clone_page, NR_WRITEBACK);
+	}
 	unlock_page(clone_page);
 
 	return 0;
@@ -1783,8 +1787,11 @@ static void __nilfs_end_page_io(struct page *page, int err)
 	}
 
 	if (buffer_nilfs_allocated(page_buffers(page))) {
-		if (TestClearPageWriteback(page))
+		if (TestClearPageWriteback(page)) {
+			mem_cgroup_dec_page_stat_unlocked(page,
+					MEMCG_NR_FILE_WRITEBACK);
 			dec_zone_page_state(page, NR_WRITEBACK);
+		}
 	} else
 		end_page_writeback(page);
 }
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index dd9512d..39e4cb2 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -117,8 +117,6 @@ extern int vm_highmem_is_dirtyable;
 extern int block_dump;
 extern int laptop_mode;
 
-extern unsigned long determine_dirtyable_memory(void);
-
 extern int dirty_background_ratio_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp,
 		loff_t *ppos);
diff --git a/mm/filemap.c b/mm/filemap.c
index 62cbac0..37f89d1 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
 	 * having removed the page entirely.
 	 */
 	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_dec_page_stat_locked(page, MEMCG_NR_FILE_DIRTY);
 		dec_zone_page_state(page, NR_FILE_DIRTY);
 		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
 	}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index ab84693..9d4503a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -131,6 +131,111 @@ static struct prop_descriptor vm_completions;
 static struct prop_descriptor vm_dirties;
 
 /*
+ * Work out the current dirty-memory clamping and background writeout
+ * thresholds.
+ *
+ * The main aim here is to lower them aggressively if there is a lot of mapped
+ * memory around.  To avoid stressing page reclaim with lots of unreclaimable
+ * pages.  It is better to clamp down on writers than to start swapping, and
+ * performing lots of scanning.
+ *
+ * We only allow 1/2 of the currently-unmapped memory to be dirtied.
+ *
+ * We don't permit the clamping level to fall below 5% - that is getting rather
+ * excessive.
+ *
+ * We make sure that the background writeout level is below the adjusted
+ * clamping level.
+ */
+
+static unsigned long highmem_dirtyable_memory(unsigned long total)
+{
+#ifdef CONFIG_HIGHMEM
+	int node;
+	unsigned long x = 0;
+
+	for_each_node_state(node, N_HIGH_MEMORY) {
+		struct zone *z =
+			&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
+
+		x += zone_page_state(z, NR_FREE_PAGES) +
+		     zone_reclaimable_pages(z);
+	}
+	/*
+	 * Make sure that the number of highmem pages is never larger
+	 * than the number of the total dirtyable memory. This can only
+	 * occur in very strange VM situations but we want to make sure
+	 * that this does not occur.
+	 */
+	return min(x, total);
+#else
+	return 0;
+#endif
+}
+
+static unsigned long get_global_dirtyable_memory(void)
+{
+	unsigned long memory;
+
+	memory = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
+	if (!vm_highmem_is_dirtyable)
+		memory -= highmem_dirtyable_memory(memory);
+	return memory + 1;
+}
+
+static unsigned long get_dirtyable_memory(void)
+{
+	unsigned long memory;
+	s64 memcg_memory;
+
+	memory = get_global_dirtyable_memory();
+	if (!mem_cgroup_has_dirty_limit())
+		return memory;
+	memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
+	BUG_ON(memcg_memory < 0);
+
+	return min((unsigned long)memcg_memory, memory);
+}
+
+static long get_reclaimable_pages(void)
+{
+	s64 ret;
+
+	if (!mem_cgroup_has_dirty_limit())
+		return global_page_state(NR_FILE_DIRTY) +
+			global_page_state(NR_UNSTABLE_NFS);
+	ret = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
+	BUG_ON(ret < 0);
+
+	return ret;
+}
+
+static long get_writeback_pages(void)
+{
+	s64 ret;
+
+	if (!mem_cgroup_has_dirty_limit())
+		return global_page_state(NR_WRITEBACK);
+	ret = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
+	BUG_ON(ret < 0);
+
+	return ret;
+}
+
+static unsigned long get_dirty_writeback_pages(void)
+{
+	s64 ret;
+
+	if (!mem_cgroup_has_dirty_limit())
+		return global_page_state(NR_UNSTABLE_NFS) +
+			global_page_state(NR_WRITEBACK);
+	ret = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
+	BUG_ON(ret < 0);
+
+	return ret;
+}
+
+/*
  * couple the period to the dirty_ratio:
  *
  *   period/2 ~ roundup_pow_of_two(dirty limit)
@@ -142,7 +247,7 @@ static int calc_period_shift(void)
 	if (vm_dirty_bytes)
 		dirty_total = vm_dirty_bytes / PAGE_SIZE;
 	else
-		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
+		dirty_total = (vm_dirty_ratio * get_global_dirtyable_memory()) /
 				100;
 	return 2 + ilog2(dirty_total - 1);
 }
@@ -355,92 +460,34 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned max_ratio)
 }
 EXPORT_SYMBOL(bdi_set_max_ratio);
 
-/*
- * Work out the current dirty-memory clamping and background writeout
- * thresholds.
- *
- * The main aim here is to lower them aggressively if there is a lot of mapped
- * memory around.  To avoid stressing page reclaim with lots of unreclaimable
- * pages.  It is better to clamp down on writers than to start swapping, and
- * performing lots of scanning.
- *
- * We only allow 1/2 of the currently-unmapped memory to be dirtied.
- *
- * We don't permit the clamping level to fall below 5% - that is getting rather
- * excessive.
- *
- * We make sure that the background writeout level is below the adjusted
- * clamping level.
- */
-
-static unsigned long highmem_dirtyable_memory(unsigned long total)
-{
-#ifdef CONFIG_HIGHMEM
-	int node;
-	unsigned long x = 0;
-
-	for_each_node_state(node, N_HIGH_MEMORY) {
-		struct zone *z =
-			&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
-
-		x += zone_page_state(z, NR_FREE_PAGES) +
-		     zone_reclaimable_pages(z);
-	}
-	/*
-	 * Make sure that the number of highmem pages is never larger
-	 * than the number of the total dirtyable memory. This can only
-	 * occur in very strange VM situations but we want to make sure
-	 * that this does not occur.
-	 */
-	return min(x, total);
-#else
-	return 0;
-#endif
-}
-
-/**
- * determine_dirtyable_memory - amount of memory that may be used
- *
- * Returns the numebr of pages that can currently be freed and used
- * by the kernel for direct mappings.
- */
-unsigned long determine_dirtyable_memory(void)
-{
-	unsigned long x;
-
-	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
-
-	if (!vm_highmem_is_dirtyable)
-		x -= highmem_dirtyable_memory(x);
-
-	return x + 1;	/* Ensure that we never return 0 */
-}
-
 void
 get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
 		 unsigned long *pbdi_dirty, struct backing_dev_info *bdi)
 {
-	unsigned long background;
-	unsigned long dirty;
-	unsigned long available_memory = determine_dirtyable_memory();
+	unsigned long dirty, background;
+	unsigned long available_memory = get_dirtyable_memory();
 	struct task_struct *tsk;
+	struct vm_dirty_param dirty_param;
 
-	if (vm_dirty_bytes)
-		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
+	get_vm_dirty_param(&dirty_param);
+
+	if (dirty_param.dirty_bytes)
+		dirty = DIV_ROUND_UP(dirty_param.dirty_bytes, PAGE_SIZE);
 	else {
 		int dirty_ratio;
 
-		dirty_ratio = vm_dirty_ratio;
+		dirty_ratio = dirty_param.dirty_ratio;
 		if (dirty_ratio < 5)
 			dirty_ratio = 5;
 		dirty = (dirty_ratio * available_memory) / 100;
 	}
 
-	if (dirty_background_bytes)
-		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
+	if (dirty_param.dirty_background_bytes)
+		background = DIV_ROUND_UP(dirty_param.dirty_background_bytes,
+						PAGE_SIZE);
 	else
-		background = (dirty_background_ratio * available_memory) / 100;
-
+		background = (dirty_param.dirty_background_ratio *
+						available_memory) / 100;
 	if (background >= dirty)
 		background = dirty / 2;
 	tsk = current;
@@ -505,9 +552,8 @@ static void balance_dirty_pages(struct address_space *mapping,
 		get_dirty_limits(&background_thresh, &dirty_thresh,
 				&bdi_thresh, bdi);
 
-		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
-					global_page_state(NR_UNSTABLE_NFS);
-		nr_writeback = global_page_state(NR_WRITEBACK);
+		nr_reclaimable = get_reclaimable_pages();
+		nr_writeback = get_writeback_pages();
 
 		bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
 		bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
@@ -593,10 +639,9 @@ static void balance_dirty_pages(struct address_space *mapping,
 	 * In normal mode, we start background writeout at the lower
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
+	nr_reclaimable = get_reclaimable_pages();
 	if ((laptop_mode && pages_written) ||
-	    (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
-			       + global_page_state(NR_UNSTABLE_NFS))
-					  > background_thresh)))
+	    (!laptop_mode && (nr_reclaimable > background_thresh)))
 		bdi_start_writeback(bdi, NULL, 0);
 }
 
@@ -660,6 +705,8 @@ void throttle_vm_writeout(gfp_t gfp_mask)
 	unsigned long dirty_thresh;
 
         for ( ; ; ) {
+		unsigned long dirty;
+
 		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
 
                 /*
@@ -668,10 +715,10 @@ void throttle_vm_writeout(gfp_t gfp_mask)
                  */
                 dirty_thresh += dirty_thresh / 10;      /* wheeee... */
 
-                if (global_page_state(NR_UNSTABLE_NFS) +
-			global_page_state(NR_WRITEBACK) <= dirty_thresh)
-                        	break;
-                congestion_wait(BLK_RW_ASYNC, HZ/10);
+		dirty = get_dirty_writeback_pages();
+		if (dirty <= dirty_thresh)
+			break;
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/*
 		 * The caller might hold locks which can prevent IO completion
@@ -1078,6 +1125,7 @@ int __set_page_dirty_no_writeback(struct page *page)
 void account_page_dirtied(struct page *page, struct address_space *mapping)
 {
 	if (mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_inc_page_stat_locked(page, MEMCG_NR_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_FILE_DIRTY);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
 		task_dirty_inc(current);
@@ -1279,6 +1327,8 @@ int clear_page_dirty_for_io(struct page *page)
 		 * for more comments.
 		 */
 		if (TestClearPageDirty(page)) {
+			mem_cgroup_dec_page_stat_unlocked(page,
+					MEMCG_NR_FILE_DIRTY);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
@@ -1314,8 +1364,11 @@ int test_clear_page_writeback(struct page *page)
 	} else {
 		ret = TestClearPageWriteback(page);
 	}
-	if (ret)
+	if (ret) {
+		mem_cgroup_dec_page_stat_unlocked(page,
+				MEMCG_NR_FILE_WRITEBACK);
 		dec_zone_page_state(page, NR_WRITEBACK);
+	}
 	return ret;
 }
 
@@ -1345,8 +1398,11 @@ int test_set_page_writeback(struct page *page)
 	} else {
 		ret = TestSetPageWriteback(page);
 	}
-	if (!ret)
+	if (!ret) {
+		mem_cgroup_inc_page_stat_unlocked(page,
+				MEMCG_NR_FILE_WRITEBACK);
 		inc_zone_page_state(page, NR_WRITEBACK);
+	}
 	return ret;
 
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index fcd593c..61f07cc 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -828,8 +828,8 @@ void page_add_new_anon_rmap(struct page *page,
 void page_add_file_rmap(struct page *page)
 {
 	if (atomic_inc_and_test(&page->_mapcount)) {
+		mem_cgroup_inc_page_stat_unlocked(page, MEMCG_NR_FILE_MAPPED);
 		__inc_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_update_file_mapped(page, 1);
 	}
 }
 
@@ -860,8 +860,8 @@ void page_remove_rmap(struct page *page)
 		mem_cgroup_uncharge_page(page);
 		__dec_zone_page_state(page, NR_ANON_PAGES);
 	} else {
+		mem_cgroup_dec_page_stat_unlocked(page, MEMCG_NR_FILE_MAPPED);
 		__dec_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_update_file_mapped(page, -1);
 	}
 	/*
 	 * It would be tidy to reset the PageAnon mapping here,
diff --git a/mm/truncate.c b/mm/truncate.c
index e87e372..1613632 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -73,6 +73,8 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
 	if (TestClearPageDirty(page)) {
 		struct address_space *mapping = page->mapping;
 		if (mapping && mapping_cap_account_dirty(mapping)) {
+			mem_cgroup_dec_page_stat_unlocked(page,
+					MEMCG_NR_FILE_DIRTY);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 68+ messages in thread


end of thread, other threads:[~2010-03-08  2:35 UTC | newest]

Thread overview: 68+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-03-04 10:40 [PATCH -mmotm 0/4] memcg: per cgroup dirty limit (v4) Andrea Righi
2010-03-04 10:40 ` Andrea Righi
2010-03-04 10:40 ` [PATCH -mmotm 1/4] memcg: dirty memory documentation Andrea Righi
2010-03-04 10:40   ` Andrea Righi
     [not found] ` <1267699215-4101-1-git-send-email-arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
2010-03-04 10:40   ` Andrea Righi
2010-03-04 10:40   ` [PATCH -mmotm 2/4] page_cgroup: introduce file cache flags Andrea Righi
2010-03-04 10:40   ` [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure Andrea Righi
2010-03-04 10:40   ` [PATCH -mmotm 4/4] memcg: dirty pages instrumentation Andrea Righi
2010-03-04 17:11   ` [PATCH -mmotm 0/4] memcg: per cgroup dirty limit (v4) Balbir Singh
2010-03-04 10:40 ` [PATCH -mmotm 2/4] page_cgroup: introduce file cache flags Andrea Righi
2010-03-04 10:40   ` Andrea Righi
2010-03-05  6:32   ` Balbir Singh
2010-03-05  6:32     ` Balbir Singh
2010-03-05 22:35     ` Andrea Righi
2010-03-05 22:35       ` Andrea Righi
     [not found]     ` <20100305063249.GH3073-SINUvgVNF2CyUtPGxGje5AC/G2K4zDHf@public.gmane.org>
2010-03-05 22:35       ` Andrea Righi
     [not found]   ` <1267699215-4101-3-git-send-email-arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
2010-03-05  6:32     ` Balbir Singh
2010-03-04 10:40 ` [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure Andrea Righi
2010-03-04 10:40   ` Andrea Righi
2010-03-04 11:54   ` Kirill A. Shutemov
2010-03-04 11:54     ` Kirill A. Shutemov
2010-03-05  1:12   ` Daisuke Nishimura
2010-03-05  1:12     ` Daisuke Nishimura
2010-03-05  1:58     ` KAMEZAWA Hiroyuki
2010-03-05  1:58       ` KAMEZAWA Hiroyuki
     [not found]       ` <20100305105855.9b53176c.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
2010-03-05  7:01         ` Balbir Singh
2010-03-05  7:01           ` Balbir Singh
2010-03-05  7:01           ` Balbir Singh
2010-03-05 22:14         ` Andrea Righi
2010-03-05 22:14       ` Andrea Righi
2010-03-05 22:14         ` Andrea Righi
     [not found]     ` <20100305101234.909001e8.nishimura-YQH0OdQVrdy45+QrQBaojngSJqDPrsil@public.gmane.org>
2010-03-05  1:58       ` KAMEZAWA Hiroyuki
2010-03-05 22:14       ` Andrea Righi
2010-03-05 22:14     ` Andrea Righi
2010-03-05 22:14       ` Andrea Righi
     [not found]   ` <1267699215-4101-4-git-send-email-arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
2010-03-04 11:54     ` Kirill A. Shutemov
2010-03-05  1:12     ` Daisuke Nishimura
2010-03-04 10:40 ` [PATCH -mmotm 4/4] memcg: dirty pages instrumentation Andrea Righi
2010-03-04 10:40   ` Andrea Righi
2010-03-04 16:18   ` Vivek Goyal
2010-03-04 16:18     ` Vivek Goyal
     [not found]     ` <20100304161828.GC18786-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2010-03-04 16:28       ` Andrea Righi
2010-03-04 16:28     ` Andrea Righi
2010-03-04 16:28       ` Andrea Righi
     [not found]   ` <1267699215-4101-5-git-send-email-arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
2010-03-04 16:18     ` Vivek Goyal
2010-03-04 19:41     ` Vivek Goyal
2010-03-04 19:41       ` Vivek Goyal
2010-03-04 19:41       ` Vivek Goyal
     [not found]       ` <20100304194144.GE18786-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2010-03-04 21:51         ` Andrea Righi
2010-03-04 21:51           ` Andrea Righi
2010-03-04 21:51           ` Andrea Righi
2010-03-05  6:38     ` Balbir Singh
2010-03-05  6:38   ` Balbir Singh
2010-03-05  6:38     ` Balbir Singh
     [not found]     ` <20100305063843.GI3073-SINUvgVNF2CyUtPGxGje5AC/G2K4zDHf@public.gmane.org>
2010-03-05 22:55       ` Andrea Righi
2010-03-05 22:55     ` Andrea Righi
2010-03-05 22:55       ` Andrea Righi
2010-03-04 17:11 ` [PATCH -mmotm 0/4] memcg: per cgroup dirty limit (v4) Balbir Singh
2010-03-04 17:11   ` Balbir Singh
2010-03-04 21:37   ` Andrea Righi
2010-03-04 21:37     ` Andrea Righi
     [not found]   ` <20100304171143.GG3073-SINUvgVNF2CyUtPGxGje5AC/G2K4zDHf@public.gmane.org>
2010-03-04 21:37     ` Andrea Righi
2010-03-07 20:57 [PATCH -mmotm 0/4] memcg: per cgroup dirty limit (v5) Andrea Righi
     [not found] ` <1267995474-9117-1-git-send-email-arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
2010-03-07 20:57   ` [PATCH -mmotm 4/4] memcg: dirty pages instrumentation Andrea Righi
2010-03-07 20:57 ` Andrea Righi
2010-03-07 20:57   ` Andrea Righi
2010-03-08  2:31   ` KAMEZAWA Hiroyuki
2010-03-08  2:31     ` KAMEZAWA Hiroyuki
     [not found]   ` <1267995474-9117-5-git-send-email-arighi-vWjgImWzx8FBDgjK7y7TUQ@public.gmane.org>
2010-03-08  2:31     ` KAMEZAWA Hiroyuki
