* [PATCH v4 00/11] memcg: per cgroup dirty page accounting
@ 2010-10-29  7:09 ` Greg Thelen
  0 siblings, 0 replies; 75+ messages in thread
From: Greg Thelen @ 2010-10-29  7:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes, Wu Fengguang, Greg Thelen

Changes since v3:
- Refactored the balance_dirty_pages() dirty limit checking to use the new
  struct dirty_info, which is used to compare both system and memcg dirty
  limits against usage.
- Disabled memcg dirty limits when memory.use_hierarchy=1.  An enhancement is
  needed to check the chain of parents to ensure that no dirty limit is
  exceeded.
- Ported to mmotm-2010-10-22-16-36.

Changes since v2:
- Rather than disabling softirq in lock_page_cgroup(), introduce a separate lock
  to synchronize memcg page accounting with migration.  This only affects
  patch 4 of the series: it previously disabled softirq and now introduces the
  new lock instead.

Changes since v1:
- Renamed "nfs"/"total_nfs" to "nfs_unstable"/"total_nfs_unstable" in per cgroup
  memory.stat to match /proc/meminfo.
- Avoid lockdep warnings by using rcu_read_[un]lock() in
  mem_cgroup_has_dirty_limit().
- Fixed lockdep issue in mem_cgroup_read_stat() which is exposed by these
  patches.
- Remove redundant comments.
- Rename (for clarity):
  - mem_cgroup_write_page_stat_item -> mem_cgroup_page_stat_item
  - mem_cgroup_read_page_stat_item -> mem_cgroup_nr_pages_item
- Renamed newly created cgroupfs control files:
  - memory.dirty_bytes -> memory.dirty_limit_in_bytes
  - memory.dirty_background_bytes -> memory.dirty_background_limit_in_bytes
- Removed unnecessary get_ prefix from get_xxx() functions.
- Allow [kKmMgG] suffixes for newly created dirty limit value cgroupfs files.
- Disable softirq rather than hardirq in lock_page_cgroup().
- Made mem_cgroup_move_account_page_stat() inline.
- Ported patches to mmotm-2010-10-13-17-13.

This patch set provides the ability for each cgroup to have independent dirty
page limits.

Limiting dirty memory fixes the maximum amount of dirty (hard to reclaim) page
cache that a cgroup can use.  So, in the case of multiple cgroup writers, they
will not be able to consume more than their designated share of dirty pages and
will be forced to perform write-out if they cross that limit.

The patches are based on a series proposed by Andrea Righi in Mar 2010.

Overview:
- Add page_cgroup flags to record when pages are dirty, in writeback, or nfs
  unstable.

- Extend mem_cgroup to record the total number of pages in each of the 
  interesting dirty states (dirty, writeback, unstable_nfs).  

- Add dirty parameters similar to the system-wide /proc/sys/vm/dirty_*
  limits to mem_cgroup.  The mem_cgroup dirty parameters are accessible
  via cgroupfs control files.

- Consider both system and per-memcg dirty limits in page writeback when
  deciding to queue background writeback or block for foreground writeback.
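
As a rough illustration of that last point: struct dirty_info and
global_dirty_info() are introduced later in this series (patch 5), and a
writeback check that consults both limits could look like the sketch below.
memcg_dirty_info() is a hypothetical helper standing in for the memcg-aware
code added by the later patches, so this is an assumption-laden sketch, not
code from the series.

        /*
         * Sketch only: throttle when either the global or the per-memcg
         * dirty usage exceeds its threshold.  memcg_dirty_info() is
         * hypothetical; global_dirty_info() and struct dirty_info come
         * from patch 5 of this series.
         */
        static bool over_any_dirty_limit(struct mem_cgroup *memcg)
        {
                struct dirty_info sys_info, memcg_info;

                global_dirty_info(&sys_info);
                if (sys_info.nr_reclaimable + sys_info.nr_writeback >
                    sys_info.dirty_thresh)
                        return true;

                if (memcg && memcg_dirty_info(memcg, &memcg_info) &&
                    memcg_info.nr_reclaimable + memcg_info.nr_writeback >
                    memcg_info.dirty_thresh)
                        return true;

                return false;
        }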

Known shortcomings:
- When a cgroup dirty limit is exceeded, bdi writeback is employed to write
  back dirty inodes.  Bdi writeback considers inodes from any cgroup, not
  just inodes contributing dirty pages to the cgroup exceeding its limit.

- When memory.use_hierarchy is set, dirty limits are disabled.  This is an
  implementation detail.  An enhanced implementation is needed to check the
  chain of parents to ensure that no dirty limit is exceeded.

Performance data:
- A page fault microbenchmark workload was used to measure performance.  The
  workload can be run in read or write mode:
        f = open(foo.$cpu)
        truncate(f, 4096)
        alarm(60)
        while (1) {
                p = mmap(f, 4096)
                if (write)
                        *p = 1
                else
                        x = *p
                munmap(p)
        }
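
  For reference, a self-contained C sketch of this loop is below.  The
  per-worker file naming, error handling, and fault counting are illustrative
  assumptions rather than the exact benchmark source:

        /* Illustrative reconstruction of the page fault microbenchmark. */
        #include <fcntl.h>
        #include <signal.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <unistd.h>

        static volatile sig_atomic_t stop;

        static void on_alarm(int sig)
        {
                (void)sig;
                stop = 1;
        }

        int main(int argc, char **argv)
        {
                int write_mode = (argc > 1 && strcmp(argv[1], "write") == 0);
                unsigned long faults = 0;
                volatile char x;
                char name[64];
                int fd;

                /* "foo.$cpu" in the pseudocode; one file per worker here. */
                snprintf(name, sizeof(name), "foo.%d", (int)getpid());
                fd = open(name, O_CREAT | O_RDWR, 0600);
                if (fd < 0 || ftruncate(fd, 4096))
                        return 1;

                signal(SIGALRM, on_alarm);
                alarm(60);

                while (!stop) {
                        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, 0);
                        if (p == MAP_FAILED)
                                return 1;
                        if (write_mode)
                                *p = 1;         /* dirtying fault */
                        else
                                x = *p;         /* read-only fault */
                        munmap(p, 4096);
                        faults++;
                }
                printf("%lu faults\n", faults);
                return 0;
        }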

- The workload was run at several points in the patch series in different
  modes:
  - s_read is a single threaded reader
  - s_write is a single threaded writer
  - p_read is a 16 thread reader, each operating on a different file
  - p_write is a 16 thread writer, each operating on a different file

- Measurements were collected on a 16-core non-NUMA system using "perf stat
  --repeat 3".  The -a option was used for parallel (p_*) runs.

- All numbers are page fault rate (M/sec).  Higher is better.

- To compare the performance of a kernel without memcg, compare the first and
  last rows; neither has memcg configured.  The first row does not include any
  of these memcg patches.

- To compare the performance of using memcg dirty limits, compare the baseline
  (2nd row, titled "w/ memcg") with the code and memcg enabled (2nd to last
  row, titled "all patches").

                                 root_cgroup                    child_cgroup
                       s_read s_write p_read p_write   s_read s_write p_read p_write
mmotm w/o memcg         0.428  0.390   0.429  0.388
mmotm w/ memcg          0.411  0.378   0.391  0.362     0.412  0.377   0.385  0.363
all patches             0.384  0.360   0.370  0.348     0.381  0.363   0.368  0.347
all patches w/o memcg   0.431  0.402   0.427  0.395

Balbir Singh (1):
  memcg: CPU hotplug lockdep warning fix

Greg Thelen (9):
  memcg: add page_cgroup flags for dirty page tracking
  memcg: document cgroup dirty memory interfaces
  memcg: create extensible page stat update routines
  writeback: create dirty_info structure
  memcg: add dirty page accounting infrastructure
  memcg: add kernel calls for memcg dirty page stats
  memcg: add dirty limits to mem_cgroup
  memcg: add cgroupfs interface to memcg dirty limits
  memcg: check memcg dirty limits in page writeback

KAMEZAWA Hiroyuki (1):
  memcg: add lock to synchronize page accounting and migration

 Documentation/cgroups/memory.txt |   73 ++++++
 fs/fs-writeback.c                |    7 +-
 fs/nfs/write.c                   |    4 +
 include/linux/memcontrol.h       |   64 +++++-
 include/linux/page_cgroup.h      |   54 ++++-
 include/linux/writeback.h        |    9 +-
 mm/backing-dev.c                 |   12 +-
 mm/filemap.c                     |    1 +
 mm/memcontrol.c                  |  477 ++++++++++++++++++++++++++++++++++++--
 mm/page-writeback.c              |  135 ++++++++----
 mm/rmap.c                        |    4 +-
 mm/truncate.c                    |    1 +
 mm/vmstat.c                      |    6 +-
 13 files changed, 764 insertions(+), 83 deletions(-)

-- 
1.7.3.1





* [PATCH v4 01/11] memcg: add page_cgroup flags for dirty page tracking
  2010-10-29  7:09 ` Greg Thelen
@ 2010-10-29  7:09   ` Greg Thelen
  -1 siblings, 0 replies; 75+ messages in thread
From: Greg Thelen @ 2010-10-29  7:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes, Wu Fengguang, Greg Thelen

Add additional flags to page_cgroup to track dirty pages
within a mem_cgroup.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrea Righi <arighi@develer.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 include/linux/page_cgroup.h |   23 +++++++++++++++++++++++
 1 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 5bb13b3..b59c298 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -40,6 +40,9 @@ enum {
 	PCG_USED, /* this object is in use. */
 	PCG_ACCT_LRU, /* page has been accounted for */
 	PCG_FILE_MAPPED, /* page is accounted as "mapped" */
+	PCG_FILE_DIRTY, /* page is dirty */
+	PCG_FILE_WRITEBACK, /* page is under writeback */
+	PCG_FILE_UNSTABLE_NFS, /* page is NFS unstable */
 	PCG_MIGRATION, /* under page migration */
 };
 
@@ -59,6 +62,10 @@ static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
 static inline int TestClearPageCgroup##uname(struct page_cgroup *pc)	\
 	{ return test_and_clear_bit(PCG_##lname, &pc->flags);  }
 
+#define TESTSETPCGFLAG(uname, lname)			\
+static inline int TestSetPageCgroup##uname(struct page_cgroup *pc)	\
+	{ return test_and_set_bit(PCG_##lname, &pc->flags);  }
+
 TESTPCGFLAG(Locked, LOCK)
 
 /* Cache flag is set only once (at allocation) */
@@ -80,6 +87,22 @@ SETPCGFLAG(FileMapped, FILE_MAPPED)
 CLEARPCGFLAG(FileMapped, FILE_MAPPED)
 TESTPCGFLAG(FileMapped, FILE_MAPPED)
 
+SETPCGFLAG(FileDirty, FILE_DIRTY)
+CLEARPCGFLAG(FileDirty, FILE_DIRTY)
+TESTPCGFLAG(FileDirty, FILE_DIRTY)
+TESTCLEARPCGFLAG(FileDirty, FILE_DIRTY)
+TESTSETPCGFLAG(FileDirty, FILE_DIRTY)
+
+SETPCGFLAG(FileWriteback, FILE_WRITEBACK)
+CLEARPCGFLAG(FileWriteback, FILE_WRITEBACK)
+TESTPCGFLAG(FileWriteback, FILE_WRITEBACK)
+
+SETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+CLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+TESTPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+TESTCLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+TESTSETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+
 SETPCGFLAG(Migration, MIGRATION)
 CLEARPCGFLAG(Migration, MIGRATION)
 TESTPCGFLAG(Migration, MIGRATION)
-- 
1.7.3.1
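
The SETPCGFLAG()/TESTSETPCGFLAG()-style macros above generate accessors such
as SetPageCgroupFileDirty() and TestClearPageCgroupFileDirty().  A minimal
sketch of how the dirty accounting added later in the series could use them
follows; the two helpers are illustrative, not functions from the series:

        /*
         * Sketch only: the test-and-set/test-and-clear forms let a caller
         * update the memcg dirty counter exactly once per transition.
         */
        static bool page_cgroup_set_dirty(struct page_cgroup *pc)
        {
                /* true if the page was not already accounted as dirty */
                return !TestSetPageCgroupFileDirty(pc);
        }

        static bool page_cgroup_clear_dirty(struct page_cgroup *pc)
        {
                /* true if the page was accounted as dirty */
                return TestClearPageCgroupFileDirty(pc);
        }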



* [PATCH v4 02/11] memcg: document cgroup dirty memory interfaces
  2010-10-29  7:09 ` Greg Thelen
@ 2010-10-29  7:09   ` Greg Thelen
  -1 siblings, 0 replies; 75+ messages in thread
From: Greg Thelen @ 2010-10-29  7:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes, Wu Fengguang, Greg Thelen

Document cgroup dirty memory interfaces and statistics.

Signed-off-by: Andrea Righi <arighi@develer.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
---
Changelog since v3:
- Described interactions with memory.use_hierarchy.
- Added description of total_dirty, total_writeback, and total_nfs_unstable.

Changelog since v1:
- Renamed "nfs"/"total_nfs" to "nfs_unstable"/"total_nfs_unstable" in per cgroup
  memory.stat to match /proc/meminfo.

- Allow [kKmMgG] suffixes for newly created dirty limit value cgroupfs files.

- Describe a situation where a cgroup can exceed its dirty limit.

 Documentation/cgroups/memory.txt |   73 ++++++++++++++++++++++++++++++++++++++
 1 files changed, 73 insertions(+), 0 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 7781857..a3861f3 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -385,6 +385,10 @@ mapped_file	- # of bytes of mapped file (includes tmpfs/shmem)
 pgpgin		- # of pages paged in (equivalent to # of charging events).
 pgpgout		- # of pages paged out (equivalent to # of uncharging events).
 swap		- # of bytes of swap usage
+dirty		- # of bytes that are waiting to get written back to the disk.
+writeback	- # of bytes that are actively being written back to the disk.
+nfs_unstable	- # of bytes sent to the NFS server, but not yet committed to
+		the actual storage.
 inactive_anon	- # of bytes of anonymous memory and swap cache memory on
 		LRU list.
 active_anon	- # of bytes of anonymous and swap cache memory on active
@@ -406,6 +410,9 @@ total_mapped_file	- sum of all children's "cache"
 total_pgpgin		- sum of all children's "pgpgin"
 total_pgpgout		- sum of all children's "pgpgout"
 total_swap		- sum of all children's "swap"
+total_dirty		- sum of all children's "dirty"
+total_writeback		- sum of all children's "writeback"
+total_nfs_unstable	- sum of all children's "nfs_unstable"
 total_inactive_anon	- sum of all children's "inactive_anon"
 total_active_anon	- sum of all children's "active_anon"
 total_inactive_file	- sum of all children's "inactive_file"
@@ -453,6 +460,72 @@ memory under it will be reclaimed.
 You can reset failcnt by writing 0 to failcnt file.
 # echo 0 > .../memory.failcnt
 
+5.5 dirty memory
+
+Control the maximum amount of dirty pages a cgroup can have at any given time.
+
+Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim)
+page cache used by a cgroup.  So, in case of multiple cgroup writers, they will
+not be able to consume more than their designated share of dirty pages and will
+be forced to perform write-out if they cross that limit.
+
+The interface is equivalent to the procfs interface: /proc/sys/vm/dirty_*.  It
+is possible to configure a limit to trigger either direct writeback or
+background writeback performed by per-bdi flusher threads.  The root cgroup
+memory.dirty_* control files are read-only and match the contents of
+the /proc/sys/vm/dirty_* files.
+
+Per-cgroup dirty limits can be set using the following files in the cgroupfs:
+
+- memory.dirty_ratio: the amount of dirty memory (expressed as a percentage of
+  cgroup memory) at which a process generating dirty pages will itself start
+  writing out dirty data.
+
+- memory.dirty_limit_in_bytes: the amount of dirty memory (expressed in bytes)
+  in the cgroup at which a process generating dirty pages will itself start
+  writing out dirty data.  A suffix (k, K, m, M, g, or G) can be used to
+  indicate that the value is in kilobytes, megabytes, or gigabytes.
+
+  Note: memory.dirty_limit_in_bytes is the counterpart of memory.dirty_ratio.
+  Only one of them may be specified at a time.  When one is written it is
+  immediately taken into account to evaluate the dirty memory limits and the
+  other appears as 0 when read.
+
+- memory.dirty_background_ratio: the amount of dirty memory of the cgroup
+  (expressed as a percentage of cgroup memory) at which background writeback
+  kernel threads will start writing out dirty data.
+
+- memory.dirty_background_limit_in_bytes: the amount of dirty memory (expressed
+  in bytes) in the cgroup at which background writeback kernel threads will
+  start writing out dirty data.  A suffix (k, K, m, M, g, or G) can be used to
+  indicate that the value is in kilobytes, megabytes, or gigabytes.
+
+  Note: memory.dirty_background_limit_in_bytes is the counterpart of
+  memory.dirty_background_ratio.  Only one of them may be specified at a time.
+  When one is written it is immediately taken into account to evaluate the dirty
+  memory limits and the other appears as 0 when read.
+
+A cgroup may contain more dirty memory than its dirty limit.  This is possible
+because of the principle that the first cgroup to touch a page is charged for
+it.  Subsequent page counting events (dirty, writeback, nfs_unstable) are also
+counted to the originally charged cgroup.
+
+Example: If page is allocated by a cgroup A task, then the page is charged to
+cgroup A.  If the page is later dirtied by a task in cgroup B, then the cgroup A
+dirty count will be incremented.  If cgroup A is over its dirty limit but cgroup
+B is not, then dirtying a cgroup A page from a cgroup B task may push cgroup A
+over its dirty limit without throttling the dirtying cgroup B task.
+
+When use_hierarchy=0, each cgroup has dirty memory usage and limits.
+System-wide dirty limits are also consulted.  Dirty memory consumption is
+checked against both system-wide and per-cgroup dirty limits.
+
+The current implementation does not enforce per-cgroup dirty limits when
+use_hierarchy=1.  System-wide dirty limits are used for processes in such
+cgroups.  Attempts to read memory.dirty_* files return the system-wide values.
+Writes to the memory.dirty_* files return error.  An enhanced implementation is
+needed to check the chain of parents to ensure that no dirty limit is exceeded.
+
 6. Hierarchy support
 
 The memory controller supports a deep hierarchy and hierarchical accounting.
-- 
1.7.3.1



* [PATCH v4 03/11] memcg: create extensible page stat update routines
  2010-10-29  7:09 ` Greg Thelen
@ 2010-10-29  7:09   ` Greg Thelen
  -1 siblings, 0 replies; 75+ messages in thread
From: Greg Thelen @ 2010-10-29  7:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes, Wu Fengguang, Greg Thelen

Replace usage of the mem_cgroup_update_file_mapped() memcg
statistic update routine with two new routines:
* mem_cgroup_inc_page_stat()
* mem_cgroup_dec_page_stat()

As before, only the file_mapped statistic is managed.  However,
these more general interfaces allow for new statistics to be
more easily added.  New statistics are added with memcg dirty
page accounting.

Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrea Righi <arighi@develer.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
Changelog since v1:
- Rename (for clarity):
  - mem_cgroup_write_page_stat_item -> mem_cgroup_page_stat_item
  - mem_cgroup_read_page_stat_item -> mem_cgroup_nr_pages_item

 include/linux/memcontrol.h |   31 ++++++++++++++++++++++++++++---
 mm/memcontrol.c            |   16 +++++++---------
 mm/rmap.c                  |    4 ++--
 3 files changed, 37 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 159a076..067115c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -25,6 +25,11 @@ struct page_cgroup;
 struct page;
 struct mm_struct;
 
+/* Stats that can be updated by kernel. */
+enum mem_cgroup_page_stat_item {
+	MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
+};
+
 extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
 					struct list_head *dst,
 					unsigned long *scanned, int order,
@@ -121,7 +126,22 @@ static inline bool mem_cgroup_disabled(void)
 	return false;
 }
 
-void mem_cgroup_update_file_mapped(struct page *page, int val);
+void mem_cgroup_update_page_stat(struct page *page,
+				 enum mem_cgroup_page_stat_item idx,
+				 int val);
+
+static inline void mem_cgroup_inc_page_stat(struct page *page,
+					    enum mem_cgroup_page_stat_item idx)
+{
+	mem_cgroup_update_page_stat(page, idx, 1);
+}
+
+static inline void mem_cgroup_dec_page_stat(struct page *page,
+					    enum mem_cgroup_page_stat_item idx)
+{
+	mem_cgroup_update_page_stat(page, idx, -1);
+}
+
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask);
 u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
@@ -293,8 +313,13 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
 {
 }
 
-static inline void mem_cgroup_update_file_mapped(struct page *page,
-							int val)
+static inline void mem_cgroup_inc_page_stat(struct page *page,
+					    enum mem_cgroup_page_stat_item idx)
+{
+}
+
+static inline void mem_cgroup_dec_page_stat(struct page *page,
+					    enum mem_cgroup_page_stat_item idx)
 {
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9a99cfa..4fd00c4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1592,7 +1592,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
  * possibility of race condition. If there is, we take a lock.
  */
 
-static void mem_cgroup_update_file_stat(struct page *page, int idx, int val)
+void mem_cgroup_update_page_stat(struct page *page,
+				 enum mem_cgroup_page_stat_item idx, int val)
 {
 	struct mem_cgroup *mem;
 	struct page_cgroup *pc = lookup_page_cgroup(page);
@@ -1615,30 +1616,27 @@ static void mem_cgroup_update_file_stat(struct page *page, int idx, int val)
 			goto out;
 	}
 
-	this_cpu_add(mem->stat->count[idx], val);
-
 	switch (idx) {
-	case MEM_CGROUP_STAT_FILE_MAPPED:
+	case MEMCG_NR_FILE_MAPPED:
 		if (val > 0)
 			SetPageCgroupFileMapped(pc);
 		else if (!page_mapped(page))
 			ClearPageCgroupFileMapped(pc);
+		idx = MEM_CGROUP_STAT_FILE_MAPPED;
 		break;
 	default:
 		BUG();
 	}
 
+	this_cpu_add(mem->stat->count[idx], val);
+
 out:
 	if (unlikely(need_unlock))
 		unlock_page_cgroup(pc);
 	rcu_read_unlock();
 	return;
 }
-
-void mem_cgroup_update_file_mapped(struct page *page, int val)
-{
-	mem_cgroup_update_file_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, val);
-}
+EXPORT_SYMBOL(mem_cgroup_update_page_stat);
 
 /*
  * size of first charge trial. "32" comes from vmscan.c's magic value.
diff --git a/mm/rmap.c b/mm/rmap.c
index 1a8bf76..a66ab76 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -911,7 +911,7 @@ void page_add_file_rmap(struct page *page)
 {
 	if (atomic_inc_and_test(&page->_mapcount)) {
 		__inc_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_update_file_mapped(page, 1);
+		mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
 	}
 }
 
@@ -949,7 +949,7 @@ void page_remove_rmap(struct page *page)
 		__dec_zone_page_state(page, NR_ANON_PAGES);
 	} else {
 		__dec_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_update_file_mapped(page, -1);
+		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
 	}
 	/*
 	 * It would be tidy to reset the PageAnon mapping here,
-- 
1.7.3.1
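
To show how the interface extends, here is a sketch of adding a new
statistic.  MEMCG_NR_FILE_DIRTY and the two call sites are illustrative
only; the real enum value and hooks are wired up by the dirty accounting
patches later in this series:

        /* Illustrative addition to enum mem_cgroup_page_stat_item: */
        enum mem_cgroup_page_stat_item {
                MEMCG_NR_FILE_MAPPED,   /* # of pages charged as file rss */
                MEMCG_NR_FILE_DIRTY,    /* hypothetical: # of dirty file pages */
        };

        /* A dirtying path would then increment the new counter... */
        static void account_page_dirtied_in_memcg(struct page *page)
        {
                mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_DIRTY);
        }

        /* ...and the matching cleaning path would decrement it. */
        static void account_page_cleaned_in_memcg(struct page *page)
        {
                mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
        }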



* [PATCH v4 04/11] memcg: add lock to synchronize page accounting and migration
  2010-10-29  7:09 ` Greg Thelen
@ 2010-10-29  7:09   ` Greg Thelen
  -1 siblings, 0 replies; 75+ messages in thread
From: Greg Thelen @ 2010-10-29  7:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes, Wu Fengguang, Greg Thelen

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Introduce a new bit spin lock, PCG_MOVE_LOCK, to synchronize
the page accounting and migration code.  This reworks the
locking scheme of _update_stat() and _move_account() by
adding new lock bit PCG_MOVE_LOCK, which is always taken
under IRQ disable.

1. If pages are being migrated from a memcg, then updates to
   that memcg page statistics are protected by grabbing
   PCG_MOVE_LOCK using move_lock_page_cgroup().  In an
   upcoming commit, memcg dirty page accounting will be
   updating memcg page accounting (specifically: num
   writeback pages) from IRQ context (softirq).  Avoid a
   deadlocking nested spin lock attempt by disabling irq on
   the local processor when grabbing the PCG_MOVE_LOCK.

2. The lock for update_page_stat is only needed to avoid a race
   with move_account().  So, IRQ awareness of
   lock_page_cgroup() itself is not a problem.  The race
   is between mem_cgroup_update_page_stat() and
   mem_cgroup_move_account_page().

Trade-off:
  * Changing lock_page_cgroup() to always disable IRQ (or
    local_bh) has some impact on performance, and I think
    it's bad to disable IRQ when it's not necessary.
  * Adding a new lock makes move_account() slower.  Scores
    are below.

Performance Impact: moving an 8G anon process.

Before:
	real    0m0.792s
	user    0m0.000s
	sys     0m0.780s

After:
	real    0m0.854s
	user    0m0.000s
	sys     0m0.842s

This score is bad, but planned optimization patches can reduce
this impact.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 include/linux/page_cgroup.h |   31 ++++++++++++++++++++++++++++---
 mm/memcontrol.c             |    9 +++++++--
 2 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index b59c298..509452e 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -35,15 +35,18 @@ struct page_cgroup *lookup_page_cgroup(struct page *page);
 
 enum {
 	/* flags for mem_cgroup */
-	PCG_LOCK,  /* page cgroup is locked */
+	PCG_LOCK,  /* Lock for pc->mem_cgroup and following bits. */
 	PCG_CACHE, /* charged as cache */
 	PCG_USED, /* this object is in use. */
-	PCG_ACCT_LRU, /* page has been accounted for */
+	PCG_MIGRATION, /* under page migration */
+	/* flags for mem_cgroup and file and I/O status */
+	PCG_MOVE_LOCK, /* For race between move_account v.s. following bits */
 	PCG_FILE_MAPPED, /* page is accounted as "mapped" */
 	PCG_FILE_DIRTY, /* page is dirty */
 	PCG_FILE_WRITEBACK, /* page is under writeback */
 	PCG_FILE_UNSTABLE_NFS, /* page is NFS unstable */
-	PCG_MIGRATION, /* under page migration */
+	/* No lock in page_cgroup */
+	PCG_ACCT_LRU, /* page has been accounted for (under lru_lock) */
 };
 
 #define TESTPCGFLAG(uname, lname)			\
@@ -119,6 +122,10 @@ static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
 
 static inline void lock_page_cgroup(struct page_cgroup *pc)
 {
+	/*
+	 * Don't take this lock in IRQ context.
+	 * This lock is for pc->mem_cgroup, USED, CACHE, MIGRATION
+	 */
 	bit_spin_lock(PCG_LOCK, &pc->flags);
 }
 
@@ -127,6 +134,24 @@ static inline void unlock_page_cgroup(struct page_cgroup *pc)
 	bit_spin_unlock(PCG_LOCK, &pc->flags);
 }
 
+static inline void move_lock_page_cgroup(struct page_cgroup *pc,
+	unsigned long *flags)
+{
+	/*
+	 * We know updates to pc->flags of page cache's stats are from both of
+	 * usual context or IRQ context. Disable IRQ to avoid deadlock.
+	 */
+	local_irq_save(*flags);
+	bit_spin_lock(PCG_MOVE_LOCK, &pc->flags);
+}
+
+static inline void move_unlock_page_cgroup(struct page_cgroup *pc,
+	unsigned long *flags)
+{
+	bit_spin_unlock(PCG_MOVE_LOCK, &pc->flags);
+	local_irq_restore(*flags);
+}
+
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct page_cgroup;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4fd00c4..94359d6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1598,6 +1598,7 @@ void mem_cgroup_update_page_stat(struct page *page,
 	struct mem_cgroup *mem;
 	struct page_cgroup *pc = lookup_page_cgroup(page);
 	bool need_unlock = false;
+	unsigned long uninitialized_var(flags);
 
 	if (unlikely(!pc))
 		return;
@@ -1609,7 +1610,7 @@ void mem_cgroup_update_page_stat(struct page *page,
 	/* pc->mem_cgroup is unstable ? */
 	if (unlikely(mem_cgroup_stealed(mem))) {
 		/* take a lock against to access pc->mem_cgroup */
-		lock_page_cgroup(pc);
+		move_lock_page_cgroup(pc, &flags);
 		need_unlock = true;
 		mem = pc->mem_cgroup;
 		if (!mem || !PageCgroupUsed(pc))
@@ -1632,7 +1633,7 @@ void mem_cgroup_update_page_stat(struct page *page,
 
 out:
 	if (unlikely(need_unlock))
-		unlock_page_cgroup(pc);
+		move_unlock_page_cgroup(pc, &flags);
 	rcu_read_unlock();
 	return;
 }
@@ -2186,9 +2187,13 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
 		struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
 {
 	int ret = -EINVAL;
+	unsigned long flags;
+
 	lock_page_cgroup(pc);
 	if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
+		move_lock_page_cgroup(pc, &flags);
 		__mem_cgroup_move_account(pc, from, to, uncharge);
+		move_unlock_page_cgroup(pc, &flags);
 		ret = 0;
 	}
 	unlock_page_cgroup(pc);
-- 
1.7.3.1
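
The resulting locking rule is that statistics which may be updated from
softirq context (e.g. writeback completion) take the new move lock, while
charge/uncharge paths keep using lock_page_cgroup().  A minimal sketch of a
stat updater racing against move_account() is below; the function itself is
illustrative, not code from the series:

        /*
         * Sketch only: move_lock_page_cgroup() disables local IRQs, so this
         * is safe even when called from softirq context.
         */
        static void update_stat_against_move(struct page_cgroup *pc,
                                             int idx, int val)
        {
                struct mem_cgroup *mem;
                unsigned long flags;

                move_lock_page_cgroup(pc, &flags);
                mem = pc->mem_cgroup;
                if (mem && PageCgroupUsed(pc))
                        this_cpu_add(mem->stat->count[idx], val);
                move_unlock_page_cgroup(pc, &flags);
        }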



* [PATCH v4 05/11] writeback: create dirty_info structure
  2010-10-29  7:09 ` Greg Thelen
@ 2010-10-29  7:09   ` Greg Thelen
  -1 siblings, 0 replies; 75+ messages in thread
From: Greg Thelen @ 2010-10-29  7:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes, Wu Fengguang, Greg Thelen

Bundle dirty limits and dirty memory usage metrics into a dirty_info
structure to simplify the interfaces of routines that need all of them.

Signed-off-by: Greg Thelen <gthelen@google.com>
---
Changelog since v3:
- This is a new patch in v4.

 fs/fs-writeback.c         |    7 ++---
 include/linux/writeback.h |    9 +++++++-
 mm/backing-dev.c          |   12 +++++-----
 mm/page-writeback.c       |   52 ++++++++++++++++++++++++--------------------
 mm/vmstat.c               |    6 +++-
 5 files changed, 49 insertions(+), 37 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 9e46aec..1c27bb9 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -577,12 +577,11 @@ static void __writeback_inodes_sb(struct super_block *sb,
 
 static inline bool over_bground_thresh(void)
 {
-	unsigned long background_thresh, dirty_thresh;
+	struct dirty_info info;
 
-	global_dirty_limits(&background_thresh, &dirty_thresh);
+	global_dirty_info(&info);
 
-	return (global_page_state(NR_FILE_DIRTY) +
-		global_page_state(NR_UNSTABLE_NFS) > background_thresh);
+	return info.nr_reclaimable > info.background_thresh;
 }
 
 /*
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index c7299d2..ab23a73 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -84,6 +84,13 @@ static inline void inode_sync_wait(struct inode *inode)
 /*
  * mm/page-writeback.c
  */
+struct dirty_info {
+	unsigned long dirty_thresh;
+	unsigned long background_thresh;
+	unsigned long nr_reclaimable;
+	unsigned long nr_writeback;
+};
+
 #ifdef CONFIG_BLOCK
 void laptop_io_completion(struct backing_dev_info *info);
 void laptop_sync_completion(void);
@@ -124,7 +131,7 @@ struct ctl_table;
 int dirty_writeback_centisecs_handler(struct ctl_table *, int,
 				      void __user *, size_t *, loff_t *);
 
-void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
+void global_dirty_info(struct dirty_info *info);
 unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
 			       unsigned long dirty);
 
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index f2eb278..b3a50d2 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -66,8 +66,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 {
 	struct backing_dev_info *bdi = m->private;
 	struct bdi_writeback *wb = &bdi->wb;
-	unsigned long background_thresh;
-	unsigned long dirty_thresh;
+	struct dirty_info dirty_info;
 	unsigned long bdi_thresh;
 	unsigned long nr_dirty, nr_io, nr_more_io, nr_wb;
 	struct inode *inode;
@@ -82,8 +81,8 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 		nr_more_io++;
 	spin_unlock(&inode_lock);
 
-	global_dirty_limits(&background_thresh, &dirty_thresh);
-	bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
+	global_dirty_info(&dirty_info);
+	bdi_thresh = bdi_dirty_limit(bdi, dirty_info.dirty_thresh);
 
 #define K(x) ((x) << (PAGE_SHIFT - 10))
 	seq_printf(m,
@@ -99,8 +98,9 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 		   "state:            %8lx\n",
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
 		   (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
-		   K(bdi_thresh), K(dirty_thresh),
-		   K(background_thresh), nr_dirty, nr_io, nr_more_io,
+		   K(bdi_thresh), K(dirty_info.dirty_thresh),
+		   K(dirty_info.background_thresh),
+		   nr_dirty, nr_io, nr_more_io,
 		   !list_empty(&bdi->bdi_list), bdi->state);
 #undef K
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b840afa..722bd61 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -398,7 +398,8 @@ unsigned long determine_dirtyable_memory(void)
 }
 
 /*
- * global_dirty_limits - background-writeback and dirty-throttling thresholds
+ * global_dirty_info - return background-writeback and dirty-throttling
+ * thresholds as well as dirty usage metrics.
  *
  * Calculate the dirty thresholds based on sysctl parameters
  * - vm.dirty_background_ratio  or  vm.dirty_background_bytes
@@ -406,7 +407,7 @@ unsigned long determine_dirtyable_memory(void)
  * The dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and
  * runtime tasks.
  */
-void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
+void global_dirty_info(struct dirty_info *info)
 {
 	unsigned long background;
 	unsigned long dirty;
@@ -423,6 +424,10 @@ void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
 	else
 		background = (dirty_background_ratio * available_memory) / 100;
 
+	info->nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
+				global_page_state(NR_UNSTABLE_NFS);
+	info->nr_writeback = global_page_state(NR_WRITEBACK);
+
 	if (background >= dirty)
 		background = dirty / 2;
 	tsk = current;
@@ -430,8 +435,8 @@ void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
 		background += background / 4;
 		dirty += dirty / 4;
 	}
-	*pbackground = background;
-	*pdirty = dirty;
+	info->background_thresh = background;
+	info->dirty_thresh = dirty;
 }
 
 /*
@@ -475,10 +480,9 @@ unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty)
 static void balance_dirty_pages(struct address_space *mapping,
 				unsigned long write_chunk)
 {
-	long nr_reclaimable, bdi_nr_reclaimable;
-	long nr_writeback, bdi_nr_writeback;
-	unsigned long background_thresh;
-	unsigned long dirty_thresh;
+	struct dirty_info dirty_info;
+	long bdi_nr_reclaimable;
+	long bdi_nr_writeback;
 	unsigned long bdi_thresh;
 	unsigned long pages_written = 0;
 	unsigned long pause = 1;
@@ -493,22 +497,19 @@ static void balance_dirty_pages(struct address_space *mapping,
 			.range_cyclic	= 1,
 		};
 
-		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
-					global_page_state(NR_UNSTABLE_NFS);
-		nr_writeback = global_page_state(NR_WRITEBACK);
-
-		global_dirty_limits(&background_thresh, &dirty_thresh);
+		global_dirty_info(&dirty_info);
 
 		/*
 		 * Throttle it only when the background writeback cannot
 		 * catch-up. This avoids (excessively) small writeouts
 		 * when the bdi limits are ramping up.
 		 */
-		if (nr_reclaimable + nr_writeback <=
-				(background_thresh + dirty_thresh) / 2)
+		if (dirty_info.nr_reclaimable + dirty_info.nr_writeback <=
+				(dirty_info.background_thresh +
+				 dirty_info.dirty_thresh) / 2)
 			break;
 
-		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
+		bdi_thresh = bdi_dirty_limit(bdi, dirty_info.dirty_thresh);
 		bdi_thresh = task_dirty_limit(current, bdi_thresh);
 
 		/*
@@ -537,7 +538,9 @@ static void balance_dirty_pages(struct address_space *mapping,
 		 */
 		dirty_exceeded =
 			(bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
-			|| (nr_reclaimable + nr_writeback > dirty_thresh);
+			|| (dirty_info.nr_reclaimable +
+			    dirty_info.nr_writeback >
+			    dirty_info.dirty_thresh);
 
 		if (!dirty_exceeded)
 			break;
@@ -590,7 +593,8 @@ static void balance_dirty_pages(struct address_space *mapping,
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
 	if ((laptop_mode && pages_written) ||
-	    (!laptop_mode && (nr_reclaimable > background_thresh)))
+	    (!laptop_mode && (dirty_info.nr_reclaimable >
+			      dirty_info.background_thresh)))
 		bdi_start_background_writeback(bdi);
 }
 
@@ -650,21 +654,21 @@ EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
 
 void throttle_vm_writeout(gfp_t gfp_mask)
 {
-	unsigned long background_thresh;
-	unsigned long dirty_thresh;
+	struct dirty_info dirty_info;
 
         for ( ; ; ) {
-		global_dirty_limits(&background_thresh, &dirty_thresh);
+		global_dirty_info(&dirty_info);
 
                 /*
                  * Boost the allowable dirty threshold a bit for page
                  * allocators so they don't get DoS'ed by heavy writers
                  */
-                dirty_thresh += dirty_thresh / 10;      /* wheeee... */
+		dirty_info.dirty_thresh +=
+			dirty_info.dirty_thresh / 10;      /* wheeee... */
 
                 if (global_page_state(NR_UNSTABLE_NFS) +
-			global_page_state(NR_WRITEBACK) <= dirty_thresh)
-                        	break;
+		    global_page_state(NR_WRITEBACK) <= dirty_info.dirty_thresh)
+			break;
                 congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index cd2e42b..de4d415 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -922,6 +922,7 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
 {
 	unsigned long *v;
 	int i, stat_items_size;
+	struct dirty_info dirty_info;
 
 	if (*pos >= ARRAY_SIZE(vmstat_text))
 		return NULL;
@@ -940,8 +941,9 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
 		v[i] = global_page_state(i);
 	v += NR_VM_ZONE_STAT_ITEMS;
 
-	global_dirty_limits(v + NR_DIRTY_BG_THRESHOLD,
-			    v + NR_DIRTY_THRESHOLD);
+	global_dirty_info(&dirty_info);
+	v[NR_DIRTY_BG_THRESHOLD] = dirty_info.background_thresh;
+	v[NR_DIRTY_THRESHOLD] = dirty_info.dirty_thresh;
 	v += NR_VM_WRITEBACK_STAT_ITEMS;
 
 #ifdef CONFIG_VM_EVENT_COUNTERS
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH v4 06/11] memcg: add dirty page accounting infrastructure
  2010-10-29  7:09 ` Greg Thelen
@ 2010-10-29  7:09   ` Greg Thelen
  -1 siblings, 0 replies; 75+ messages in thread
From: Greg Thelen @ 2010-10-29  7:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes, Wu Fengguang, Greg Thelen

Add memcg routines to track dirty, writeback, and unstable_NFS pages.
These routines are not yet used by the kernel to count such pages.
A later change adds kernel calls to these new routines.
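
As a rough sketch of the intended usage (the helper below is hypothetical;
the real call sites arrive in a later patch, and mem_cgroup_inc_page_stat()
is the wrapper around mem_cgroup_update_page_stat() added earlier in the
series), a path that dirties a page cache page pairs the memcg update with
the existing zone counter update:

	/* Hypothetical illustration; not part of this patch. */
	static void example_account_dirtied(struct page *page,
					    struct address_space *mapping)
	{
		if (mapping_cap_account_dirty(mapping)) {
			mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_DIRTY);
			__inc_zone_page_state(page, NR_FILE_DIRTY);
		}
	}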

Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrea Righi <arighi@develer.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
Changelog since v1:
- Renamed "nfs"/"total_nfs" to "nfs_unstable"/"total_nfs_unstable" in per cgroup
  memory.stat to match /proc/meminfo.
- Rename (for clarity):
  - mem_cgroup_write_page_stat_item -> mem_cgroup_page_stat_item
  - mem_cgroup_read_page_stat_item -> mem_cgroup_nr_pages_item
- Remove redundant comments.
- Made mem_cgroup_move_account_page_stat() inline.

 include/linux/memcontrol.h |    3 ++
 mm/memcontrol.c            |   86 +++++++++++++++++++++++++++++++++++++++----
 2 files changed, 81 insertions(+), 8 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 067115c..ef2eec7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -28,6 +28,9 @@ struct mm_struct;
 /* Stats that can be updated by kernel. */
 enum mem_cgroup_page_stat_item {
 	MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
+	MEMCG_NR_FILE_DIRTY, /* # of dirty pages in page cache */
+	MEMCG_NR_FILE_WRITEBACK, /* # of pages under writeback */
+	MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
 };
 
 extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 94359d6..7f91029 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -85,10 +85,13 @@ enum mem_cgroup_stat_index {
 	 */
 	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
 	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
-	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
 	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
 	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
 	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
+	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
+	MEM_CGROUP_STAT_FILE_DIRTY,	/* # of dirty pages in page cache */
+	MEM_CGROUP_STAT_FILE_WRITEBACK,		/* # of pages under writeback */
+	MEM_CGROUP_STAT_FILE_UNSTABLE_NFS,	/* # of NFS unstable pages */
 	MEM_CGROUP_STAT_DATA, /* end of data requires synchronization */
 	/* incremented at every  pagein/pageout */
 	MEM_CGROUP_EVENTS = MEM_CGROUP_STAT_DATA,
@@ -1625,6 +1628,44 @@ void mem_cgroup_update_page_stat(struct page *page,
 			ClearPageCgroupFileMapped(pc);
 		idx = MEM_CGROUP_STAT_FILE_MAPPED;
 		break;
+
+	case MEMCG_NR_FILE_DIRTY:
+		/* Use Test{Set,Clear} to only un/charge the memcg once. */
+		if (val > 0) {
+			if (TestSetPageCgroupFileDirty(pc))
+				val = 0;
+		} else {
+			if (!TestClearPageCgroupFileDirty(pc))
+				val = 0;
+		}
+		idx = MEM_CGROUP_STAT_FILE_DIRTY;
+		break;
+
+	case MEMCG_NR_FILE_WRITEBACK:
+		/*
+		 * This counter is adjusted while holding the mapping's
+		 * tree_lock.  Therefore there is no race between setting and
+		 * clearing of this flag.
+		 */
+		if (val > 0)
+			SetPageCgroupFileWriteback(pc);
+		else
+			ClearPageCgroupFileWriteback(pc);
+		idx = MEM_CGROUP_STAT_FILE_WRITEBACK;
+		break;
+
+	case MEMCG_NR_FILE_UNSTABLE_NFS:
+		/* Use Test{Set,Clear} to only un/charge the memcg once. */
+		if (val > 0) {
+			if (TestSetPageCgroupFileUnstableNFS(pc))
+				val = 0;
+		} else {
+			if (!TestClearPageCgroupFileUnstableNFS(pc))
+				val = 0;
+		}
+		idx = MEM_CGROUP_STAT_FILE_UNSTABLE_NFS;
+		break;
+
 	default:
 		BUG();
 	}
@@ -2129,6 +2170,17 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
 	memcg_check_events(mem, pc->page);
 }
 
+static inline
+void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
+				       struct mem_cgroup *to,
+				       enum mem_cgroup_stat_index idx)
+{
+	preempt_disable();
+	__this_cpu_dec(from->stat->count[idx]);
+	__this_cpu_inc(to->stat->count[idx]);
+	preempt_enable();
+}
+
 /**
  * __mem_cgroup_move_account - move account of the page
  * @pc:	page_cgroup of the page.
@@ -2155,13 +2207,18 @@ static void __mem_cgroup_move_account(struct page_cgroup *pc,
 	VM_BUG_ON(!PageCgroupUsed(pc));
 	VM_BUG_ON(pc->mem_cgroup != from);
 
-	if (PageCgroupFileMapped(pc)) {
-		/* Update mapped_file data for mem_cgroup */
-		preempt_disable();
-		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		preempt_enable();
-	}
+	if (PageCgroupFileMapped(pc))
+		mem_cgroup_move_account_page_stat(from, to,
+					MEM_CGROUP_STAT_FILE_MAPPED);
+	if (PageCgroupFileDirty(pc))
+		mem_cgroup_move_account_page_stat(from, to,
+					MEM_CGROUP_STAT_FILE_DIRTY);
+	if (PageCgroupFileWriteback(pc))
+		mem_cgroup_move_account_page_stat(from, to,
+					MEM_CGROUP_STAT_FILE_WRITEBACK);
+	if (PageCgroupFileUnstableNFS(pc))
+		mem_cgroup_move_account_page_stat(from, to,
+					MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
 	mem_cgroup_charge_statistics(from, pc, false);
 	if (uncharge)
 		/* This is not "cancel", but cancel_charge does all we need. */
@@ -3540,6 +3597,9 @@ enum {
 	MCS_PGPGIN,
 	MCS_PGPGOUT,
 	MCS_SWAP,
+	MCS_FILE_DIRTY,
+	MCS_WRITEBACK,
+	MCS_UNSTABLE_NFS,
 	MCS_INACTIVE_ANON,
 	MCS_ACTIVE_ANON,
 	MCS_INACTIVE_FILE,
@@ -3562,6 +3622,9 @@ struct {
 	{"pgpgin", "total_pgpgin"},
 	{"pgpgout", "total_pgpgout"},
 	{"swap", "total_swap"},
+	{"dirty", "total_dirty"},
+	{"writeback", "total_writeback"},
+	{"nfs_unstable", "total_nfs_unstable"},
 	{"inactive_anon", "total_inactive_anon"},
 	{"active_anon", "total_active_anon"},
 	{"inactive_file", "total_inactive_file"},
@@ -3591,6 +3654,13 @@ mem_cgroup_get_local_stat(struct mem_cgroup *mem, struct mcs_total_stat *s)
 		s->stat[MCS_SWAP] += val * PAGE_SIZE;
 	}
 
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
+	s->stat[MCS_FILE_DIRTY] += val * PAGE_SIZE;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_WRITEBACK);
+	s->stat[MCS_WRITEBACK] += val * PAGE_SIZE;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
+	s->stat[MCS_UNSTABLE_NFS] += val * PAGE_SIZE;
+
 	/* per zone stat */
 	val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
 	s->stat[MCS_INACTIVE_ANON] += val * PAGE_SIZE;
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH v4 07/11] memcg: add kernel calls for memcg dirty page stats
  2010-10-29  7:09 ` Greg Thelen
@ 2010-10-29  7:09   ` Greg Thelen
  -1 siblings, 0 replies; 75+ messages in thread
From: Greg Thelen @ 2010-10-29  7:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes, Wu Fengguang, Greg Thelen

Add calls into memcg dirty page accounting.  Notify memcg when pages
transition between clean, file dirty, writeback, and unstable nfs.
This allows the memory controller to maintain an accurate view of
the amount of its memory that is dirty.
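
For the writeback transition in particular, hook placement matters: the
memcg counter changes only when the PageWriteback bit actually transitions,
and it does so under mapping->tree_lock, which is what lets the previous
patch treat the per-page writeback flag as race-free.  A condensed sketch
(hypothetical helper; radix-tree tagging and bdi accounting omitted, and a
non-NULL mapping assumed):

	static void example_set_writeback(struct page *page)
	{
		struct address_space *mapping = page_mapping(page);
		unsigned long flags;

		spin_lock_irqsave(&mapping->tree_lock, flags);
		if (!TestSetPageWriteback(page))
			mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_WRITEBACK);
		spin_unlock_irqrestore(&mapping->tree_lock, flags);
	}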

Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrea Righi <arighi@develer.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
---
 fs/nfs/write.c      |    4 ++++
 mm/filemap.c        |    1 +
 mm/page-writeback.c |    4 ++++
 mm/truncate.c       |    1 +
 4 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 4c14c17..a3c39f7 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -450,6 +450,7 @@ nfs_mark_request_commit(struct nfs_page *req)
 			NFS_PAGE_TAG_COMMIT);
 	nfsi->ncommit++;
 	spin_unlock(&inode->i_lock);
+	mem_cgroup_inc_page_stat(req->wb_page, MEMCG_NR_FILE_UNSTABLE_NFS);
 	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
 	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
 	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
@@ -461,6 +462,7 @@ nfs_clear_request_commit(struct nfs_page *req)
 	struct page *page = req->wb_page;
 
 	if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
+		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_UNSTABLE_NFS);
 		dec_zone_page_state(page, NR_UNSTABLE_NFS);
 		dec_bdi_stat(page->mapping->backing_dev_info, BDI_RECLAIMABLE);
 		return 1;
@@ -1316,6 +1318,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
 		req = nfs_list_entry(head->next);
 		nfs_list_remove_request(req);
 		nfs_mark_request_commit(req);
+		mem_cgroup_dec_page_stat(req->wb_page,
+					 MEMCG_NR_FILE_UNSTABLE_NFS);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
 		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
 				BDI_RECLAIMABLE);
diff --git a/mm/filemap.c b/mm/filemap.c
index 49b2d2e..f6bd6f2 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -146,6 +146,7 @@ void __remove_from_page_cache(struct page *page)
 	 * having removed the page entirely.
 	 */
 	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
 		dec_zone_page_state(page, NR_FILE_DIRTY);
 		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
 	}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 722bd61..b3bb2fb 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1118,6 +1118,7 @@ int __set_page_dirty_no_writeback(struct page *page)
 void account_page_dirtied(struct page *page, struct address_space *mapping)
 {
 	if (mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_DIRTIED);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
@@ -1307,6 +1308,7 @@ int clear_page_dirty_for_io(struct page *page)
 		 * for more comments.
 		 */
 		if (TestClearPageDirty(page)) {
+			mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
@@ -1337,6 +1339,7 @@ int test_clear_page_writeback(struct page *page)
 				__dec_bdi_stat(bdi, BDI_WRITEBACK);
 				__bdi_writeout_inc(bdi);
 			}
+			mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_WRITEBACK);
 		}
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
@@ -1364,6 +1367,7 @@ int test_set_page_writeback(struct page *page)
 						PAGECACHE_TAG_WRITEBACK);
 			if (bdi_cap_account_writeback(bdi))
 				__inc_bdi_stat(bdi, BDI_WRITEBACK);
+			mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_WRITEBACK);
 		}
 		if (!PageDirty(page))
 			radix_tree_tag_clear(&mapping->page_tree,
diff --git a/mm/truncate.c b/mm/truncate.c
index cd94607..54cca83 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -76,6 +76,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
 	if (TestClearPageDirty(page)) {
 		struct address_space *mapping = page->mapping;
 		if (mapping && mapping_cap_account_dirty(mapping)) {
+			mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH v4 08/11] memcg: add dirty limits to mem_cgroup
  2010-10-29  7:09 ` Greg Thelen
@ 2010-10-29  7:09   ` Greg Thelen
  -1 siblings, 0 replies; 75+ messages in thread
From: Greg Thelen @ 2010-10-29  7:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes, Wu Fengguang, Greg Thelen

Extend mem_cgroup to contain dirty page limits.  Also add routines
allowing the kernel to query the dirty usage of a memcg.

These interfaces are not yet used by the kernel.  A subsequent commit
will add kernel calls to utilize these new routines.
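
A minimal sketch of the expected consumer (hypothetical helper; the real
call sites come in the subsequent commit mentioned above): when the current
task's cgroup has local dirty settings, mem_cgroup_dirty_info() fills the
dirty_info snapshot, otherwise the caller falls back to the system-wide
numbers from the earlier writeback patch:

	static void example_dirty_info(struct dirty_info *info)
	{
		unsigned long sys_available_mem = determine_dirtyable_memory();

		/* Returns false when memcg is disabled or has no local limits. */
		if (!mem_cgroup_dirty_info(sys_available_mem, info))
			global_dirty_info(info);
	}

Callers that only need a yes/no answer can use mem_cgroup_has_dirty_limit().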

Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrea Righi <arighi@develer.com>
---
Changelog since v3:
- Previously memcontrol.c used struct vm_dirty_param and vm_dirty_param() to
  advertise dirty memory limits.  Now struct dirty_info and
  mem_cgroup_dirty_info() is used to share dirty limits between memcontrol and
  the rest of the kernel.
- __mem_cgroup_has_dirty_limit() now returns false if use_hierarchy is set.
- memcg_hierarchical_free_pages() now uses parent_mem_cgroup() and is simpler.
- created internal routine, __mem_cgroup_has_dirty_limit(), to consolidate the
  logic.

Changelog since v1:
- Rename (for clarity):
  - mem_cgroup_write_page_stat_item -> mem_cgroup_page_stat_item
  - mem_cgroup_read_page_stat_item -> mem_cgroup_nr_pages_item
- Removed unnecessary get_ prefix from get_xxx() functions.
- Avoid lockdep warnings by using rcu_read_[un]lock() in
  mem_cgroup_has_dirty_limit().

 include/linux/memcontrol.h |   30 ++++++
 mm/memcontrol.c            |  248 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 277 insertions(+), 1 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ef2eec7..736d318 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -19,6 +19,7 @@
 
 #ifndef _LINUX_MEMCONTROL_H
 #define _LINUX_MEMCONTROL_H
+#include <linux/writeback.h>
 #include <linux/cgroup.h>
 struct mem_cgroup;
 struct page_cgroup;
@@ -33,6 +34,14 @@ enum mem_cgroup_page_stat_item {
 	MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
 };
 
+/* Cgroup memory statistics items exported to the kernel. */
+enum mem_cgroup_nr_pages_item {
+	MEMCG_NR_DIRTYABLE_PAGES,
+	MEMCG_NR_RECLAIM_PAGES,
+	MEMCG_NR_WRITEBACK,
+	MEMCG_NR_DIRTY_WRITEBACK_PAGES,
+};
+
 extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
 					struct list_head *dst,
 					unsigned long *scanned, int order,
@@ -145,6 +154,11 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 	mem_cgroup_update_page_stat(page, idx, -1);
 }
 
+bool mem_cgroup_has_dirty_limit(void);
+bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
+			   struct dirty_info *info);
+s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item);
+
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask);
 u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
@@ -326,6 +340,22 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 {
 }
 
+static inline bool mem_cgroup_has_dirty_limit(void)
+{
+	return false;
+}
+
+static inline bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
+					 struct dirty_info *info)
+{
+	return false;
+}
+
+static inline s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item)
+{
+	return -ENOSYS;
+}
+
 static inline
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 					    gfp_t gfp_mask)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7f91029..52d688d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -188,6 +188,14 @@ struct mem_cgroup_eventfd_list {
 static void mem_cgroup_threshold(struct mem_cgroup *mem);
 static void mem_cgroup_oom_notify(struct mem_cgroup *mem);
 
+/* Dirty memory parameters */
+struct vm_dirty_param {
+	int dirty_ratio;
+	int dirty_background_ratio;
+	unsigned long dirty_bytes;
+	unsigned long dirty_background_bytes;
+};
+
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
@@ -233,6 +241,10 @@ struct mem_cgroup {
 	atomic_t	refcnt;
 
 	unsigned int	swappiness;
+
+	/* control memory cgroup dirty pages */
+	struct vm_dirty_param dirty_param;
+
 	/* OOM-Killer disable */
 	int		oom_kill_disable;
 
@@ -1132,6 +1144,232 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
 	return swappiness;
 }
 
+/*
+ * Return true if the current memory cgroup has local dirty memory settings.
+ * There is an allowed race between the current task migrating in-to/out-of the
+ * root cgroup while this routine runs.  So the return value may be incorrect if
+ * the current task is being simultaneously migrated.
+ */
+static bool __mem_cgroup_has_dirty_limit(struct mem_cgroup *mem)
+{
+	if (!mem)
+		return false;
+	if (mem_cgroup_is_root(mem))
+		return false;
+	/*
+	 * The current memcg implementation does not yet support hierarchical
+	 * dirty limits.
+	 */
+	if (mem->use_hierarchy)
+		return false;
+	return true;
+}
+
+bool mem_cgroup_has_dirty_limit(void)
+{
+	struct mem_cgroup *mem;
+	bool ret;
+
+	if (mem_cgroup_disabled())
+		return false;
+
+	rcu_read_lock();
+	mem = mem_cgroup_from_task(current);
+	ret = __mem_cgroup_has_dirty_limit(mem);
+	rcu_read_unlock();
+
+	return ret;
+}
+
+/*
+ * Returns a snapshot of the current dirty limits which is not synchronized with
+ * the routines that change the dirty limits.  If this routine races with an
+ * update to the dirty bytes/ratio value, then the caller must handle the case
+ * where both dirty_[background_]_ratio and _bytes are set.
+ */
+static void __mem_cgroup_dirty_param(struct vm_dirty_param *param,
+				     struct mem_cgroup *mem)
+{
+	if (__mem_cgroup_has_dirty_limit(mem)) {
+		param->dirty_ratio = mem->dirty_param.dirty_ratio;
+		param->dirty_bytes = mem->dirty_param.dirty_bytes;
+		param->dirty_background_ratio =
+			mem->dirty_param.dirty_background_ratio;
+		param->dirty_background_bytes =
+			mem->dirty_param.dirty_background_bytes;
+	} else {
+		param->dirty_ratio = vm_dirty_ratio;
+		param->dirty_bytes = vm_dirty_bytes;
+		param->dirty_background_ratio = dirty_background_ratio;
+		param->dirty_background_bytes = dirty_background_bytes;
+	}
+}
+
+/*
+ * Return the background-writeback and dirty-throttling thresholds as well as
+ * dirty usage metrics.
+ *
+ * The current task may be moved to another cgroup while this routine accesses
+ * the dirty limit.  But a precise check is meaningless because the task can be
+ * moved after our access and writeback tends to take a long time.  At least,
+ * "memcg" will not be freed while holding rcu_read_lock().
+ */
+bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
+			   struct dirty_info *info)
+{
+	s64 available_mem;
+	struct vm_dirty_param dirty_param;
+	struct mem_cgroup *memcg;
+
+	if (mem_cgroup_disabled())
+		return false;
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	if (!__mem_cgroup_has_dirty_limit(memcg)) {
+		rcu_read_unlock();
+		return false;
+	}
+	__mem_cgroup_dirty_param(&dirty_param, memcg);
+	rcu_read_unlock();
+
+	available_mem = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
+	if (available_mem < 0)
+		return false;
+
+	available_mem = min((unsigned long)available_mem, sys_available_mem);
+
+	if (dirty_param.dirty_bytes)
+		info->dirty_thresh =
+			DIV_ROUND_UP(dirty_param.dirty_bytes, PAGE_SIZE);
+	else
+		info->dirty_thresh =
+			(dirty_param.dirty_ratio * available_mem) / 100;
+
+	if (dirty_param.dirty_background_bytes)
+		info->background_thresh =
+			DIV_ROUND_UP(dirty_param.dirty_background_bytes,
+				     PAGE_SIZE);
+	else
+		info->background_thresh =
+			(dirty_param.dirty_background_ratio *
+			       available_mem) / 100;
+
+	info->nr_reclaimable =
+		mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
+	if (info->nr_reclaimable < 0)
+		return false;
+
+	info->nr_writeback = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
+	if (info->nr_writeback < 0)
+		return false;
+
+	return true;
+}
+
+static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
+{
+	if (!do_swap_account)
+		return nr_swap_pages > 0;
+	return !memcg->memsw_is_minimum &&
+		(res_counter_read_u64(&memcg->memsw, RES_LIMIT) > 0);
+}
+
+static s64 mem_cgroup_local_page_stat(struct mem_cgroup *mem,
+				      enum mem_cgroup_nr_pages_item item)
+{
+	s64 ret;
+
+	switch (item) {
+	case MEMCG_NR_DIRTYABLE_PAGES:
+		ret = mem_cgroup_read_stat(mem, LRU_ACTIVE_FILE) +
+			mem_cgroup_read_stat(mem, LRU_INACTIVE_FILE);
+		if (mem_cgroup_can_swap(mem))
+			ret += mem_cgroup_read_stat(mem, LRU_ACTIVE_ANON) +
+				mem_cgroup_read_stat(mem, LRU_INACTIVE_ANON);
+		break;
+	case MEMCG_NR_RECLAIM_PAGES:
+		ret = mem_cgroup_read_stat(mem,	MEM_CGROUP_STAT_FILE_DIRTY) +
+			mem_cgroup_read_stat(mem,
+					     MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
+		break;
+	case MEMCG_NR_WRITEBACK:
+		ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_WRITEBACK);
+		break;
+	case MEMCG_NR_DIRTY_WRITEBACK_PAGES:
+		ret = mem_cgroup_read_stat(mem,
+					   MEM_CGROUP_STAT_FILE_WRITEBACK) +
+			mem_cgroup_read_stat(mem,
+					     MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
+		break;
+	default:
+		BUG();
+		break;
+	}
+	return ret;
+}
+
+/*
+ * Return the number of pages that the @mem cgroup could allocate.  If
+ * use_hierarchy is set, then this involves parent mem cgroups to find the
+ * cgroup with the smallest free space.
+ */
+static unsigned long long
+memcg_hierarchical_free_pages(struct mem_cgroup *mem)
+{
+	unsigned long free, min_free;
+
+	min_free = global_page_state(NR_FREE_PAGES) << PAGE_SHIFT;
+
+	while (mem) {
+		free = res_counter_read_u64(&mem->res, RES_LIMIT) -
+			res_counter_read_u64(&mem->res, RES_USAGE);
+		min_free = min(min_free, free);
+		mem = parent_mem_cgroup(mem);
+	}
+
+	/* Translate free memory in pages */
+	return min_free >> PAGE_SHIFT;
+}
+
+/*
+ * mem_cgroup_page_stat() - get memory cgroup file cache statistics
+ * @item:      memory statistic item exported to the kernel
+ *
+ * Return the accounted statistic value or negative value if current task is
+ * root cgroup.
+ */
+s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item)
+{
+	struct mem_cgroup *mem;
+	struct mem_cgroup *iter;
+	s64 value;
+
+	rcu_read_lock();
+	mem = mem_cgroup_from_task(current);
+	if (__mem_cgroup_has_dirty_limit(mem)) {
+		/*
+		 * If we're looking for dirtyable pages we need to evaluate
+		 * free pages depending on the limit and usage of the parents
+		 * first of all.
+		 */
+		if (item == MEMCG_NR_DIRTYABLE_PAGES)
+			value = memcg_hierarchical_free_pages(mem);
+		else
+			value = 0;
+		/*
+		 * Recursively evaluate page statistics against all cgroups
+		 * under the hierarchy tree.
+		 */
+		for_each_mem_cgroup_tree(iter, mem)
+			value += mem_cgroup_local_page_stat(iter, item);
+	} else
+		value = -EINVAL;
+	rcu_read_unlock();
+
+	return value;
+}
+
 static void mem_cgroup_start_move(struct mem_cgroup *mem)
 {
 	int cpu;
@@ -4440,8 +4678,16 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	spin_lock_init(&mem->reclaim_param_lock);
 	INIT_LIST_HEAD(&mem->oom_notify);
 
-	if (parent)
+	if (parent) {
 		mem->swappiness = get_swappiness(parent);
+		__mem_cgroup_dirty_param(&mem->dirty_param, parent);
+	} else {
+		/*
+		 * The root cgroup dirty_param field is not used, instead,
+		 * system-wide dirty limits are used.
+		 */
+	}
+
 	atomic_set(&mem->refcnt, 1);
 	mem->move_charge_at_immigrate = 0;
 	mutex_init(&mem->thresholds_lock);
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 75+ messages in thread

+
+	if (dirty_param.dirty_bytes)
+		info->dirty_thresh =
+			DIV_ROUND_UP(dirty_param.dirty_bytes, PAGE_SIZE);
+	else
+		info->dirty_thresh =
+			(dirty_param.dirty_ratio * available_mem) / 100;
+
+	if (dirty_param.dirty_background_bytes)
+		info->background_thresh =
+			DIV_ROUND_UP(dirty_param.dirty_background_bytes,
+				     PAGE_SIZE);
+	else
+		info->background_thresh =
+			(dirty_param.dirty_background_ratio *
+			       available_mem) / 100;
+
+	info->nr_reclaimable =
+		mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
+	if (info->nr_reclaimable < 0)
+		return false;
+
+	info->nr_writeback = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
+	if (info->nr_writeback < 0)
+		return false;
+
+	return true;
+}
+
+static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
+{
+	if (!do_swap_account)
+		return nr_swap_pages > 0;
+	return !memcg->memsw_is_minimum &&
+		(res_counter_read_u64(&memcg->memsw, RES_LIMIT) > 0);
+}
+
+static s64 mem_cgroup_local_page_stat(struct mem_cgroup *mem,
+				      enum mem_cgroup_nr_pages_item item)
+{
+	s64 ret;
+
+	switch (item) {
+	case MEMCG_NR_DIRTYABLE_PAGES:
+		ret = mem_cgroup_read_stat(mem, LRU_ACTIVE_FILE) +
+			mem_cgroup_read_stat(mem, LRU_INACTIVE_FILE);
+		if (mem_cgroup_can_swap(mem))
+			ret += mem_cgroup_read_stat(mem, LRU_ACTIVE_ANON) +
+				mem_cgroup_read_stat(mem, LRU_INACTIVE_ANON);
+		break;
+	case MEMCG_NR_RECLAIM_PAGES:
+		ret = mem_cgroup_read_stat(mem,	MEM_CGROUP_STAT_FILE_DIRTY) +
+			mem_cgroup_read_stat(mem,
+					     MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
+		break;
+	case MEMCG_NR_WRITEBACK:
+		ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_WRITEBACK);
+		break;
+	case MEMCG_NR_DIRTY_WRITEBACK_PAGES:
+		ret = mem_cgroup_read_stat(mem,
+					   MEM_CGROUP_STAT_FILE_WRITEBACK) +
+			mem_cgroup_read_stat(mem,
+					     MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
+		break;
+	default:
+		BUG();
+		break;
+	}
+	return ret;
+}
+
+/*
+ * Return the number of pages that the @mem cgroup could allocate.  If
+ * use_hierarchy is set, then this involves parent mem cgroups to find the
+ * cgroup with the smallest free space.
+ */
+static unsigned long long
+memcg_hierarchical_free_pages(struct mem_cgroup *mem)
+{
+	unsigned long free, min_free;
+
+	min_free = global_page_state(NR_FREE_PAGES) << PAGE_SHIFT;
+
+	while (mem) {
+		free = res_counter_read_u64(&mem->res, RES_LIMIT) -
+			res_counter_read_u64(&mem->res, RES_USAGE);
+		min_free = min(min_free, free);
+		mem = parent_mem_cgroup(mem);
+	}
+
+	/* Translate free memory in pages */
+	return min_free >> PAGE_SHIFT;
+}
+
+/*
+ * mem_cgroup_page_stat() - get memory cgroup file cache statistics
+ * @item:      memory statistic item exported to the kernel
+ *
+ * Return the accounted statistic value or negative value if current task is
+ * root cgroup.
+ */
+s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item)
+{
+	struct mem_cgroup *mem;
+	struct mem_cgroup *iter;
+	s64 value;
+
+	rcu_read_lock();
+	mem = mem_cgroup_from_task(current);
+	if (__mem_cgroup_has_dirty_limit(mem)) {
+		/*
+		 * If we're looking for dirtyable pages we need to evaluate
+		 * free pages depending on the limit and usage of the parents
+		 * first of all.
+		 */
+		if (item == MEMCG_NR_DIRTYABLE_PAGES)
+			value = memcg_hierarchical_free_pages(mem);
+		else
+			value = 0;
+		/*
+		 * Recursively evaluate page statistics against all cgroup
+		 * under hierarchy tree
+		 */
+		for_each_mem_cgroup_tree(iter, mem)
+			value += mem_cgroup_local_page_stat(iter, item);
+	} else
+		value = -EINVAL;
+	rcu_read_unlock();
+
+	return value;
+}
+
 static void mem_cgroup_start_move(struct mem_cgroup *mem)
 {
 	int cpu;
@@ -4440,8 +4678,16 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	spin_lock_init(&mem->reclaim_param_lock);
 	INIT_LIST_HEAD(&mem->oom_notify);
 
-	if (parent)
+	if (parent) {
 		mem->swappiness = get_swappiness(parent);
+		__mem_cgroup_dirty_param(&mem->dirty_param, parent);
+	} else {
+		/*
+		 * The root cgroup dirty_param field is not used, instead,
+		 * system-wide dirty limits are used.
+		 */
+	}
+
 	atomic_set(&mem->refcnt, 1);
 	mem->move_charge_at_immigrate = 0;
 	mutex_init(&mem->thresholds_lock);
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH v4 09/11] memcg: CPU hotplug lockdep warning fix
  2010-10-29  7:09 ` Greg Thelen
@ 2010-10-29  7:09   ` Greg Thelen
  -1 siblings, 0 replies; 75+ messages in thread
From: Greg Thelen @ 2010-10-29  7:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes, Wu Fengguang

From: Balbir Singh <balbir@linux.vnet.ibm.com>

memcg has lockdep warnings (sleep inside rcu lock)

From: Balbir Singh <balbir@linux.vnet.ibm.com>

The recent move to get_online_cpus() ends up calling get_online_cpus() from
mem_cgroup_read_stat().  However, mem_cgroup_read_stat() is called under the
rcu lock, and get_online_cpus() can sleep.  The dirty limit patches expose
this BUG more readily due to their usage of mem_cgroup_page_stat().

This patch addresses the issue identified by lockdep and moves the
hotplug protection to a higher layer.  This might increase the time
required to hotplug, but not by much.
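
Schematically, the invalid call chain before this patch was (illustrative,
abbreviated from the lockdep report below):

  mem_cgroup_page_stat()
    rcu_read_lock()
    mem_cgroup_read_stat()
      get_online_cpus()        <-- may sleep while under rcu_read_lock()
    rcu_read_unlock()

With this patch, get_online_cpus()/put_online_cpus() are taken around the
RCU section in mem_cgroup_page_stat() instead, so mem_cgroup_read_stat()
no longer sleeps under the RCU lock.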

Warning messages

BUG: sleeping function called from invalid context at kernel/cpu.c:62
in_atomic(): 0, irqs_disabled(): 0, pid: 6325, name: pagetest
2 locks held by pagetest/6325:
do_page_fault+0x27d/0x4a0
mem_cgroup_page_stat+0x0/0x23f
Pid: 6325, comm: pagetest Not tainted 2.6.36-rc5-mm1+ #201
Call Trace:
[<ffffffff81041224>] __might_sleep+0x12d/0x131
[<ffffffff8104f4af>] get_online_cpus+0x1c/0x51
[<ffffffff8110eedb>] mem_cgroup_read_stat+0x27/0xa3
[<ffffffff811125d2>] mem_cgroup_page_stat+0x131/0x23f
[<ffffffff811124a1>] ? mem_cgroup_page_stat+0x0/0x23f
[<ffffffff810d57c3>] global_dirty_limits+0x42/0xf8
[<ffffffff810d58b3>] throttle_vm_writeout+0x3a/0xb4
[<ffffffff810dc2f8>] shrink_zone+0x3e6/0x3f8
[<ffffffff81074a35>] ? ktime_get_ts+0xb2/0xbf
[<ffffffff810dd1aa>] do_try_to_free_pages+0x106/0x478
[<ffffffff810dd601>] try_to_free_mem_cgroup_pages+0xe5/0x14c
[<ffffffff8110f947>] mem_cgroup_hierarchical_reclaim+0x314/0x3a2
[<ffffffff81111b31>] __mem_cgroup_try_charge+0x29b/0x593
[<ffffffff8111194a>] ? __mem_cgroup_try_charge+0xb4/0x593
[<ffffffff81071258>] ? local_clock+0x40/0x59
[<ffffffff81009015>] ? sched_clock+0x9/0xd
[<ffffffff810710d5>] ? sched_clock_local+0x1c/0x82
[<ffffffff8111398a>] mem_cgroup_charge_common+0x4b/0x76
[<ffffffff81141469>] ? bio_add_page+0x36/0x38
[<ffffffff81113ba9>] mem_cgroup_cache_charge+0x1f4/0x214
[<ffffffff810cd195>] add_to_page_cache_locked+0x4a/0x148
....

Acked-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---
Changelog since v3:
- Make use of new routine: __mem_cgroup_has_dirty_limit()

 mm/memcontrol.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 52d688d..35dc329 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -579,7 +579,6 @@ static s64 mem_cgroup_read_stat(struct mem_cgroup *mem,
 	int cpu;
 	s64 val = 0;
 
-	get_online_cpus();
 	for_each_online_cpu(cpu)
 		val += per_cpu(mem->stat->count[idx], cpu);
 #ifdef CONFIG_HOTPLUG_CPU
@@ -587,7 +586,6 @@ static s64 mem_cgroup_read_stat(struct mem_cgroup *mem,
 	val += mem->nocpu_base.count[idx];
 	spin_unlock(&mem->pcp_counter_lock);
 #endif
-	put_online_cpus();
 	return val;
 }
 
@@ -1345,6 +1343,7 @@ s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item)
 	struct mem_cgroup *iter;
 	s64 value;
 
+	get_online_cpus();
 	rcu_read_lock();
 	mem = mem_cgroup_from_task(current);
 	if (__mem_cgroup_has_dirty_limit(mem)) {
@@ -1366,6 +1365,7 @@ s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item)
 	} else
 		value = -EINVAL;
 	rcu_read_unlock();
+	put_online_cpus();
 
 	return value;
 }
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH v4 10/11] memcg: add cgroupfs interface to memcg dirty limits
  2010-10-29  7:09 ` Greg Thelen
@ 2010-10-29  7:09   ` Greg Thelen
  -1 siblings, 0 replies; 75+ messages in thread
From: Greg Thelen @ 2010-10-29  7:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes, Wu Fengguang, Greg Thelen

Add cgroupfs interface to memcg dirty page limits:
  Direct write-out is controlled with:
  - memory.dirty_ratio
  - memory.dirty_limit_in_bytes

  Background write-out is controlled with:
  - memory.dirty_background_ratio
  - memory.dirty_background_limit_in_bytes

Other memcg cgroupfs files support 'M', 'm', 'k', 'K', 'g'
and 'G' suffixes for byte counts.  This patch provides the
same functionality for memory.dirty_limit_in_bytes and
memory.dirty_background_limit_in_bytes.
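
As a usage illustration only (the cgroup mount point /cgroups/foo below is
an assumption, not something created by this patch), a small userspace C
program could configure the new limits like this:

  #include <stdio.h>

  /*
   * Set a 200M foreground limit and a 100M background limit on an
   * already-created memcg.  Writing a byte value clears the corresponding
   * ratio setting, and writing a ratio clears the byte value.
   */
  int main(void)
  {
          FILE *f;

          f = fopen("/cgroups/foo/memory.dirty_limit_in_bytes", "w");
          if (!f)
                  return 1;
          fprintf(f, "200M\n");           /* [kKmMgG] suffixes accepted */
          fclose(f);

          f = fopen("/cgroups/foo/memory.dirty_background_limit_in_bytes", "w");
          if (!f)
                  return 1;
          fprintf(f, "100M\n");
          fclose(f);
          return 0;
  }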

Signed-off-by: Andrea Righi <arighi@develer.com>
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
---
Changelog since v3:
- Make use of new routine, __mem_cgroup_has_dirty_limit(), to disable memcg
  dirty limits when use_hierarchy=1.

Changelog since v1:
- Renamed newly created proc files:
  - memory.dirty_bytes -> memory.dirty_limit_in_bytes
  - memory.dirty_background_bytes -> memory.dirty_background_limit_in_bytes
- Allow [kKmMgG] suffixes for newly created dirty limit value cgroupfs files.

 mm/memcontrol.c |  114 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 114 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 35dc329..52141c5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -100,6 +100,13 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_NSTATS,
 };
 
+enum {
+	MEM_CGROUP_DIRTY_RATIO,
+	MEM_CGROUP_DIRTY_LIMIT_IN_BYTES,
+	MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
+	MEM_CGROUP_DIRTY_BACKGROUND_LIMIT_IN_BYTES,
+};
+
 struct mem_cgroup_stat_cpu {
 	s64 count[MEM_CGROUP_STAT_NSTATS];
 };
@@ -4356,6 +4363,89 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
 	return 0;
 }
 
+static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+	bool use_sys = !__mem_cgroup_has_dirty_limit(mem);
+
+	switch (cft->private) {
+	case MEM_CGROUP_DIRTY_RATIO:
+		return use_sys ? vm_dirty_ratio : mem->dirty_param.dirty_ratio;
+	case MEM_CGROUP_DIRTY_LIMIT_IN_BYTES:
+		return use_sys ? vm_dirty_bytes : mem->dirty_param.dirty_bytes;
+	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
+		return use_sys ? dirty_background_ratio :
+			mem->dirty_param.dirty_background_ratio;
+	case MEM_CGROUP_DIRTY_BACKGROUND_LIMIT_IN_BYTES:
+		return use_sys ? dirty_background_bytes :
+			mem->dirty_param.dirty_background_bytes;
+	default:
+		BUG();
+	}
+}
+
+static int
+mem_cgroup_dirty_write_string(struct cgroup *cgrp, struct cftype *cft,
+				const char *buffer)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	int type = cft->private;
+	int ret = -EINVAL;
+	unsigned long long val;
+
+	if (!__mem_cgroup_has_dirty_limit(memcg))
+		return ret;
+
+	switch (type) {
+	case MEM_CGROUP_DIRTY_LIMIT_IN_BYTES:
+		/* This function does all necessary parse...reuse it */
+		ret = res_counter_memparse_write_strategy(buffer, &val);
+		if (ret)
+			break;
+		memcg->dirty_param.dirty_bytes = val;
+		memcg->dirty_param.dirty_ratio  = 0;
+		break;
+	case MEM_CGROUP_DIRTY_BACKGROUND_LIMIT_IN_BYTES:
+		ret = res_counter_memparse_write_strategy(buffer, &val);
+		if (ret)
+			break;
+		memcg->dirty_param.dirty_background_bytes = val;
+		memcg->dirty_param.dirty_background_ratio = 0;
+		break;
+	default:
+		BUG();
+		break;
+	}
+	return ret;
+}
+
+static int
+mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	int type = cft->private;
+
+	if (!__mem_cgroup_has_dirty_limit(memcg))
+		return -EINVAL;
+	if ((type == MEM_CGROUP_DIRTY_RATIO ||
+	     type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)
+		return -EINVAL;
+	switch (type) {
+	case MEM_CGROUP_DIRTY_RATIO:
+		memcg->dirty_param.dirty_ratio = val;
+		memcg->dirty_param.dirty_bytes = 0;
+		break;
+	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
+		memcg->dirty_param.dirty_background_ratio = val;
+		memcg->dirty_param.dirty_background_bytes = 0;
+		break;
+	default:
+		BUG();
+		break;
+	}
+	return 0;
+}
+
 static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
@@ -4419,6 +4509,30 @@ static struct cftype mem_cgroup_files[] = {
 		.unregister_event = mem_cgroup_oom_unregister_event,
 		.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
 	},
+	{
+		.name = "dirty_ratio",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_u64 = mem_cgroup_dirty_write,
+		.private = MEM_CGROUP_DIRTY_RATIO,
+	},
+	{
+		.name = "dirty_limit_in_bytes",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_string = mem_cgroup_dirty_write_string,
+		.private = MEM_CGROUP_DIRTY_LIMIT_IN_BYTES,
+	},
+	{
+		.name = "dirty_background_ratio",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_u64 = mem_cgroup_dirty_write,
+		.private = MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
+	},
+	{
+		.name = "dirty_background_limit_in_bytes",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_string = mem_cgroup_dirty_write_string,
+		.private = MEM_CGROUP_DIRTY_BACKGROUND_LIMIT_IN_BYTES,
+	},
 };
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* [PATCH v4 11/11] memcg: check memcg dirty limits in page writeback
  2010-10-29  7:09 ` Greg Thelen
@ 2010-10-29  7:09   ` Greg Thelen
  -1 siblings, 0 replies; 75+ messages in thread
From: Greg Thelen @ 2010-10-29  7:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes, Wu Fengguang, Greg Thelen

If the current process is in a non-root memcg, then
balance_dirty_pages() will consider the memcg dirty limits
as well as the system-wide limits.  This allows different
cgroups to have distinct dirty limits which trigger direct
and background writeback at different levels.
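
As a rough worked example (the numbers are illustrative only): a memcg whose
dirtyable memory works out to 262144 pages (1GiB of 4KiB pages, assuming the
system-wide figure is larger) and whose memory.dirty_ratio is 10 gets a dirty
threshold of 262144 * 10 / 100 = 26214 pages (~100MiB).  Once the cgroup's
reclaimable plus writeback pages exceed that, its writers are throttled here
even when the global limits are nowhere near exceeded.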

Signed-off-by: Andrea Righi <arighi@develer.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
---
Changelog since v3:
- Leave determine_dirtyable_memory() static.  v3 made it non-static.
- balance_dirty_pages() now considers both system and memcg dirty limits and
  usage data.  This data is retrieved with global_dirty_info() and
  memcg_dirty_info().  

 mm/page-writeback.c |  109 ++++++++++++++++++++++++++++++++++++--------------
 1 files changed, 78 insertions(+), 31 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b3bb2fb..57caee5 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -131,6 +131,18 @@ EXPORT_SYMBOL(laptop_mode);
 static struct prop_descriptor vm_completions;
 static struct prop_descriptor vm_dirties;
 
+static unsigned long dirty_writeback_pages(void)
+{
+	s64 ret;
+
+	ret = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
+	if (ret < 0)
+		ret = global_page_state(NR_UNSTABLE_NFS) +
+			global_page_state(NR_WRITEBACK);
+
+	return ret;
+}
+
 /*
  * couple the period to the dirty_ratio:
  *
@@ -398,45 +410,67 @@ unsigned long determine_dirtyable_memory(void)
 }
 
 /*
+ * The dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and
+ * runtime tasks.
+ */
+static inline void adjust_dirty_info(struct dirty_info *info)
+{
+	struct task_struct *tsk;
+
+	if (info->background_thresh >= info->dirty_thresh)
+		info->background_thresh = info->dirty_thresh / 2;
+	tsk = current;
+	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
+		info->background_thresh += info->background_thresh / 4;
+		info->dirty_thresh += info->dirty_thresh / 4;
+	}
+}
+
+/*
  * global_dirty_info - return background-writeback and dirty-throttling
  * thresholds as well as dirty usage metrics.
  *
  * Calculate the dirty thresholds based on sysctl parameters
  * - vm.dirty_background_ratio  or  vm.dirty_background_bytes
  * - vm.dirty_ratio             or  vm.dirty_bytes
- * The dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and
- * runtime tasks.
  */
 void global_dirty_info(struct dirty_info *info)
 {
-	unsigned long background;
-	unsigned long dirty;
 	unsigned long available_memory = determine_dirtyable_memory();
-	struct task_struct *tsk;
 
 	if (vm_dirty_bytes)
-		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
+		info->dirty_thresh = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
 	else
-		dirty = (vm_dirty_ratio * available_memory) / 100;
+		info->dirty_thresh = (vm_dirty_ratio * available_memory) / 100;
 
 	if (dirty_background_bytes)
-		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
+		info->background_thresh =
+			DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
 	else
-		background = (dirty_background_ratio * available_memory) / 100;
+		info->background_thresh =
+			(dirty_background_ratio * available_memory) / 100;
 
 	info->nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 				global_page_state(NR_UNSTABLE_NFS);
 	info->nr_writeback = global_page_state(NR_WRITEBACK);
 
-	if (background >= dirty)
-		background = dirty / 2;
-	tsk = current;
-	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
-		background += background / 4;
-		dirty += dirty / 4;
-	}
-	info->background_thresh = background;
-	info->dirty_thresh = dirty;
+	adjust_dirty_info(info);
+}
+
+/*
+ * Calculate the background-writeback and dirty-throttling thresholds and dirty
+ * usage metrics from the current task's memcg dirty limit parameters.  Returns
+ * false if no memcg limits exist.
+ */
+static bool memcg_dirty_info(struct dirty_info *info)
+{
+	unsigned long available_memory = determine_dirtyable_memory();
+
+	if (!mem_cgroup_dirty_info(available_memory, info))
+		return false;
+
+	adjust_dirty_info(info);
+	return true;
 }
 
 /*
@@ -480,7 +514,8 @@ unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty)
 static void balance_dirty_pages(struct address_space *mapping,
 				unsigned long write_chunk)
 {
-	struct dirty_info dirty_info;
+	struct dirty_info sys_info;
+	struct dirty_info memcg_info;
 	long bdi_nr_reclaimable;
 	long bdi_nr_writeback;
 	unsigned long bdi_thresh;
@@ -497,19 +532,27 @@ static void balance_dirty_pages(struct address_space *mapping,
 			.range_cyclic	= 1,
 		};
 
-		global_dirty_info(&dirty_info);
+		global_dirty_info(&sys_info);
+
+		if (!memcg_dirty_info(&memcg_info))
+			memcg_info = sys_info;
 
 		/*
 		 * Throttle it only when the background writeback cannot
 		 * catch-up. This avoids (excessively) small writeouts
 		 * when the bdi limits are ramping up.
 		 */
-		if (dirty_info.nr_reclaimable + dirty_info.nr_writeback <=
-				(dirty_info.background_thresh +
-				 dirty_info.dirty_thresh) / 2)
+		if ((sys_info.nr_reclaimable + sys_info.nr_writeback <=
+				(sys_info.background_thresh +
+				 sys_info.dirty_thresh) / 2) &&
+		    (memcg_info.nr_reclaimable + memcg_info.nr_writeback <=
+				(memcg_info.background_thresh +
+				 memcg_info.dirty_thresh) / 2))
 			break;
 
-		bdi_thresh = bdi_dirty_limit(bdi, dirty_info.dirty_thresh);
+		bdi_thresh = bdi_dirty_limit(bdi,
+				min(sys_info.dirty_thresh,
+				    memcg_info.dirty_thresh));
 		bdi_thresh = task_dirty_limit(current, bdi_thresh);
 
 		/*
@@ -538,9 +581,12 @@ static void balance_dirty_pages(struct address_space *mapping,
 		 */
 		dirty_exceeded =
 			(bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
-			|| (dirty_info.nr_reclaimable +
-			    dirty_info.nr_writeback >
-			    dirty_info.dirty_thresh);
+			|| (sys_info.nr_reclaimable +
+			    sys_info.nr_writeback >
+			    sys_info.dirty_thresh)
+			|| (memcg_info.nr_reclaimable +
+			    memcg_info.nr_writeback >
+			    memcg_info.dirty_thresh);
 
 		if (!dirty_exceeded)
 			break;
@@ -593,8 +639,10 @@ static void balance_dirty_pages(struct address_space *mapping,
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
 	if ((laptop_mode && pages_written) ||
-	    (!laptop_mode && (dirty_info.nr_reclaimable >
-			      dirty_info.background_thresh)))
+	    (!laptop_mode && ((sys_info.nr_reclaimable >
+			       sys_info.background_thresh) ||
+			      (memcg_info.nr_reclaimable >
+			       memcg_info.background_thresh))))
 		bdi_start_background_writeback(bdi);
 }
 
@@ -666,8 +714,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
 		dirty_info.dirty_thresh +=
 			dirty_info.dirty_thresh / 10;      /* wheeee... */
 
-                if (global_page_state(NR_UNSTABLE_NFS) +
-		    global_page_state(NR_WRITEBACK) <= dirty_info.dirty_thresh)
+		if (dirty_writeback_pages() <= dirty_info.dirty_thresh)
 			break;
                 congestion_wait(BLK_RW_ASYNC, HZ/10);
 
-- 
1.7.3.1


^ permalink raw reply related	[flat|nested] 75+ messages in thread

* Re: [PATCH v4 08/11] memcg: add dirty limits to mem_cgroup
  2010-10-29  7:09   ` Greg Thelen
@ 2010-10-29  7:41     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 75+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-29  7:41 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes, Wu Fengguang

On Fri, 29 Oct 2010 00:09:11 -0700
Greg Thelen <gthelen@google.com> wrote:

> Extend mem_cgroup to contain dirty page limits.  Also add routines
> allowing the kernel to query the dirty usage of a memcg.
> 
> These interfaces not used by the kernel yet.  A subsequent commit
> will add kernel calls to utilize these new routines.
> 
> Signed-off-by: Greg Thelen <gthelen@google.com>
> Signed-off-by: Andrea Righi <arighi@develer.com>
> ---
> Changelog since v3:
> - Previously memcontrol.c used struct vm_dirty_param and vm_dirty_param() to
>   advertise dirty memory limits.  Now struct dirty_info and
>   mem_cgroup_dirty_info() is used to share dirty limits between memcontrol and
>   the rest of the kernel.
> - __mem_cgroup_has_dirty_limit() now returns false if use_hierarchy is set.

This seems okay for our starting point.  Hierarchy is always a problem.



> - memcg_hierarchical_free_pages() now uses parent_mem_cgroup() and is simpler.
> - created internal routine, __mem_cgroup_has_dirty_limit(), to consolidate the
>   logic.
> 



> Changelog since v1:
> - Rename (for clarity):
>   - mem_cgroup_write_page_stat_item -> mem_cgroup_page_stat_item
>   - mem_cgroup_read_page_stat_item -> mem_cgroup_nr_pages_item
> - Removed unnecessary get_ prefix from get_xxx() functions.
> - Avoid lockdep warnings by using rcu_read_[un]lock() in
>   mem_cgroup_has_dirty_limit().
> 
>  include/linux/memcontrol.h |   30 ++++++
>  mm/memcontrol.c            |  248 +++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 277 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index ef2eec7..736d318 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -19,6 +19,7 @@
>  
>  #ifndef _LINUX_MEMCONTROL_H
>  #define _LINUX_MEMCONTROL_H
> +#include <linux/writeback.h>
>  #include <linux/cgroup.h>
>  struct mem_cgroup;
>  struct page_cgroup;
> @@ -33,6 +34,14 @@ enum mem_cgroup_page_stat_item {
>  	MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
>  };
>  
> +/* Cgroup memory statistics items exported to the kernel. */
> +enum mem_cgroup_nr_pages_item {
> +	MEMCG_NR_DIRTYABLE_PAGES,
> +	MEMCG_NR_RECLAIM_PAGES,
> +	MEMCG_NR_WRITEBACK,
> +	MEMCG_NR_DIRTY_WRITEBACK_PAGES,
> +};
> +
>  extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
>  					struct list_head *dst,
>  					unsigned long *scanned, int order,
> @@ -145,6 +154,11 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
>  	mem_cgroup_update_page_stat(page, idx, -1);
>  }
>  
> +bool mem_cgroup_has_dirty_limit(void);
> +bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
> +			   struct dirty_info *info);
> +s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item);
> +
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  						gfp_t gfp_mask);
>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
> @@ -326,6 +340,22 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
>  {
>  }
>  
> +static inline bool mem_cgroup_has_dirty_limit(void)
> +{
> +	return false;
> +}
> +
> +static inline bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
> +					 struct dirty_info *info)
> +{
> +	return false;
> +}
> +
> +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item)
> +{
> +	return -ENOSYS;
> +}
> +
>  static inline
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  					    gfp_t gfp_mask)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7f91029..52d688d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -188,6 +188,14 @@ struct mem_cgroup_eventfd_list {
>  static void mem_cgroup_threshold(struct mem_cgroup *mem);
>  static void mem_cgroup_oom_notify(struct mem_cgroup *mem);
>  
> +/* Dirty memory parameters */
> +struct vm_dirty_param {
> +	int dirty_ratio;
> +	int dirty_background_ratio;
> +	unsigned long dirty_bytes;
> +	unsigned long dirty_background_bytes;
> +};
> +
>  /*
>   * The memory controller data structure. The memory controller controls both
>   * page cache and RSS per cgroup. We would eventually like to provide
> @@ -233,6 +241,10 @@ struct mem_cgroup {
>  	atomic_t	refcnt;
>  
>  	unsigned int	swappiness;
> +
> +	/* control memory cgroup dirty pages */
> +	struct vm_dirty_param dirty_param;
> +
>  	/* OOM-Killer disable */
>  	int		oom_kill_disable;
>  
> @@ -1132,6 +1144,232 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
>  	return swappiness;
>  }
>  
> +/*
> + * Return true if the current memory cgroup has local dirty memory settings.
> + * There is an allowed race between the current task migrating in-to/out-of the
> + * root cgroup while this routine runs.  So the return value may be incorrect if
> + * the current task is being simultaneously migrated.
> + */
> +static bool __mem_cgroup_has_dirty_limit(struct mem_cgroup *mem)
> +{
> +	if (!mem)
> +		return false;
> +	if (mem_cgroup_is_root(mem))
> +		return false;
> +	/*
> +	 * The current memcg implementation does not yet support hierarchical
> +	 * dirty limits.
> +	 */
> +	if (mem->use_hierarchy)
> +		return false;
> +	return true;
> +}
> +
> +bool mem_cgroup_has_dirty_limit(void)
> +{
> +	struct mem_cgroup *mem;
> +	bool ret;
> +
> +	if (mem_cgroup_disabled())
> +		return false;
> +
> +	rcu_read_lock();
> +	mem = mem_cgroup_from_task(current);
> +	ret = __mem_cgroup_has_dirty_limit(mem);
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
> +/*
> + * Returns a snapshot of the current dirty limits which is not synchronized with
> + * the routines that change the dirty limits.  If this routine races with an
> + * update to the dirty bytes/ratio value, then the caller must handle the case
> + * where both dirty_[background_]_ratio and _bytes are set.
> + */
> +static void __mem_cgroup_dirty_param(struct vm_dirty_param *param,
> +				     struct mem_cgroup *mem)
> +{
> +	if (__mem_cgroup_has_dirty_limit(mem)) {
> +		param->dirty_ratio = mem->dirty_param.dirty_ratio;
> +		param->dirty_bytes = mem->dirty_param.dirty_bytes;
> +		param->dirty_background_ratio =
> +			mem->dirty_param.dirty_background_ratio;
> +		param->dirty_background_bytes =
> +			mem->dirty_param.dirty_background_bytes;
> +	} else {
> +		param->dirty_ratio = vm_dirty_ratio;
> +		param->dirty_bytes = vm_dirty_bytes;
> +		param->dirty_background_ratio = dirty_background_ratio;
> +		param->dirty_background_bytes = dirty_background_bytes;
> +	}
> +}
> +
> +/*
> + * Return the background-writeback and dirty-throttling thresholds as well as
> + * dirty usage metrics.
> + *
> + * The current task may be moved to another cgroup while this routine accesses
> + * the dirty limit.  But a precise check is meaningless because the task can be
> + * moved after our access and writeback tends to take long time.  At least,
> + * "memcg" will not be freed while holding rcu_read_lock().
> + */
> +bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
> +			   struct dirty_info *info)
> +{
> +	s64 available_mem;
> +	struct vm_dirty_param dirty_param;
> +	struct mem_cgroup *memcg;
> +
> +	if (mem_cgroup_disabled())
> +		return false;
> +
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	if (!__mem_cgroup_has_dirty_limit(memcg)) {
> +		rcu_read_unlock();
> +		return false;
> +	}
> +	__mem_cgroup_dirty_param(&dirty_param, memcg);
> +	rcu_read_unlock();

Hmm, don't we need a css_get() on this "memcg"?

> +
> +	available_mem = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
> +	if (available_mem < 0)
> +		return false;
> +
> +	available_mem = min((unsigned long)available_mem, sys_available_mem);
> +
This seems nice.

> +	if (dirty_param.dirty_bytes)
> +		info->dirty_thresh =
> +			DIV_ROUND_UP(dirty_param.dirty_bytes, PAGE_SIZE);
> +	else
> +		info->dirty_thresh =
> +			(dirty_param.dirty_ratio * available_mem) / 100;
> +
> +	if (dirty_param.dirty_background_bytes)
> +		info->background_thresh =
> +			DIV_ROUND_UP(dirty_param.dirty_background_bytes,
> +				     PAGE_SIZE);
> +	else
> +		info->background_thresh =
> +			(dirty_param.dirty_background_ratio *
> +			       available_mem) / 100;
> +

Okay, then these will finally be double-checked against the system's dirty info.
Right?

Thanks,
-Kame

> +	info->nr_reclaimable =
> +		mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
> +	if (info->nr_reclaimable < 0)
> +		return false;
> +
> +	info->nr_writeback = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
> +	if (info->nr_writeback < 0)
> +		return false;
> +
> +	return true;
> +}
> +
> +static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
> +{
> +	if (!do_swap_account)
> +		return nr_swap_pages > 0;
> +	return !memcg->memsw_is_minimum &&
> +		(res_counter_read_u64(&memcg->memsw, RES_LIMIT) > 0);
> +}
> +
> +static s64 mem_cgroup_local_page_stat(struct mem_cgroup *mem,
> +				      enum mem_cgroup_nr_pages_item item)
> +{
> +	s64 ret;
> +
> +	switch (item) {
> +	case MEMCG_NR_DIRTYABLE_PAGES:
> +		ret = mem_cgroup_read_stat(mem, LRU_ACTIVE_FILE) +
> +			mem_cgroup_read_stat(mem, LRU_INACTIVE_FILE);
> +		if (mem_cgroup_can_swap(mem))
> +			ret += mem_cgroup_read_stat(mem, LRU_ACTIVE_ANON) +
> +				mem_cgroup_read_stat(mem, LRU_INACTIVE_ANON);
> +		break;
> +	case MEMCG_NR_RECLAIM_PAGES:
> +		ret = mem_cgroup_read_stat(mem,	MEM_CGROUP_STAT_FILE_DIRTY) +
> +			mem_cgroup_read_stat(mem,
> +					     MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
> +		break;
> +	case MEMCG_NR_WRITEBACK:
> +		ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_WRITEBACK);
> +		break;
> +	case MEMCG_NR_DIRTY_WRITEBACK_PAGES:
> +		ret = mem_cgroup_read_stat(mem,
> +					   MEM_CGROUP_STAT_FILE_WRITEBACK) +
> +			mem_cgroup_read_stat(mem,
> +					     MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
> +		break;
> +	default:
> +		BUG();
> +		break;
> +	}
> +	return ret;
> +}
> +
> +/*
> + * Return the number of pages that the @mem cgroup could allocate.  If
> + * use_hierarchy is set, then this involves parent mem cgroups to find the
> + * cgroup with the smallest free space.
> + */
> +static unsigned long long
> +memcg_hierarchical_free_pages(struct mem_cgroup *mem)
> +{
> +	unsigned long free, min_free;
> +
> +	min_free = global_page_state(NR_FREE_PAGES) << PAGE_SHIFT;
> +
> +	while (mem) {
> +		free = res_counter_read_u64(&mem->res, RES_LIMIT) -
> +			res_counter_read_u64(&mem->res, RES_USAGE);
> +		min_free = min(min_free, free);
> +		mem = parent_mem_cgroup(mem);
> +	}
> +
> +	/* Translate free memory in pages */
> +	return min_free >> PAGE_SHIFT;
> +}
> +
> +/*
> + * mem_cgroup_page_stat() - get memory cgroup file cache statistics
> + * @item:      memory statistic item exported to the kernel
> + *
> + * Return the accounted statistic value or negative value if current task is
> + * root cgroup.
> + */
> +s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item)
> +{
> +	struct mem_cgroup *mem;
> +	struct mem_cgroup *iter;
> +	s64 value;
> +
> +	rcu_read_lock();
> +	mem = mem_cgroup_from_task(current);
> +	if (__mem_cgroup_has_dirty_limit(mem)) {
> +		/*
> +		 * If we're looking for dirtyable pages we need to evaluate
> +		 * free pages depending on the limit and usage of the parents
> +		 * first of all.
> +		 */
> +		if (item == MEMCG_NR_DIRTYABLE_PAGES)
> +			value = memcg_hierarchical_free_pages(mem);
> +		else
> +			value = 0;
> +		/*
> +		 * Recursively evaluate page statistics against all cgroup
> +		 * under hierarchy tree
> +		 */
> +		for_each_mem_cgroup_tree(iter, mem)
> +			value += mem_cgroup_local_page_stat(iter, item);
> +	} else
> +		value = -EINVAL;
> +	rcu_read_unlock();
> +
> +	return value;
> +}
> +
>  static void mem_cgroup_start_move(struct mem_cgroup *mem)
>  {
>  	int cpu;
> @@ -4440,8 +4678,16 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  	spin_lock_init(&mem->reclaim_param_lock);
>  	INIT_LIST_HEAD(&mem->oom_notify);
>  
> -	if (parent)
> +	if (parent) {
>  		mem->swappiness = get_swappiness(parent);
> +		__mem_cgroup_dirty_param(&mem->dirty_param, parent);
> +	} else {
> +		/*
> +		 * The root cgroup dirty_param field is not used, instead,
> +		 * system-wide dirty limits are used.
> +		 */
> +	}
> +
>  	atomic_set(&mem->refcnt, 1);
>  	mem->move_charge_at_immigrate = 0;
>  	mutex_init(&mem->thresholds_lock);
> -- 
> 1.7.3.1
> 
> 


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v4 10/11] memcg: add cgroupfs interface to memcg dirty limits
  2010-10-29  7:09   ` Greg Thelen
@ 2010-10-29  7:43     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 75+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-29  7:43 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes, Wu Fengguang

On Fri, 29 Oct 2010 00:09:13 -0700
Greg Thelen <gthelen@google.com> wrote:

> Add cgroupfs interface to memcg dirty page limits:
>   Direct write-out is controlled with:
>   - memory.dirty_ratio
>   - memory.dirty_limit_in_bytes
> 
>   Background write-out is controlled with:
>   - memory.dirty_background_ratio
>   - memory.dirty_background_limit_bytes
> 
> Other memcg cgroupfs files support 'M', 'm', 'k', 'K', 'g'
> and 'G' suffixes for byte counts.  This patch provides the
> same functionality for memory.dirty_limit_in_bytes and
> memory.dirty_background_limit_bytes.
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>
> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

-Kame


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v4 11/11] memcg: check memcg dirty limits in page writeback
  2010-10-29  7:09   ` Greg Thelen
@ 2010-10-29  7:48     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 75+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-29  7:48 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes, Wu Fengguang

On Fri, 29 Oct 2010 00:09:14 -0700
Greg Thelen <gthelen@google.com> wrote:

> If the current process is in a non-root memcg, then
> balance_dirty_pages() will consider the memcg dirty limits
> as well as the system-wide limits.  This allows different
> cgroups to have distinct dirty limits which trigger direct
> and background writeback at different levels.
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Ideally, I think some comments in the code explaining why we need to double-check
both the system's dirty limit and the memcg's dirty limit would be appreciated.



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v4 05/11] writeback: create dirty_info structure
  2010-10-29  7:09   ` Greg Thelen
@ 2010-10-29  7:50     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 75+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-29  7:50 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes, Wu Fengguang

On Fri, 29 Oct 2010 00:09:08 -0700
Greg Thelen <gthelen@google.com> wrote:

> Bundle dirty limits and dirty memory usage metrics into a dirty_info
> structure to simplify interfaces of routines that need all.
> 
> Signed-off-by: Greg Thelen <gthelen@google.com>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v4 02/11] memcg: document cgroup dirty memory interfaces
  2010-10-29  7:09   ` Greg Thelen
@ 2010-10-29 11:03     ` Wu Fengguang
  -1 siblings, 0 replies; 75+ messages in thread
From: Wu Fengguang @ 2010-10-29 11:03 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim,
	Ciju Rajan K, David Rientjes

Hi Greg,

On Fri, Oct 29, 2010 at 03:09:05PM +0800, Greg Thelen wrote:

> Document cgroup dirty memory interfaces and statistics.
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>
> ---

> +Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim)
> +page cache used by a cgroup.  So, in case of multiple cgroup writers, they will
> +not be able to consume more than their designated share of dirty pages and will
> +be forced to perform write-out if they cross that limit.

It's more pertinent to say "will be throttled", since "perform write-out"
describes implementation behavior that will change soon.

> +- memory.dirty_limit_in_bytes: the amount of dirty memory (expressed in bytes)
> +  in the cgroup at which a process generating dirty pages will start itself
> +  writing out dirty data.  Suffix (k, K, m, M, g, or G) can be used to indicate
> +  that value is kilo, mega or gigabytes.

The suffix feature is handy, thanks! It makes sense to also add this
for the global interfaces, perhaps in a standalone patch.
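
(For illustration only -- not part of the patch: a minimal userspace sketch of
setting a suffixed per-cgroup dirty limit through cgroupfs.  The mount point
and cgroup name below are assumptions about the local setup.)

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* Assumed cgroup mount point and cgroup name. */
	const char *path =
		"/dev/cgroup/memory/example/memory.dirty_limit_in_bytes";
	const char *val = "200M";	/* 'M' suffix: 200 megabytes */
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, val, strlen(val)) < 0)
		perror("write");
	close(fd);
	return 0;
}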

> +A cgroup may contain more dirty memory than its dirty limit.  This is possible
> +because of the principle that the first cgroup to touch a page is charged for
> +it.  Subsequent page counting events (dirty, writeback, nfs_unstable) are also
> +counted to the originally charged cgroup.
> +
> +Example: If page is allocated by a cgroup A task, then the page is charged to
> +cgroup A.  If the page is later dirtied by a task in cgroup B, then the cgroup A
> +dirty count will be incremented.  If cgroup A is over its dirty limit but cgroup
> +B is not, then dirtying a cgroup A page from a cgroup B task may push cgroup A
> +over its dirty limit without throttling the dirtying cgroup B task.

It's good to document the above "misbehavior".  But why not throttle
the dirtying cgroup B task?  Is it simply not implemented, or does it make
no sense to do so at all?

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v4 06/11] memcg: add dirty page accounting infrastructure
  2010-10-29  7:09   ` Greg Thelen
@ 2010-10-29 11:13     ` Wu Fengguang
  -1 siblings, 0 replies; 75+ messages in thread
From: Wu Fengguang @ 2010-10-29 11:13 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim,
	Ciju Rajan K, David Rientjes

On Fri, Oct 29, 2010 at 03:09:09PM +0800, Greg Thelen wrote:

> +
> +	case MEMCG_NR_FILE_DIRTY:
> +		/* Use Test{Set,Clear} to only un/charge the memcg once. */
> +		if (val > 0) {
> +			if (TestSetPageCgroupFileDirty(pc))
> +				val = 0;
> +		} else {
> +			if (!TestClearPageCgroupFileDirty(pc))
> +				val = 0;
> +		}

I'm wondering why TestSet/TestClear and even the cgroup page flags for
dirty/writeback/unstable pages are necessary at all (it would help to
document the reasons in the changelog, if there are any). For example, the
VFS will call TestSetPageDirty() before calling
mem_cgroup_inc_page_stat(MEMCG_NR_FILE_DIRTY), so there should be no
chance of false double counting.

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v4 06/11] memcg: add dirty page accounting infrastructure
  2010-10-29 11:13     ` Wu Fengguang
@ 2010-10-29 11:17       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 75+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-29 11:17 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Greg Thelen, Andrew Morton, linux-kernel, linux-mm, containers,
	Andrea Righi, Balbir Singh, Daisuke Nishimura, Minchan Kim,
	Ciju Rajan K, David Rientjes

On Fri, 29 Oct 2010 19:13:00 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> On Fri, Oct 29, 2010 at 03:09:09PM +0800, Greg Thelen wrote:
> 
> > +
> > +	case MEMCG_NR_FILE_DIRTY:
> > +		/* Use Test{Set,Clear} to only un/charge the memcg once. */
> > +		if (val > 0) {
> > +			if (TestSetPageCgroupFileDirty(pc))
> > +				val = 0;
> > +		} else {
> > +			if (!TestClearPageCgroupFileDirty(pc))
> > +				val = 0;
> > +		}
> 
> I'm wondering why TestSet/TestClear and even the cgroup page flags for
> dirty/writeback/unstable pages are necessary at all (it would help to
> document the reasons in the changelog, if there are any). For example, the
> VFS will call TestSetPageDirty() before calling
> mem_cgroup_inc_page_stat(MEMCG_NR_FILE_DIRTY), so there should be no
> chance of false double counting.
> 

1. The flag is necessary for moving accounting information between cgroups
   when account_move() occurs.

2. TestSet... is required because there is always a race with page_cgroup_lock()'s
   lock bit.
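
(Illustrative userspace sketch only, not the kernel code: it models the
"un/charge the memcg only once" semantics of the Test{Set,Clear} pattern
quoted above; the account_move() aspect is not modelled here.)

#include <stdatomic.h>
#include <stdio.h>

static _Atomic int page_dirty;		/* stands in for PageCgroupFileDirty */
static _Atomic long nr_file_dirty;	/* stands in for the memcg dirty counter */

static void dirty_inc(void)
{
	if (!atomic_exchange(&page_dirty, 1))	/* TestSet */
		atomic_fetch_add(&nr_file_dirty, 1);
}

static void dirty_dec(void)
{
	if (atomic_exchange(&page_dirty, 0))	/* TestClear */
		atomic_fetch_sub(&nr_file_dirty, 1);
}

int main(void)
{
	dirty_inc();
	dirty_inc();	/* second set does not charge the counter again */
	dirty_dec();
	printf("nr_file_dirty = %ld\n", (long)nr_file_dirty);	/* prints 0 */
	return 0;
}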

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v4 08/11] memcg: add dirty limits to mem_cgroup
  2010-10-29  7:41     ` KAMEZAWA Hiroyuki
@ 2010-10-29 16:00       ` Greg Thelen
  -1 siblings, 0 replies; 75+ messages in thread
From: Greg Thelen @ 2010-10-29 16:00 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes, Wu Fengguang

KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:

> On Fri, 29 Oct 2010 00:09:11 -0700
> Greg Thelen <gthelen@google.com> wrote:
>
>> Extend mem_cgroup to contain dirty page limits.  Also add routines
>> allowing the kernel to query the dirty usage of a memcg.
>> 
>> These interfaces not used by the kernel yet.  A subsequent commit
>> will add kernel calls to utilize these new routines.
>> 
>> Signed-off-by: Greg Thelen <gthelen@google.com>
>> Signed-off-by: Andrea Righi <arighi@develer.com>
>> ---
>> Changelog since v3:
>> - Previously memcontrol.c used struct vm_dirty_param and vm_dirty_param() to
>>   advertise dirty memory limits.  Now struct dirty_info and
>>   mem_cgroup_dirty_info() is used to share dirty limits between memcontrol and
>>   the rest of the kernel.
>> - __mem_cgroup_has_dirty_limit() now returns false if use_hierarchy is set.
>
> This seems Okay for our starting point. Hierarchy is always problem..
>
>
>
>> - memcg_hierarchical_free_pages() now uses parent_mem_cgroup() and is simpler.
>> - created internal routine, __mem_cgroup_has_dirty_limit(), to consolidate the
>>   logic.
>> 
>
>
>
>> Changelog since v1:
>> - Rename (for clarity):
>>   - mem_cgroup_write_page_stat_item -> mem_cgroup_page_stat_item
>>   - mem_cgroup_read_page_stat_item -> mem_cgroup_nr_pages_item
>> - Removed unnecessary get_ prefix from get_xxx() functions.
>> - Avoid lockdep warnings by using rcu_read_[un]lock() in
>>   mem_cgroup_has_dirty_limit().
>> 
>>  include/linux/memcontrol.h |   30 ++++++
>>  mm/memcontrol.c            |  248 +++++++++++++++++++++++++++++++++++++++++++-
>>  2 files changed, 277 insertions(+), 1 deletions(-)
>> 
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index ef2eec7..736d318 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -19,6 +19,7 @@
>>  
>>  #ifndef _LINUX_MEMCONTROL_H
>>  #define _LINUX_MEMCONTROL_H
>> +#include <linux/writeback.h>
>>  #include <linux/cgroup.h>
>>  struct mem_cgroup;
>>  struct page_cgroup;
>> @@ -33,6 +34,14 @@ enum mem_cgroup_page_stat_item {
>>  	MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
>>  };
>>  
>> +/* Cgroup memory statistics items exported to the kernel. */
>> +enum mem_cgroup_nr_pages_item {
>> +	MEMCG_NR_DIRTYABLE_PAGES,
>> +	MEMCG_NR_RECLAIM_PAGES,
>> +	MEMCG_NR_WRITEBACK,
>> +	MEMCG_NR_DIRTY_WRITEBACK_PAGES,
>> +};
>> +
>>  extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
>>  					struct list_head *dst,
>>  					unsigned long *scanned, int order,
>> @@ -145,6 +154,11 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
>>  	mem_cgroup_update_page_stat(page, idx, -1);
>>  }
>>  
>> +bool mem_cgroup_has_dirty_limit(void);
>> +bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
>> +			   struct dirty_info *info);
>> +s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item);
>> +
>>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>>  						gfp_t gfp_mask);
>>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
>> @@ -326,6 +340,22 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
>>  {
>>  }
>>  
>> +static inline bool mem_cgroup_has_dirty_limit(void)
>> +{
>> +	return false;
>> +}
>> +
>> +static inline bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
>> +					 struct dirty_info *info)
>> +{
>> +	return false;
>> +}
>> +
>> +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item)
>> +{
>> +	return -ENOSYS;
>> +}
>> +
>>  static inline
>>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>>  					    gfp_t gfp_mask)
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 7f91029..52d688d 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -188,6 +188,14 @@ struct mem_cgroup_eventfd_list {
>>  static void mem_cgroup_threshold(struct mem_cgroup *mem);
>>  static void mem_cgroup_oom_notify(struct mem_cgroup *mem);
>>  
>> +/* Dirty memory parameters */
>> +struct vm_dirty_param {
>> +	int dirty_ratio;
>> +	int dirty_background_ratio;
>> +	unsigned long dirty_bytes;
>> +	unsigned long dirty_background_bytes;
>> +};
>> +
>>  /*
>>   * The memory controller data structure. The memory controller controls both
>>   * page cache and RSS per cgroup. We would eventually like to provide
>> @@ -233,6 +241,10 @@ struct mem_cgroup {
>>  	atomic_t	refcnt;
>>  
>>  	unsigned int	swappiness;
>> +
>> +	/* control memory cgroup dirty pages */
>> +	struct vm_dirty_param dirty_param;
>> +
>>  	/* OOM-Killer disable */
>>  	int		oom_kill_disable;
>>  
>> @@ -1132,6 +1144,232 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
>>  	return swappiness;
>>  }
>>  
>> +/*
>> + * Return true if the current memory cgroup has local dirty memory settings.
>> + * There is an allowed race between the current task migrating in-to/out-of the
>> + * root cgroup while this routine runs.  So the return value may be incorrect if
>> + * the current task is being simultaneously migrated.
>> + */
>> +static bool __mem_cgroup_has_dirty_limit(struct mem_cgroup *mem)
>> +{
>> +	if (!mem)
>> +		return false;
>> +	if (mem_cgroup_is_root(mem))
>> +		return false;
>> +	/*
>> +	 * The current memcg implementation does not yet support hierarchical
>> +	 * dirty limits.
>> +	 */
>> +	if (mem->use_hierarchy)
>> +		return false;
>> +	return true;
>> +}
>> +
>> +bool mem_cgroup_has_dirty_limit(void)
>> +{
>> +	struct mem_cgroup *mem;
>> +	bool ret;
>> +
>> +	if (mem_cgroup_disabled())
>> +		return false;
>> +
>> +	rcu_read_lock();
>> +	mem = mem_cgroup_from_task(current);
>> +	ret = __mem_cgroup_has_dirty_limit(mem);
>> +	rcu_read_unlock();
>> +
>> +	return ret;
>> +}
>> +
>> +/*
>> + * Returns a snapshot of the current dirty limits which is not synchronized with
>> + * the routines that change the dirty limits.  If this routine races with an
>> + * update to the dirty bytes/ratio value, then the caller must handle the case
>> + * where both dirty_[background_]_ratio and _bytes are set.
>> + */
>> +static void __mem_cgroup_dirty_param(struct vm_dirty_param *param,
>> +				     struct mem_cgroup *mem)
>> +{
>> +	if (__mem_cgroup_has_dirty_limit(mem)) {
>> +		param->dirty_ratio = mem->dirty_param.dirty_ratio;
>> +		param->dirty_bytes = mem->dirty_param.dirty_bytes;
>> +		param->dirty_background_ratio =
>> +			mem->dirty_param.dirty_background_ratio;
>> +		param->dirty_background_bytes =
>> +			mem->dirty_param.dirty_background_bytes;
>> +	} else {
>> +		param->dirty_ratio = vm_dirty_ratio;
>> +		param->dirty_bytes = vm_dirty_bytes;
>> +		param->dirty_background_ratio = dirty_background_ratio;
>> +		param->dirty_background_bytes = dirty_background_bytes;
>> +	}
>> +}
>> +
>> +/*
>> + * Return the background-writeback and dirty-throttling thresholds as well as
>> + * dirty usage metrics.
>> + *
>> + * The current task may be moved to another cgroup while this routine accesses
>> + * the dirty limit.  But a precise check is meaningless because the task can be
>> + * moved after our access and writeback tends to take long time.  At least,
>> + * "memcg" will not be freed while holding rcu_read_lock().
>> + */
>> +bool mem_cgroup_dirty_info(unsigned long sys_available_mem,
>> +			   struct dirty_info *info)
>> +{
>> +	s64 available_mem;
>> +	struct vm_dirty_param dirty_param;
>> +	struct mem_cgroup *memcg;
>> +
>> +	if (mem_cgroup_disabled())
>> +		return false;
>> +
>> +	rcu_read_lock();
>> +	memcg = mem_cgroup_from_task(current);
>> +	if (!__mem_cgroup_has_dirty_limit(memcg)) {
>> +		rcu_read_unlock();
>> +		return false;
>> +	}
>> +	__mem_cgroup_dirty_param(&dirty_param, memcg);
>> +	rcu_read_unlock();
>
> Hmm, don't we need to get css_get() for this "memcg" ?

The memcg variable is not used again after rcu_read_unlock(); it is only
used in this routine while holding rcu_read_lock().  The
mem_cgroup_page_stat() calls (below) look up the memcg from the current
task again, so I do not think css_get() is needed.

>> +
>> +	available_mem = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
>> +	if (available_mem < 0)
>> +		return false;
>> +
>> +	available_mem = min((unsigned long)available_mem, sys_available_mem);
>> +
> This seems nice.
>
>> +	if (dirty_param.dirty_bytes)
>> +		info->dirty_thresh =
>> +			DIV_ROUND_UP(dirty_param.dirty_bytes, PAGE_SIZE);
>> +	else
>> +		info->dirty_thresh =
>> +			(dirty_param.dirty_ratio * available_mem) / 100;
>> +
>> +	if (dirty_param.dirty_background_bytes)
>> +		info->background_thresh =
>> +			DIV_ROUND_UP(dirty_param.dirty_background_bytes,
>> +				     PAGE_SIZE);
>> +	else
>> +		info->background_thresh =
>> +			(dirty_param.dirty_background_ratio *
>> +			       available_mem) / 100;
>> +
>
> Okay, then these will be finally double-checked with system's dirty-info.
> Right ?

balance_dirty_pages() calls both global_dirty_info() and
memcg_dirty_info() to determine dirty limits and usage for both the
system and the current memcg.  Both the system and memcg limits are
checked by balance_dirty_pages().
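
(A rough, self-contained sketch of that control flow -- not the actual patch
hunk.  The helper names follow the discussion above, and the stubbed numbers
are invented purely for illustration.)

#include <stdbool.h>
#include <stdio.h>

struct dirty_info {
	unsigned long dirty_thresh;
	unsigned long background_thresh;
	unsigned long nr_reclaimable;
	unsigned long nr_writeback;
};

/* Stand-ins for the helpers mentioned above; the real kernel versions
 * fill these from VM counters and the memcg statistics. */
static void global_dirty_info(struct dirty_info *info)
{
	*info = (struct dirty_info){ 4000, 2000, 1500, 100 };
}

static bool memcg_dirty_info(struct dirty_info *info)
{
	*info = (struct dirty_info){ 800, 400, 900, 50 };
	return true;	/* false would mean: no per-memcg dirty limit */
}

static bool over_thresh(const struct dirty_info *info)
{
	return info->nr_reclaimable + info->nr_writeback > info->dirty_thresh;
}

int main(void)
{
	struct dirty_info sys, memcg;
	bool throttle = false;

	global_dirty_info(&sys);
	throttle |= over_thresh(&sys);		/* system-wide limit */
	if (memcg_dirty_info(&memcg))
		throttle |= over_thresh(&memcg);	/* per-memcg limit */

	printf("throttle = %d\n", throttle);	/* 1: memcg limit exceeded */
	return 0;
}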

> Thanks,
> -Kame
>
>> +	info->nr_reclaimable =
>> +		mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
>> +	if (info->nr_reclaimable < 0)
>> +		return false;
>> +
>> +	info->nr_writeback = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
>> +	if (info->nr_writeback < 0)
>> +		return false;
>> +
>> +	return true;
>> +}
>> +
>> +static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
>> +{
>> +	if (!do_swap_account)
>> +		return nr_swap_pages > 0;
>> +	return !memcg->memsw_is_minimum &&
>> +		(res_counter_read_u64(&memcg->memsw, RES_LIMIT) > 0);
>> +}
>> +
>> +static s64 mem_cgroup_local_page_stat(struct mem_cgroup *mem,
>> +				      enum mem_cgroup_nr_pages_item item)
>> +{
>> +	s64 ret;
>> +
>> +	switch (item) {
>> +	case MEMCG_NR_DIRTYABLE_PAGES:
>> +		ret = mem_cgroup_read_stat(mem, LRU_ACTIVE_FILE) +
>> +			mem_cgroup_read_stat(mem, LRU_INACTIVE_FILE);
>> +		if (mem_cgroup_can_swap(mem))
>> +			ret += mem_cgroup_read_stat(mem, LRU_ACTIVE_ANON) +
>> +				mem_cgroup_read_stat(mem, LRU_INACTIVE_ANON);
>> +		break;
>> +	case MEMCG_NR_RECLAIM_PAGES:
>> +		ret = mem_cgroup_read_stat(mem,	MEM_CGROUP_STAT_FILE_DIRTY) +
>> +			mem_cgroup_read_stat(mem,
>> +					     MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
>> +		break;
>> +	case MEMCG_NR_WRITEBACK:
>> +		ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_WRITEBACK);
>> +		break;
>> +	case MEMCG_NR_DIRTY_WRITEBACK_PAGES:
>> +		ret = mem_cgroup_read_stat(mem,
>> +					   MEM_CGROUP_STAT_FILE_WRITEBACK) +
>> +			mem_cgroup_read_stat(mem,
>> +					     MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
>> +		break;
>> +	default:
>> +		BUG();
>> +		break;
>> +	}
>> +	return ret;
>> +}
>> +
>> +/*
>> + * Return the number of pages that the @mem cgroup could allocate.  If
>> + * use_hierarchy is set, then this involves parent mem cgroups to find the
>> + * cgroup with the smallest free space.
>> + */
>> +static unsigned long long
>> +memcg_hierarchical_free_pages(struct mem_cgroup *mem)
>> +{
>> +	unsigned long free, min_free;
>> +
>> +	min_free = global_page_state(NR_FREE_PAGES) << PAGE_SHIFT;
>> +
>> +	while (mem) {
>> +		free = res_counter_read_u64(&mem->res, RES_LIMIT) -
>> +			res_counter_read_u64(&mem->res, RES_USAGE);
>> +		min_free = min(min_free, free);
>> +		mem = parent_mem_cgroup(mem);
>> +	}
>> +
>> +	/* Translate free memory in pages */
>> +	return min_free >> PAGE_SHIFT;
>> +}
>> +
>> +/*
>> + * mem_cgroup_page_stat() - get memory cgroup file cache statistics
>> + * @item:      memory statistic item exported to the kernel
>> + *
>> + * Return the accounted statistic value or negative value if current task is
>> + * root cgroup.
>> + */
>> +s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item)
>> +{
>> +	struct mem_cgroup *mem;
>> +	struct mem_cgroup *iter;
>> +	s64 value;
>> +
>> +	rcu_read_lock();
>> +	mem = mem_cgroup_from_task(current);
>> +	if (__mem_cgroup_has_dirty_limit(mem)) {
>> +		/*
>> +		 * If we're looking for dirtyable pages we need to evaluate
>> +		 * free pages depending on the limit and usage of the parents
>> +		 * first of all.
>> +		 */
>> +		if (item == MEMCG_NR_DIRTYABLE_PAGES)
>> +			value = memcg_hierarchical_free_pages(mem);
>> +		else
>> +			value = 0;
>> +		/*
>> +		 * Recursively evaluate page statistics against all cgroup
>> +		 * under hierarchy tree
>> +		 */
>> +		for_each_mem_cgroup_tree(iter, mem)
>> +			value += mem_cgroup_local_page_stat(iter, item);
>> +	} else
>> +		value = -EINVAL;
>> +	rcu_read_unlock();
>> +
>> +	return value;
>> +}
>> +
>>  static void mem_cgroup_start_move(struct mem_cgroup *mem)
>>  {
>>  	int cpu;
>> @@ -4440,8 +4678,16 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>>  	spin_lock_init(&mem->reclaim_param_lock);
>>  	INIT_LIST_HEAD(&mem->oom_notify);
>>  
>> -	if (parent)
>> +	if (parent) {
>>  		mem->swappiness = get_swappiness(parent);
>> +		__mem_cgroup_dirty_param(&mem->dirty_param, parent);
>> +	} else {
>> +		/*
>> +		 * The root cgroup dirty_param field is not used, instead,
>> +		 * system-wide dirty limits are used.
>> +		 */
>> +	}
>> +
>>  	atomic_set(&mem->refcnt, 1);
>>  	mem->move_charge_at_immigrate = 0;
>>  	mutex_init(&mem->thresholds_lock);
>> -- 
>> 1.7.3.1
>> 
>> 


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v4 11/11] memcg: check memcg dirty limits in page writeback
  2010-10-29  7:48     ` KAMEZAWA Hiroyuki
@ 2010-10-29 16:06       ` Greg Thelen
  -1 siblings, 0 replies; 75+ messages in thread
From: Greg Thelen @ 2010-10-29 16:06 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes, Wu Fengguang

KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:

> On Fri, 29 Oct 2010 00:09:14 -0700
> Greg Thelen <gthelen@google.com> wrote:
>
>> If the current process is in a non-root memcg, then
>> balance_dirty_pages() will consider the memcg dirty limits
>> as well as the system-wide limits.  This allows different
>> cgroups to have distinct dirty limits which trigger direct
>> and background writeback at different levels.
>> 
>> Signed-off-by: Andrea Righi <arighi@develer.com>
>> Signed-off-by: Greg Thelen <gthelen@google.com>
>
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> Ideally, I think some comments in the code explaining why we need to
> double-check both the system's dirty limit and the memcg's dirty limit
> would be appreciated.

I will add to the balance_dirty_pages() comment.  It will read:
/*
 * balance_dirty_pages() must be called by processes which are generating dirty
 * data.  It looks at the number of dirty pages in the machine and will force
 * the caller to perform writeback if the system is over `vm_dirty_ratio'.
 * If we're over `background_thresh' then the writeback threads are woken to
 * perform some writeout.  The current task may have per-memcg dirty
 * limits, which are also checked.
 */

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v4 00/11] memcg: per cgroup dirty page accounting
  2010-10-29  7:09 ` Greg Thelen
@ 2010-10-29 20:19   ` Andrew Morton
  -1 siblings, 0 replies; 75+ messages in thread
From: Andrew Morton @ 2010-10-29 20:19 UTC (permalink / raw)
  To: Greg Thelen
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes, Wu Fengguang

On Fri, 29 Oct 2010 00:09:03 -0700
Greg Thelen <gthelen@google.com> wrote:

This is cool stuff - it's been a long haul.  One day we'll be
nearly-finished and someone will write a book telling people how to use
it all and lots of people will go "holy crap".  I hope.

> Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim)
> page cache used by a cgroup.  So, in case of multiple cgroup writers, they will
> not be able to consume more than their designated share of dirty pages and will
> be forced to perform write-out if they cross that limit.
> 
> The patches are based on a series proposed by Andrea Righi in Mar 2010.
> 
> Overview:
> - Add page_cgroup flags to record when pages are dirty, in writeback, or nfs
>   unstable.
> 
> - Extend mem_cgroup to record the total number of pages in each of the 
>   interesting dirty states (dirty, writeback, unstable_nfs).  
> 
> - Add dirty parameters similar to the system-wide  /proc/sys/vm/dirty_*
>   limits to mem_cgroup.  The mem_cgroup dirty parameters are accessible
>   via cgroupfs control files.

Curious minds will want to know what the default values are set to and
how they were determined.

> - Consider both system and per-memcg dirty limits in page writeback when
>   deciding to queue background writeback or block for foreground writeback.
> 
> Known shortcomings:
> - When a cgroup dirty limit is exceeded, then bdi writeback is employed to
>   writeback dirty inodes.  Bdi writeback considers inodes from any cgroup, not
>   just inodes contributing dirty pages to the cgroup exceeding its limit.  

yup.  Some broader discussion of the implications of this shortcoming
is needed.  I'm not sure where it would be placed, though. 
Documentation/ for now, until you write that book.

> - When memory.use_hierarchy is set, then dirty limits are disabled.  This is a
>   implementation detail.

So this is unintentional, and forced upon us by the present implementation?

>  An enhanced implementation is needed to check the
>   chain of parents to ensure that no dirty limit is exceeded.

How important is it that this be fixed?

And how feasible would that fix be?  A linear walk up the hierarchy
list?  More than that?

> Performance data:
> - A page fault microbenchmark workload was used to measure performance, which
>   can be called in read or write mode:
>         f = open(foo. $cpu)
>         truncate(f, 4096)
>         alarm(60)
>         while (1) {
>                 p = mmap(f, 4096)
>                 if (write)
> 			*p = 1
> 		else
> 			x = *p
>                 munmap(p)
>         }
> 
> - The workload was called for several points in the patch series in different
>   modes:
>   - s_read is a single threaded reader
>   - s_write is a single threaded writer
>   - p_read is a 16 thread reader, each operating on a different file
>   - p_write is a 16 thread writer, each operating on a different file
> 
> - Measurements were collected on a 16 core non-numa system using "perf stat
>   --repeat 3".  The -a option was used for parallel (p_*) runs.
> 
> - All numbers are page fault rate (M/sec).  Higher is better.
> 
> - To compare the performance of a kernel without memcg, compare the first and
>   last rows; neither has memcg configured.  The first row does not include any
>   of these memcg patches.
> 
> - To compare the performance of using memcg dirty limits, compare the baseline
>   (2nd row titled "w/ memcg") with the code and memcg enabled (2nd to last
>   row titled "all patches").
> 
>                            root_cgroup                    child_cgroup
>                  s_read s_write p_read p_write   s_read s_write p_read p_write
> mmotm w/o memcg   0.428  0.390   0.429  0.388
> mmotm w/ memcg    0.411  0.378   0.391  0.362     0.412  0.377   0.385  0.363
> all patches       0.384  0.360   0.370  0.348     0.381  0.363   0.368  0.347
> all patches       0.431  0.402   0.427  0.395
>   w/o memcg

afaict this benchmark has demonstrated that the changes do not cause an
appreciable performance regression in terms of CPU loading, yes?

Can we come up with any tests which demonstrate the _benefits_ of the
feature?


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v4 02/11] memcg: document cgroup dirty memory interfaces
  2010-10-29  7:09   ` Greg Thelen
@ 2010-10-29 20:19     ` Andrew Morton
  -1 siblings, 0 replies; 75+ messages in thread
From: Andrew Morton @ 2010-10-29 20:19 UTC (permalink / raw)
  To: Greg Thelen
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes, Wu Fengguang

On Fri, 29 Oct 2010 00:09:05 -0700
Greg Thelen <gthelen@google.com> wrote:

> Document cgroup dirty memory interfaces and statistics.
> 
>
> ...
>
> +When use_hierarchy=0, each cgroup has dirty memory usage and limits.
> +System-wide dirty limits are also consulted.  Dirty memory consumption is
> +checked against both system-wide and per-cgroup dirty limits.
> +
> +The current implementation does enforce per-cgroup dirty limits when

"does not", I trust.

> +use_hierarchy=1.  System-wide dirty limits are used for processes in such
> +cgroups.  Attempts to read memory.dirty_* files return the system-wide values.
> +Writes to the memory.dirty_* files return error.  An enhanced implementation is
> +needed to check the chain of parents to ensure that no dirty limit is exceeded.
> +
>  6. Hierarchy support
>  
>  The memory controller supports a deep hierarchy and hierarchical accounting.
> -- 
> 1.7.3.1


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v4 09/11] memcg: CPU hotplug lockdep warning fix
  2010-10-29  7:09   ` Greg Thelen
@ 2010-10-29 20:19     ` Andrew Morton
  -1 siblings, 0 replies; 75+ messages in thread
From: Andrew Morton @ 2010-10-29 20:19 UTC (permalink / raw)
  To: Greg Thelen
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes, Wu Fengguang

On Fri, 29 Oct 2010 00:09:12 -0700
Greg Thelen <gthelen@google.com> wrote:

> From: Balbir Singh <balbir@linux.vnet.ibm.com>
> 
> memcg has lockdep warnings (sleep inside rcu lock)
> >
> ...
>
> Acked-by: Greg Thelen <gthelen@google.com>

You were on the patch delivery path, so this should be Signed-off-by:. 
I made that change to my copy.

> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>



^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v4 02/11] memcg: document cgroup dirty memory interfaces
  2010-10-29 11:03     ` Wu Fengguang
@ 2010-10-29 21:35       ` Greg Thelen
  -1 siblings, 0 replies; 75+ messages in thread
From: Greg Thelen @ 2010-10-29 21:35 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim,
	Ciju Rajan K, David Rientjes

Wu Fengguang <fengguang.wu@intel.com> writes:

> Hi Greg,
>
> On Fri, Oct 29, 2010 at 03:09:05PM +0800, Greg Thelen wrote:
>
>> Document cgroup dirty memory interfaces and statistics.
>> 
>> Signed-off-by: Andrea Righi <arighi@develer.com>
>> Signed-off-by: Greg Thelen <gthelen@google.com>
>> ---
>
>> +Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim)
>> +page cache used by a cgroup.  So, in case of multiple cgroup writers, they will
>> +not be able to consume more than their designated share of dirty pages and will
>> +be forced to perform write-out if they cross that limit.
>
> It's more pertinent to say "will be throttled", as "perform write-out"
> is some implementation behavior that will change soon. 

Good point.  I will reword the docs to be less specific about
where the write-out occurs.  The important point is that the writer is
throttled.

>> +- memory.dirty_limit_in_bytes: the amount of dirty memory (expressed in bytes)
>> +  in the cgroup at which a process generating dirty pages will start itself
>> +  writing out dirty data.  Suffix (k, K, m, M, g, or G) can be used to indicate
>> +  that value is kilo, mega or gigabytes.
>
> The suffix feature is handy, thanks! It makes sense to also add this
> for the global interfaces, perhaps in a standalone patch.

I agree that this would also be useful for the global interfaces.  I
will submit an independent patch for the global interfaces.
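
For reference, the parsing side is little more than a call to the
kernel's existing memparse() helper, which already understands the
[kKmMgG] suffixes.  A cgroupfs write handler can do roughly the
following (the handler name and the store step are illustrative only,
not the actual patch code):

static int mem_cgroup_dirty_limit_write(struct cgroup *cgrp,
					struct cftype *cft,
					const char *buffer)
{
	unsigned long long val;
	char *end;

	val = memparse(buffer, &end);		/* "400M" -> 419430400 */
	if (*end != '\0' && *end != '\n')
		return -EINVAL;
	/* ...range-check val and store it in the memcg dirty params... */
	return 0;
}

The global /proc/sys/vm files will need their own treatment, since they
go through the sysctl handlers today.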

>> +A cgroup may contain more dirty memory than its dirty limit.  This is possible
>> +because of the principle that the first cgroup to touch a page is charged for
>> +it.  Subsequent page counting events (dirty, writeback, nfs_unstable) are also
>> +counted to the originally charged cgroup.
>> +
>> +Example: If page is allocated by a cgroup A task, then the page is charged to
>> +cgroup A.  If the page is later dirtied by a task in cgroup B, then the cgroup A
>> +dirty count will be incremented.  If cgroup A is over its dirty limit but cgroup
>> +B is not, then dirtying a cgroup A page from a cgroup B task may push cgroup A
>> +over its dirty limit without throttling the dirtying cgroup B task.
>
> It's good to document the above "misbehavior". But why not throttling
> the dirtying cgroup B task? Is it simply not implemented or makes no
> sense to do so at all?

Ideally cgroup B would be throttled.  Note, even with this misbehavior,
the system dirty limit will keep cgroup B from exceeding system-wide
limits.

The challenge here is that the current system increments dirty counters
using account_page_dirtied(), which does not immediately check against
dirty limits.  Later, balance_dirty_pages() checks whether any limits
were exceeded, but only after a batch of pages may have been dirtied.
The task may have written many pages in many different memcg, so the
set of memcg that would need checking for the mapping may be large.  I
do not like this approach.

memcontrol.c can easily detect when memcg other than the current task's
memcg is charged for a dirty page.  It does not record this today, but
it could.  When such a foreign page dirty event occurs the associated
memcg could be linked into the dirtying address_space so that
balance_dirty_pages() could check the limits of all foreign memcg.  In
the common case I think the task is dirtying pages that have been
charged to the task's cgroup, so the address_space's foreign_memcg list
would be empty.  But when such foreign memcg are dirtied
balance_dirty_pages() would have access to references to all memcg that
need dirty limits checking.  This approach might work.  Comments?
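
To make the idea a bit more concrete, I am thinking of something along
these lines (everything below is made up for discussion; the struct,
the new address_space field, and the helper do not exist anywhere):

struct memcg_foreign_ref {
	struct list_head	list;
	struct mem_cgroup	*memcg;
};

/*
 * address_space would grow a (normally empty) foreign_memcg list of
 * memcg that were charged for pages dirtied through this mapping.
 */
static void check_foreign_memcg_limits(struct address_space *mapping)
{
	struct memcg_foreign_ref *ref;

	list_for_each_entry(ref, &mapping->foreign_memcg, list) {
		/*
		 * Compare ref->memcg's dirty usage against its dirty
		 * limits; throttle or kick writeback if needed.
		 */
	}
}

balance_dirty_pages() already has the mapping, so the common case (an
empty list) would cost almost nothing.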

> Thanks,
> Fengguang


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v4 02/11] memcg: document cgroup dirty memory interfaces
  2010-10-29 20:19     ` Andrew Morton
@ 2010-10-29 21:37       ` Greg Thelen
  -1 siblings, 0 replies; 75+ messages in thread
From: Greg Thelen @ 2010-10-29 21:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes, Wu Fengguang

Andrew Morton <akpm@linux-foundation.org> writes:

> On Fri, 29 Oct 2010 00:09:05 -0700
> Greg Thelen <gthelen@google.com> wrote:
>
>> Document cgroup dirty memory interfaces and statistics.
>> 
>>
>> ...
>>
>> +When use_hierarchy=0, each cgroup has dirty memory usage and limits.
>> +System-wide dirty limits are also consulted.  Dirty memory consumption is
>> +checked against both system-wide and per-cgroup dirty limits.
>> +
>> +The current implementation does enforce per-cgroup dirty limits when
>
> "does not", I trust.

Correct.  Thanks.

>> +use_hierarchy=1.  System-wide dirty limits are used for processes in such
>> +cgroups.  Attempts to read memory.dirty_* files return the system-wide values.
>> +Writes to the memory.dirty_* files return error.  An enhanced implementation is
>> +needed to check the chain of parents to ensure that no dirty limit is exceeded.
>> +
>>  6. Hierarchy support
>>  
>>  The memory controller supports a deep hierarchy and hierarchical accounting.
>> -- 
>> 1.7.3.1
>

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v4 02/11] memcg: document cgroup dirty memory interfaces
  2010-10-29 21:35       ` Greg Thelen
@ 2010-10-30  3:02         ` Wu Fengguang
  -1 siblings, 0 replies; 75+ messages in thread
From: Wu Fengguang @ 2010-10-30  3:02 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim,
	Ciju Rajan K, David Rientjes

On Sat, Oct 30, 2010 at 05:35:50AM +0800, Greg Thelen wrote:
> >> +A cgroup may contain more dirty memory than its dirty limit.  This is possible
> >> +because of the principle that the first cgroup to touch a page is charged for
> >> +it.  Subsequent page counting events (dirty, writeback, nfs_unstable) are also
> >> +counted to the originally charged cgroup.
> >> +
> >> +Example: If page is allocated by a cgroup A task, then the page is charged to
> >> +cgroup A.  If the page is later dirtied by a task in cgroup B, then the cgroup A
> >> +dirty count will be incremented.  If cgroup A is over its dirty limit but cgroup
> >> +B is not, then dirtying a cgroup A page from a cgroup B task may push cgroup A
> >> +over its dirty limit without throttling the dirtying cgroup B task.
> >
> > It's good to document the above "misbehavior". But why not throttling
> > the dirtying cgroup B task? Is it simply not implemented or makes no
> > sense to do so at all?
> 
> Ideally cgroup B would be throttled.  Note, even with this misbehavior,
> the system dirty limit will keep cgroup B from exceeding system-wide
> limits.

Yeah. And I'm OK with the current behavior, since
1) it does not impact the global limits
2) the common memcg usage (the workload you care about) does not seem
   to share pages between memcg's a lot

So I'm OK with improving it in the future when the need arises.

> The challenge here is that when the current system increments dirty
> counters using account_page_dirtied() which does not immediately check
> against dirty limits.  Later balance_dirty_pages() checks to see if any
> limits were exceeded, but only after a batch of pages may have been
> dirtied.  The task may have written many pages in many different memcg.
> So checking all possible memcg that may have been written in the mapping
> may be a large set.  I do not like this approach.

Me too.

> memcontrol.c can easily detect when memcg other than the current task's
> memcg is charged for a dirty page.  It does not record this today, but
> it could.  When such a foreign page dirty event occurs the associated
> memcg could be linked into the dirtying address_space so that
> balance_dirty_pages() could check the limits of all foreign memcg.  In
> the common case I think the task is dirtying pages that have been
> charged to the task's cgroup, so the address_space's foreign_memcg list
> would be empty.  But when such foreign memcg are dirtied
> balance_dirty_pages() would have access to references to all memcg that
> need dirty limits checking.  This approach might work.  Comments?

It still introduces the complexity of maintaining the foreign memcg <=>
task mutual links.

Another approach may be to add a parameter "struct page *page" to
balance_dirty_pages(). Then balance_dirty_pages() can check the memcg
that is associated with the _current_ dirtied page. It may not catch
all foreign memcg's, but it should work fine with good probability
without introducing a new data structure.
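
Roughly like this (illustrative only; the two mem_cgroup_* calls at the
bottom are hypothetical names, not existing functions):

static void balance_dirty_pages(struct address_space *mapping,
				struct page *page,
				unsigned long pages_dirtied)
{
	struct mem_cgroup *memcg;

	/*
	 * ...the existing checks against the global limits and against
	 * the current task's memcg stay as they are...
	 */

	/* New: also check the memcg that was charged for @page. */
	memcg = try_get_mem_cgroup_from_page(page);
	if (memcg) {
		mem_cgroup_check_dirty_limits(memcg);	/* hypothetical */
		mem_cgroup_put_ref(memcg);		/* hypothetical */
	}
}

The callers would just pass down the page they already have through
balance_dirty_pages_ratelimited_nr().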

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v4 00/11] memcg: per cgroup dirty page accounting
  2010-10-29 20:19   ` Andrew Morton
@ 2010-10-30 21:46     ` Greg Thelen
  -1 siblings, 0 replies; 75+ messages in thread
From: Greg Thelen @ 2010-10-30 21:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes, Wu Fengguang

Andrew Morton <akpm@linux-foundation.org> writes:

> On Fri, 29 Oct 2010 00:09:03 -0700
> Greg Thelen <gthelen@google.com> wrote:
>
> This is cool stuff - it's been a long haul.  One day we'll be
> nearly-finished and someone will write a book telling people how to use
> it all and lots of people will go "holy crap".  I hope.
>
>> Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim)
>> page cache used by a cgroup.  So, in case of multiple cgroup writers, they will
>> not be able to consume more than their designated share of dirty pages and will
>> be forced to perform write-out if they cross that limit.
>> 
>> The patches are based on a series proposed by Andrea Righi in Mar 2010.
>> 
>> Overview:
>> - Add page_cgroup flags to record when pages are dirty, in writeback, or nfs
>>   unstable.
>> 
>> - Extend mem_cgroup to record the total number of pages in each of the 
>>   interesting dirty states (dirty, writeback, unstable_nfs).  
>> 
>> - Add dirty parameters similar to the system-wide  /proc/sys/vm/dirty_*
>>   limits to mem_cgroup.  The mem_cgroup dirty parameters are accessible
>>   via cgroupfs control files.
>
> Curious minds will want to know what the default values are set to and
> how they were determined.

When a memcg is created, its dirty limits are set to a copy of the
parent's limits.  If the new cgroup is a top level cgroup, then it
inherits from the system parameters (/proc/sys/vm/dirty_*).
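
In code terms, the inheritance amounts to something like this (the
struct and field names below are illustrative, not copied from the
patch):

struct vm_dirty_param {			/* illustrative */
	int		dirty_ratio;
	unsigned long	dirty_bytes;
	int		dirty_background_ratio;
	unsigned long	dirty_background_bytes;
};

static void __mem_cgroup_dirty_param(struct vm_dirty_param *param,
				     struct mem_cgroup *mem)
{
	if (mem && !mem_cgroup_is_root(mem)) {
		/* copy the parent's current parameters */
		*param = mem->dirty_param;
	} else {
		/* root memcg: fall back to the global vm.dirty_* values */
		param->dirty_ratio = vm_dirty_ratio;
		param->dirty_bytes = vm_dirty_bytes;
		param->dirty_background_ratio = dirty_background_ratio;
		param->dirty_background_bytes = dirty_background_bytes;
	}
}

So the defaults simply track whatever the parent (or the system) is set
to at creation time.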

>> - Consider both system and per-memcg dirty limits in page writeback when
>>   deciding to queue background writeback or block for foreground writeback.
>> 
>> Known shortcomings:
>> - When a cgroup dirty limit is exceeded, then bdi writeback is employed to
>>   writeback dirty inodes.  Bdi writeback considers inodes from any cgroup, not
>>   just inodes contributing dirty pages to the cgroup exceeding its limit.  
>
> yup.  Some broader discussion of the implications of this shortcoming
> is needed.  I'm not sure where it would be placed, though. 
> Documentation/ for now, until you write that book.

Fair enough.  I can add more text to Documentation/ describing the
behavior and issue in more detail.

>> - When memory.use_hierarchy is set, then dirty limits are disabled.  This is a
>>   implementation detail.
>
> So this is unintentional, and forced upon us my the present implementation?

Yes, this is not ideal.  I chose not to address this particular issue in
this series to keep the series smaller.

>>  An enhanced implementation is needed to check the
>>   chain of parents to ensure that no dirty limit is exceeded.
>
> How important is it that this be fixed?

I am not sure if there is interest in hierarchical per-memcg dirty
limits.  So I don't think that this is very important to be fixed
immediately.  But the fact that it doesn't work is unexpected.  It would
be nice if it just worked.  I'll look into making it work.

> And how feasible would that fix be?  A linear walk up the hierarchy
> list?  More than that?

I think it should be a simple matter of enhancing
mem_cgroup_dirty_info() to walk up the hierarchy looking for the cgroup
closest to its dirty limit.  The only tricky part is that there are
really two limits (foreground/throttling limit, and a background limit)
that need to be considered when finding the memcg that most deserves
inspection by balance_dirty_pages().
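
Sketched out, the walk would be something like this (illustrative only;
memcg_dirty_headroom() is a made-up helper standing in for "how close
is this memcg to whichever of its two limits applies"):

static struct mem_cgroup *memcg_nearest_dirty_limit(struct mem_cgroup *mem)
{
	struct mem_cgroup *nearest = mem;
	unsigned long min_headroom = ULONG_MAX;

	for (; mem; mem = parent_mem_cgroup(mem)) {
		unsigned long headroom = memcg_dirty_headroom(mem);

		if (headroom < min_headroom) {
			min_headroom = headroom;
			nearest = mem;
		}
	}
	return nearest;	/* the memcg balance_dirty_pages() should inspect */
}

mem_cgroup_dirty_info() would then report the usage and limits of this
memcg rather than only those of the current task's memcg.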

>> Performance data:
>> - A page fault microbenchmark workload was used to measure performance, which
>>   can be called in read or write mode:
>>         f = open(foo. $cpu)
>>         truncate(f, 4096)
>>         alarm(60)
>>         while (1) {
>>                 p = mmap(f, 4096)
>>                 if (write)
>> 			*p = 1
>> 		else
>> 			x = *p
>>                 munmap(p)
>>         }
>> 
>> - The workload was called for several points in the patch series in different
>>   modes:
>>   - s_read is a single threaded reader
>>   - s_write is a single threaded writer
>>   - p_read is a 16 thread reader, each operating on a different file
>>   - p_write is a 16 thread writer, each operating on a different file
>> 
>> - Measurements were collected on a 16 core non-numa system using "perf stat
>>   --repeat 3".  The -a option was used for parallel (p_*) runs.
>> 
>> - All numbers are page fault rate (M/sec).  Higher is better.
>> 
>> - To compare the performance of a kernel without memcg, compare the first and
>>   last rows; neither has memcg configured.  The first row does not include any
>>   of these memcg patches.
>> 
>> - To compare the performance of using memcg dirty limits, compare the baseline
>>   (2nd row titled "w/ memcg") with the code and memcg enabled (2nd to last
>>   row titled "all patches").
>> 
>>                            root_cgroup                    child_cgroup
>>                  s_read s_write p_read p_write   s_read s_write p_read p_write
>> mmotm w/o memcg   0.428  0.390   0.429  0.388
>> mmotm w/ memcg    0.411  0.378   0.391  0.362     0.412  0.377   0.385  0.363
>> all patches       0.384  0.360   0.370  0.348     0.381  0.363   0.368  0.347
>> all patches       0.431  0.402   0.427  0.395
>>   w/o memcg
>
> afaict this benchmark has demonstrated that the changes do not cause an
> appreciable performance regression in terms of CPU loading, yes?

Using the mmap() workload, which is a fault heavy workload...

When memcg is not configured, there is no significant performance
change.  Depending on the workload the performance is between 0%..3%
faster.  This is likely workload noise.

When memcg is configured, the performance drops between 4% and 8%.  Some
of this might be noise, but it is expected that memcg faults will get
slower because there's more code in the fault path.

> Can we come up with any tests which demonstrate the _benefits_ of the
> feature?

Here is a test script that shows a situation where memcg dirty limits
are beneficial.  The script runs two programs: a dirty page background
antagonist (dd) and an interactive foreground process (tar).  If the
script's argument is false, then both processes are run together in the
root cgroup, sharing system-wide dirty memory in classic fashion.  If the
script is given a true argument, then a cgroup is used to contain dd's
dirty page consumption.

---[start]---
#!/bin/bash
# dirty.sh - dirty limit performance test script
echo use_cgroup: $1

# start antagonist
if $1; then    # if using cgroup to contain 'dd'...
  mkdir /dev/cgroup/A
  echo 400M > /dev/cgroup/A/memory.dirty_limit_in_bytes
  (echo $BASHPID > /dev/cgroup/A/tasks; \
   dd if=/dev/zero of=big.file count=10k bs=1M) &
else
  dd if=/dev/zero of=big.file count=10k bs=1M &
fi

sleep 10

time tar -xzf linux-2.6.36.tar.gz
wait
$1 && rmdir /dev/cgroup/A
---[end]---

dirty.sh false : dd 59.7MB/s stddev 7.442%, tar 12.2s stddev 25.720%
  # both in root_cgroup
dirty.sh true  : dd 55.4MB/s stddev 0.958%, tar  3.8s stddev  0.250%
  # tar in root_cgroup, dd in cgroup

The cgroup reserved dirty memory resources for the rest of the system
processes (tar in this case).  The tar process had faster and more
predictable performance.  memcg dirty ratios might be useful to serve
different task classes (interactive vs batch).  A past discussion
touched on this: http://lkml.org/lkml/2010/5/20/136


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v4 03/11] memcg: create extensible page stat update routines
  2010-10-29  7:09   ` Greg Thelen
@ 2010-10-31 14:48     ` Ciju Rajan K
  -1 siblings, 0 replies; 75+ messages in thread
From: Ciju Rajan K @ 2010-10-31 14:48 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim,
	David Rientjes, Wu Fengguang

Greg Thelen wrote:
> Replace usage of the mem_cgroup_update_file_mapped() memcg
> statistic update routine with two new routines:
> * mem_cgroup_inc_page_stat()
> * mem_cgroup_dec_page_stat()
>
> As before, only the file_mapped statistic is managed.  However,
> these more general interfaces allow for new statistics to be
> more easily added.  New statistics are added with memcg dirty
> page accounting.
>
> Signed-off-by: Greg Thelen <gthelen@google.com>
> Signed-off-by: Andrea Righi <arighi@develer.com>
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> ---
> Changelog since v1:
> - Rename (for clarity):
>   - mem_cgroup_write_page_stat_item -> mem_cgroup_page_stat_item
>   - mem_cgroup_read_page_stat_item -> mem_cgroup_nr_pages_item
>
>  include/linux/memcontrol.h |   31 ++++++++++++++++++++++++++++---
>  mm/memcontrol.c            |   16 +++++++---------
>  mm/rmap.c                  |    4 ++--
>  3 files changed, 37 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 159a076..067115c 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -25,6 +25,11 @@ struct page_cgroup;
>  struct page;
>  struct mm_struct;
>
> +/* Stats that can be updated by kernel. */
> +enum mem_cgroup_page_stat_item {
> +	MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
> +};
> +
>  extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
>  					struct list_head *dst,
>  					unsigned long *scanned, int order,
> @@ -121,7 +126,22 @@ static inline bool mem_cgroup_disabled(void)
>  	return false;
>  }
>
> -void mem_cgroup_update_file_mapped(struct page *page, int val);
> +void mem_cgroup_update_page_stat(struct page *page,
> +				 enum mem_cgroup_page_stat_item idx,
> +				 int val);
> +
> +static inline void mem_cgroup_inc_page_stat(struct page *page,
> +					    enum mem_cgroup_page_stat_item idx)
> +{
> +	mem_cgroup_update_page_stat(page, idx, 1);
> +}
> +
> +static inline void mem_cgroup_dec_page_stat(struct page *page,
> +					    enum mem_cgroup_page_stat_item idx)
> +{
> +	mem_cgroup_update_page_stat(page, idx, -1);
> +}
> +
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  						gfp_t gfp_mask);
>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
> @@ -293,8 +313,13 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
>  {
>  }
>
> -static inline void mem_cgroup_update_file_mapped(struct page *page,
> -							int val)
> +static inline void mem_cgroup_inc_page_stat(struct page *page,
> +					    enum mem_cgroup_page_stat_item idx)
> +{
> +}
> +
> +static inline void mem_cgroup_dec_page_stat(struct page *page,
> +					    enum mem_cgroup_page_stat_item idx)
>  {
>  }
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 9a99cfa..4fd00c4 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1592,7 +1592,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
>   * possibility of race condition. If there is, we take a lock.
>   */
>
>   
Greg,

I am not seeing the function mem_cgroup_update_file_stat() in the latest
mmotm 2010-10-22-16-36, so I am not able to apply this patch.  I tried
cloning the entire mmotm git repository a couple of times, but no luck.
I also checked the web interface at
http://git.zen-kernel.org/mmotm/tree/mm/memcontrol.c; it is not there.
Surprisingly, git log doesn't show any recent changes to mm/memcontrol.c.
Am I missing something?  I can see this function in the mainline linux
2.6 git tree.

-Ciju

> -static void mem_cgroup_update_file_stat(struct page *page, int idx, int val)
> +void mem_cgroup_update_page_stat(struct page *page,
> +				 enum mem_cgroup_page_stat_item idx, int val)
>  {
>  	struct mem_cgroup *mem;
>  	struct page_cgroup *pc = lookup_page_cgroup(page);
> @@ -1615,30 +1616,27 @@ static void mem_cgroup_update_file_stat(struct page *page, int idx, int val)
>  			goto out;
>  	}
>
> -	this_cpu_add(mem->stat->count[idx], val);
> -
>  	switch (idx) {
> -	case MEM_CGROUP_STAT_FILE_MAPPED:
> +	case MEMCG_NR_FILE_MAPPED:
>  		if (val > 0)
>  			SetPageCgroupFileMapped(pc);
>  		else if (!page_mapped(page))
>  			ClearPageCgroupFileMapped(pc);
> +		idx = MEM_CGROUP_STAT_FILE_MAPPED;
>  		break;
>  	default:
>  		BUG();
>  	}
>
> +	this_cpu_add(mem->stat->count[idx], val);
> +
>  out:
>  	if (unlikely(need_unlock))
>  		unlock_page_cgroup(pc);
>  	rcu_read_unlock();
>  	return;
>  }
> -
> -void mem_cgroup_update_file_mapped(struct page *page, int val)
> -{
> -	mem_cgroup_update_file_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, val);
> -}
> +EXPORT_SYMBOL(mem_cgroup_update_page_stat);
>
>  /*
>   * size of first charge trial. "32" comes from vmscan.c's magic value.
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 1a8bf76..a66ab76 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -911,7 +911,7 @@ void page_add_file_rmap(struct page *page)
>  {
>  	if (atomic_inc_and_test(&page->_mapcount)) {
>  		__inc_zone_page_state(page, NR_FILE_MAPPED);
> -		mem_cgroup_update_file_mapped(page, 1);
> +		mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
>  	}
>  }
>
> @@ -949,7 +949,7 @@ void page_remove_rmap(struct page *page)
>  		__dec_zone_page_state(page, NR_ANON_PAGES);
>  	} else {
>  		__dec_zone_page_state(page, NR_FILE_MAPPED);
> -		mem_cgroup_update_file_mapped(page, -1);
> +		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
>  	}
>  	/*
>  	 * It would be tidy to reset the PageAnon mapping here,
>   
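
(As a side note on the changelog's point that new statistics become
easier to add: a purely hypothetical sketch of what a new counter would
look like on top of this interface.  MEMCG_NR_FILE_DIRTY and the call
sites below are illustrative only; the real dirty counters are
introduced by later patches in this series.)

/* hypothetical extension of the enum in include/linux/memcontrol.h */
enum mem_cgroup_page_stat_item {
	MEMCG_NR_FILE_MAPPED,	/* # of pages charged as file rss */
	MEMCG_NR_FILE_DIRTY,	/* illustrative: # of dirty file pages */
};

/* a caller would then account a page getting dirty/clean with: */
mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_DIRTY);
mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);

/* plus a matching case in mem_cgroup_update_page_stat()'s switch to map
 * MEMCG_NR_FILE_DIRTY onto an internal per-cpu counter. */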


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v4 11/11] memcg: check memcg dirty limits in page writeback
  2010-10-29 16:06       ` Greg Thelen
@ 2010-10-31 20:03         ` Wu Fengguang
  -1 siblings, 0 replies; 75+ messages in thread
From: Wu Fengguang @ 2010-10-31 20:03 UTC (permalink / raw)
  To: Greg Thelen
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, linux-kernel, linux-mm,
	containers, Andrea Righi, Balbir Singh, Daisuke Nishimura,
	Minchan Kim, Ciju Rajan K, David Rientjes

On Sat, Oct 30, 2010 at 12:06:33AM +0800, Greg Thelen wrote:
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:
> 
> > On Fri, 29 Oct 2010 00:09:14 -0700
> > Greg Thelen <gthelen@google.com> wrote:
> >
> >> If the current process is in a non-root memcg, then
> >> balance_dirty_pages() will consider the memcg dirty limits
> >> as well as the system-wide limits.  This allows different
> >> cgroups to have distinct dirty limits which trigger direct
> >> and background writeback at different levels.
> >> 
> >> Signed-off-by: Andrea Righi <arighi@develer.com>
> >> Signed-off-by: Greg Thelen <gthelen@google.com>
> >
> > Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

The "check both memcg&global dirty limit" looks much more sane than
the V3 implementation. Although it still has misbehaviors in some
cases, it's generally a good new feature to have.

Acked-by: Wu Fengguang <fengguang.wu@intel.com>

> > Ideally, I think some comments in the code for "why we need double-check system's
> > dirty limit and memcg's dirty limit" will be appreciated.
> 
> I will add to the balance_dirty_pages() comment.  It will read:
> /*
>  * balance_dirty_pages() must be called by processes which are generating dirty
>  * data.  It looks at the number of dirty pages in the machine and will force
>  * the caller to perform writeback if the system is over `vm_dirty_ratio'.
                   ~~~~~~~~~~~~~~~~~                  ~~~~

To be exact, it tries to throttle the dirtying rate so that
vm_dirty_ratio is not exceeded.  In fact, balance_dirty_pages() starts
throttling the dirtier slightly below vm_dirty_ratio.

>  * If we're over `background_thresh' then the writeback threads are woken to
>  * perform some writeout.  The current task may have per-memcg dirty
>  * limits, which are also checked.
>  */
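
For readers skimming the thread, a rough sketch of the dual check being
discussed, purely illustrative -- the function and variable names below
are made up and do not come from the patch:

/* Illustrative only: throttle when either the system-wide or the
 * current task's memcg dirty usage exceeds its threshold. */
static bool over_dirty_limits(unsigned long nr_dirty,
			      unsigned long dirty_thresh,
			      bool task_in_nonroot_memcg,
			      unsigned long memcg_nr_dirty,
			      unsigned long memcg_dirty_thresh)
{
	if (nr_dirty > dirty_thresh)
		return true;		/* global limit exceeded */
	if (task_in_nonroot_memcg && memcg_nr_dirty > memcg_dirty_thresh)
		return true;		/* memcg limit exceeded */
	return false;			/* no throttling needed yet */
}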


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v4 03/11] memcg: create extensible page stat update routines
  2010-10-31 14:48     ` Ciju Rajan K
@ 2010-10-31 20:11       ` Greg Thelen
  -1 siblings, 0 replies; 75+ messages in thread
From: Greg Thelen @ 2010-10-31 20:11 UTC (permalink / raw)
  To: Ciju Rajan K
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim,
	David Rientjes, Wu Fengguang

Ciju Rajan K <ciju@linux.vnet.ibm.com> writes:

> Greg Thelen wrote:
>> Replace usage of the mem_cgroup_update_file_mapped() memcg
>> statistic update routine with two new routines:
>> * mem_cgroup_inc_page_stat()
>> * mem_cgroup_dec_page_stat()
>>
>> As before, only the file_mapped statistic is managed.  However,
>> these more general interfaces allow for new statistics to be
>> more easily added.  New statistics are added with memcg dirty
>> page accounting.
>>
>> Signed-off-by: Greg Thelen <gthelen@google.com>
>> Signed-off-by: Andrea Righi <arighi@develer.com>
>> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
>> ---
>> Changelog since v1:
>> - Rename (for clarity):
>>   - mem_cgroup_write_page_stat_item -> mem_cgroup_page_stat_item
>>   - mem_cgroup_read_page_stat_item -> mem_cgroup_nr_pages_item
>>
>>  include/linux/memcontrol.h |   31 ++++++++++++++++++++++++++++---
>>  mm/memcontrol.c            |   16 +++++++---------
>>  mm/rmap.c                  |    4 ++--
>>  3 files changed, 37 insertions(+), 14 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 159a076..067115c 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -25,6 +25,11 @@ struct page_cgroup;
>>  struct page;
>>  struct mm_struct;
>>
>> +/* Stats that can be updated by kernel. */
>> +enum mem_cgroup_page_stat_item {
>> +	MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
>> +};
>> +
>>  extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
>>  					struct list_head *dst,
>>  					unsigned long *scanned, int order,
>> @@ -121,7 +126,22 @@ static inline bool mem_cgroup_disabled(void)
>>  	return false;
>>  }
>>
>> -void mem_cgroup_update_file_mapped(struct page *page, int val);
>> +void mem_cgroup_update_page_stat(struct page *page,
>> +				 enum mem_cgroup_page_stat_item idx,
>> +				 int val);
>> +
>> +static inline void mem_cgroup_inc_page_stat(struct page *page,
>> +					    enum mem_cgroup_page_stat_item idx)
>> +{
>> +	mem_cgroup_update_page_stat(page, idx, 1);
>> +}
>> +
>> +static inline void mem_cgroup_dec_page_stat(struct page *page,
>> +					    enum mem_cgroup_page_stat_item idx)
>> +{
>> +	mem_cgroup_update_page_stat(page, idx, -1);
>> +}
>> +
>>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>>  						gfp_t gfp_mask);
>>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
>> @@ -293,8 +313,13 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
>>  {
>>  }
>>
>> -static inline void mem_cgroup_update_file_mapped(struct page *page,
>> -							int val)
>> +static inline void mem_cgroup_inc_page_stat(struct page *page,
>> +					    enum mem_cgroup_page_stat_item idx)
>> +{
>> +}
>> +
>> +static inline void mem_cgroup_dec_page_stat(struct page *page,
>> +					    enum mem_cgroup_page_stat_item idx)
>>  {
>>  }
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 9a99cfa..4fd00c4 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -1592,7 +1592,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
>>   * possibility of race condition. If there is, we take a lock.
>>   */
>>
>>   
> Greg,
>
> I am not seeing the function mem_cgroup_update_file_stat() in the latest mmotm
> 2010-10-22-16-36.
> So not able to apply this patch. Tried couple of times cloning the entire mmotm
> git repository. But no luck.
> Tried in the web interface http://git.zen-kernel.org/mmotm/tree/mm/memcontrol.c
> also. It is not there.
> Surprisingly git log doesn't show any recent changes to mm/memcontrol.c. Am I
> missing something?
> I could see this function in the mainline linux 2.6 git tree.
>
> -Ciju

mem_cgroup_update_file_mapped() was renamed to
mem_cgroup_update_file_stat() in
http://userweb.kernel.org/~akpm/mmotm/broken-out/memcg-generic-filestat-update-interface.patch

I also do not see this in the mmotm git repo.  However, if I manually
apply the mmotm patches to v2.6.36 using quilt then I see the expected
patched memcontrol.c.  I am not sure why the zen-kernel.org mmotm git
repo differs from an mmotm-patched mainline 2.6.36.

Here is my procedure using quilt to patch mainline:

# Checkout 2.6.36 mainline
$ git checkout v2.6.36

# Confirm mainline 2.6.36 does not have mem_cgroup_update_file_stat()
$ grep mem_cgroup_update_file_stat -r mm

# Apply patches
$ curl http://userweb.kernel.org/~akpm/mmotm/broken-out.tar.gz | tar -xzf -
$ export QUILT_PATCHES=broken-out
$ quilt push -aq
...
Now at patch memblock-add-input-size-checking-to-memblock_find_region-fix.patch

# Now mm/memcontrol.c contains mem_cgroup_update_file_stat()
$ grep mem_cgroup_update_file_stat -r mm
mm/memcontrol.c:static void mem_cgroup_update_file_stat(struct page *page, int idx, int val)
mm/memcontrol.c:        mem_cgroup_update_file_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, val);

>> -static void mem_cgroup_update_file_stat(struct page *page, int idx, int val)
>> +void mem_cgroup_update_page_stat(struct page *page,
>> +				 enum mem_cgroup_page_stat_item idx, int val)
>>  {
>>  	struct mem_cgroup *mem;
>>  	struct page_cgroup *pc = lookup_page_cgroup(page);
>> @@ -1615,30 +1616,27 @@ static void mem_cgroup_update_file_stat(struct page *page, int idx, int val)
>>  			goto out;
>>  	}
>>
>> -	this_cpu_add(mem->stat->count[idx], val);
>> -
>>  	switch (idx) {
>> -	case MEM_CGROUP_STAT_FILE_MAPPED:
>> +	case MEMCG_NR_FILE_MAPPED:
>>  		if (val > 0)
>>  			SetPageCgroupFileMapped(pc);
>>  		else if (!page_mapped(page))
>>  			ClearPageCgroupFileMapped(pc);
>> +		idx = MEM_CGROUP_STAT_FILE_MAPPED;
>>  		break;
>>  	default:
>>  		BUG();
>>  	}
>>
>> +	this_cpu_add(mem->stat->count[idx], val);
>> +
>>  out:
>>  	if (unlikely(need_unlock))
>>  		unlock_page_cgroup(pc);
>>  	rcu_read_unlock();
>>  	return;
>>  }
>> -
>> -void mem_cgroup_update_file_mapped(struct page *page, int val)
>> -{
>> -	mem_cgroup_update_file_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, val);
>> -}
>> +EXPORT_SYMBOL(mem_cgroup_update_page_stat);
>>
>>  /*
>>   * size of first charge trial. "32" comes from vmscan.c's magic value.
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 1a8bf76..a66ab76 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -911,7 +911,7 @@ void page_add_file_rmap(struct page *page)
>>  {
>>  	if (atomic_inc_and_test(&page->_mapcount)) {
>>  		__inc_zone_page_state(page, NR_FILE_MAPPED);
>> -		mem_cgroup_update_file_mapped(page, 1);
>> +		mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
>>  	}
>>  }
>>
>> @@ -949,7 +949,7 @@ void page_remove_rmap(struct page *page)
>>  		__dec_zone_page_state(page, NR_ANON_PAGES);
>>  	} else {
>>  		__dec_zone_page_state(page, NR_FILE_MAPPED);
>> -		mem_cgroup_update_file_mapped(page, -1);
>> +		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
>>  	}
>>  	/*
>>  	 * It would be tidy to reset the PageAnon mapping here,
>>   


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v4 03/11] memcg: create extensible page stat update routines
  2010-10-31 20:11       ` Greg Thelen
@ 2010-11-01 20:16         ` Ciju Rajan K
  -1 siblings, 0 replies; 75+ messages in thread
From: Ciju Rajan K @ 2010-11-01 20:16 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim,
	David Rientjes, Wu Fengguang

Greg Thelen wrote:
> Ciju Rajan K <ciju@linux.vnet.ibm.com> writes:
>
>   
>> Greg Thelen wrote:
>>     
>>> Replace usage of the mem_cgroup_update_file_mapped() memcg
>>> statistic update routine with two new routines:
>>> * mem_cgroup_inc_page_stat()
>>> * mem_cgroup_dec_page_stat()
>>>
>>> As before, only the file_mapped statistic is managed.  However,
>>> these more general interfaces allow for new statistics to be
>>> more easily added.  New statistics are added with memcg dirty
>>> page accounting.
>>>
>>> Signed-off-by: Greg Thelen <gthelen@google.com>
>>> Signed-off-by: Andrea Righi <arighi@develer.com>
>>> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
>>> ---
>>> Changelog since v1:
>>> - Rename (for clarity):
>>>   - mem_cgroup_write_page_stat_item -> mem_cgroup_page_stat_item
>>>   - mem_cgroup_read_page_stat_item -> mem_cgroup_nr_pages_item
>>>
>>>  include/linux/memcontrol.h |   31 ++++++++++++++++++++++++++++---
>>>  mm/memcontrol.c            |   16 +++++++---------
>>>  mm/rmap.c                  |    4 ++--
>>>  3 files changed, 37 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>>> index 159a076..067115c 100644
>>> --- a/include/linux/memcontrol.h
>>> +++ b/include/linux/memcontrol.h
>>> @@ -25,6 +25,11 @@ struct page_cgroup;
>>>  struct page;
>>>  struct mm_struct;
>>>
>>> +/* Stats that can be updated by kernel. */
>>> +enum mem_cgroup_page_stat_item {
>>> +	MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
>>> +};
>>> +
>>>  extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
>>>  					struct list_head *dst,
>>>  					unsigned long *scanned, int order,
>>> @@ -121,7 +126,22 @@ static inline bool mem_cgroup_disabled(void)
>>>  	return false;
>>>  }
>>>
>>> -void mem_cgroup_update_file_mapped(struct page *page, int val);
>>> +void mem_cgroup_update_page_stat(struct page *page,
>>> +				 enum mem_cgroup_page_stat_item idx,
>>> +				 int val);
>>> +
>>> +static inline void mem_cgroup_inc_page_stat(struct page *page,
>>> +					    enum mem_cgroup_page_stat_item idx)
>>> +{
>>> +	mem_cgroup_update_page_stat(page, idx, 1);
>>> +}
>>> +
>>> +static inline void mem_cgroup_dec_page_stat(struct page *page,
>>> +					    enum mem_cgroup_page_stat_item idx)
>>> +{
>>> +	mem_cgroup_update_page_stat(page, idx, -1);
>>> +}
>>> +
>>>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>>>  						gfp_t gfp_mask);
>>>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
>>> @@ -293,8 +313,13 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
>>>  {
>>>  }
>>>
>>> -static inline void mem_cgroup_update_file_mapped(struct page *page,
>>> -							int val)
>>> +static inline void mem_cgroup_inc_page_stat(struct page *page,
>>> +					    enum mem_cgroup_page_stat_item idx)
>>> +{
>>> +}
>>> +
>>> +static inline void mem_cgroup_dec_page_stat(struct page *page,
>>> +					    enum mem_cgroup_page_stat_item idx)
>>>  {
>>>  }
>>>
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index 9a99cfa..4fd00c4 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -1592,7 +1592,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
>>>   * possibility of race condition. If there is, we take a lock.
>>>   */
>>>
>>>   
>>>       
>> Greg,
>>
>> I am not seeing the function mem_cgroup_update_file_stat() in the latest mmotm
>> 2010-10-22-16-36.
>> So not able to apply this patch. Tried couple of times cloning the entire mmotm
>> git repository. But no luck.
>> Tried in the web interface http://git.zen-kernel.org/mmotm/tree/mm/memcontrol.c
>> also. It is not there.
>> Surprisingly git log doesn't show any recent changes to mm/memcontrol.c. Am I
>> missing something?
>> I could see this function in the mainline linux 2.6 git tree.
>>
>> -Ciju
>>     
>
> mem_cgroup_update_file_mapped() was renamed to
> mem_cgroup_update_file_stat() in
> http://userweb.kernel.org/~akpm/mmotm/broken-out/memcg-generic-filestat-update-interface.patch
>
> I also do not see this in the mmotm git repo.  However, if I manually
> apply the mmotm patches to v2.6.36 using quilt then I see the expected
> patched memcontrol.c.  I am not sure why the zen-kernel.org git mmotm
> repo differs from a mmotm patched mainline 2.6.36.
>
> Here is my procedure using quilt to patch mainline:
>
> # Checkout 2.6.36 mainline
> $ git checkout v2.6.36
>
> # Confirm mainline 2.6.36 does not have mem_cgroup_update_file_stat()
> $ grep mem_cgroup_update_file_stat -r mm
>
> # Apply patches
> $ curl http://userweb.kernel.org/~akpm/mmotm/broken-out.tar.gz | tar -xzf -
> $ export QUILT_PATCHES=broken-out
> $ quilt push -aq
> ...
> Now at patch memblock-add-input-size-checking-to-memblock_find_region-fix.patch
>
> # Now the memcontrol contains mem_cgroup_update_file_stat()
> $ grep mem_cgroup_update_file_stat -r mm
> mm/memcontrol.c:static void mem_cgroup_update_file_stat(struct page *page, int idx, int val)
> mm/memcontrol.c:        mem_cgroup_update_file_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, val);
>
>   
Thank you Greg! I will try these steps.
I can see that the per-cgroup dirty page accounting patches are already in the
latest broken-out.tar.gz.

-Ciju
>>> -static void mem_cgroup_update_file_stat(struct page *page, int idx, int val)
>>> +void mem_cgroup_update_page_stat(struct page *page,
>>> +				 enum mem_cgroup_page_stat_item idx, int val)
>>>  {
>>>  	struct mem_cgroup *mem;
>>>  	struct page_cgroup *pc = lookup_page_cgroup(page);
>>> @@ -1615,30 +1616,27 @@ static void mem_cgroup_update_file_stat(struct page *page, int idx, int val)
>>>  			goto out;
>>>  	}
>>>
>>> -	this_cpu_add(mem->stat->count[idx], val);
>>> -
>>>  	switch (idx) {
>>> -	case MEM_CGROUP_STAT_FILE_MAPPED:
>>> +	case MEMCG_NR_FILE_MAPPED:
>>>  		if (val > 0)
>>>  			SetPageCgroupFileMapped(pc);
>>>  		else if (!page_mapped(page))
>>>  			ClearPageCgroupFileMapped(pc);
>>> +		idx = MEM_CGROUP_STAT_FILE_MAPPED;
>>>  		break;
>>>  	default:
>>>  		BUG();
>>>  	}
>>>
>>> +	this_cpu_add(mem->stat->count[idx], val);
>>> +
>>>  out:
>>>  	if (unlikely(need_unlock))
>>>  		unlock_page_cgroup(pc);
>>>  	rcu_read_unlock();
>>>  	return;
>>>  }
>>> -
>>> -void mem_cgroup_update_file_mapped(struct page *page, int val)
>>> -{
>>> -	mem_cgroup_update_file_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, val);
>>> -}
>>> +EXPORT_SYMBOL(mem_cgroup_update_page_stat);
>>>
>>>  /*
>>>   * size of first charge trial. "32" comes from vmscan.c's magic value.
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 1a8bf76..a66ab76 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -911,7 +911,7 @@ void page_add_file_rmap(struct page *page)
>>>  {
>>>  	if (atomic_inc_and_test(&page->_mapcount)) {
>>>  		__inc_zone_page_state(page, NR_FILE_MAPPED);
>>> -		mem_cgroup_update_file_mapped(page, 1);
>>> +		mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
>>>  	}
>>>  }
>>>
>>> @@ -949,7 +949,7 @@ void page_remove_rmap(struct page *page)
>>>  		__dec_zone_page_state(page, NR_ANON_PAGES);
>>>  	} else {
>>>  		__dec_zone_page_state(page, NR_FILE_MAPPED);
>>> -		mem_cgroup_update_file_mapped(page, -1);
>>> +		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
>>>  	}
>>>  	/*
>>>  	 * It would be tidy to reset the PageAnon mapping here,
>>>   
>>>       


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v4 00/11] memcg: per cgroup dirty page accounting
  2010-10-30 21:46     ` Greg Thelen
@ 2010-11-02 19:33       ` Ciju Rajan K
  -1 siblings, 0 replies; 75+ messages in thread
From: Ciju Rajan K @ 2010-11-02 19:33 UTC (permalink / raw)
  To: Greg Thelen, Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim,
	David Rientjes, Wu Fengguang, Ciju Rajan K

Greg Thelen wrote:
> Andrew Morton <akpm@linux-foundation.org> writes:
>
>   
>> On Fri, 29 Oct 2010 00:09:03 -0700
>> Greg Thelen <gthelen@google.com> wrote:
>>
>> This is cool stuff - it's been a long haul.  One day we'll be
>> nearly-finished and someone will write a book telling people how to use
>> it all and lots of people will go "holy crap".  I hope.
>>
>>     
>>> Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim)
>>> page cache used by a cgroup.  So, in case of multiple cgroup writers, they will
>>> not be able to consume more than their designated share of dirty pages and will
>>> be forced to perform write-out if they cross that limit.
>>>
>>> The patches are based on a series proposed by Andrea Righi in Mar 2010.
>>>
>>> Overview:
>>> - Add page_cgroup flags to record when pages are dirty, in writeback, or nfs
>>>   unstable.
>>>
>>> - Extend mem_cgroup to record the total number of pages in each of the 
>>>   interesting dirty states (dirty, writeback, unstable_nfs).  
>>>
>>> - Add dirty parameters similar to the system-wide  /proc/sys/vm/dirty_*
>>>   limits to mem_cgroup.  The mem_cgroup dirty parameters are accessible
>>>   via cgroupfs control files.
>>>       
>> Curious minds will want to know what the default values are set to and
>> how they were determined.
>>     
>
> When a memcg is created, its dirty limits are set to a copy of the
> parent's limits.  If the new cgroup is a top level cgroup, then it
> inherits from the system parameters (/proc/sys/vm/dirty_*).
>
>   
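
As a concrete illustration of that inheritance (the mount point and the
values are hypothetical; memory.dirty_limit_in_bytes is the control
file used in the test script later in this mail):

# a fresh top-level memcg starts with a copy of the system vm.dirty_* settings
$ mkdir /dev/cgroup/A
# a child created afterwards copies its parent's limits
$ echo 400M > /dev/cgroup/A/memory.dirty_limit_in_bytes
$ mkdir /dev/cgroup/A/B
$ cat /dev/cgroup/A/B/memory.dirty_limit_in_bytes
419430400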
>>> - Consider both system and per-memcg dirty limits in page writeback when
>>>   deciding to queue background writeback or block for foreground writeback.
>>>
>>> Known shortcomings:
>>> - When a cgroup dirty limit is exceeded, then bdi writeback is employed to
>>>   writeback dirty inodes.  Bdi writeback considers inodes from any cgroup, not
>>>   just inodes contributing dirty pages to the cgroup exceeding its limit.  
>>>       
>> yup.  Some broader discussion of the implications of this shortcoming
>> is needed.  I'm not sure where it would be placed, though. 
>> Documentation/ for now, until you write that book.
>>     
>
> Fair enough.  I can add more text to Documentation/ describing the
> behavior and issue in more detail.
>
>   
>>> - When memory.use_hierarchy is set, then dirty limits are disabled.  This is a
>>>   implementation detail.
>>>       
>> So this is unintentional, and forced upon us my the present implementation?
>>     
>
> Yes, this is not ideal.  I chose not to address this particular issue in
> this series to keep the series smaller.
>
>   
>>>  An enhanced implementation is needed to check the
>>>   chain of parents to ensure that no dirty limit is exceeded.
>>>       
>> How important is it that this be fixed?
>>     
>
> I am not sure if there is interest in hierarchical per-memcg dirty
> limits.  So I don't think that this is very important to be fixed
> immediately.  But the fact that it doesn't work is unexpected.  It would
> be nice if it just worked.  I'll look into making it work.
>
>   
>> And how feasible would that fix be?  A linear walk up the hierarchy
>> list?  More than that?
>>     
>
> I think it should be a simple matter of enhancing
> mem_cgroup_dirty_info() to walk up the hierarchy looking for the cgroup
> closest to its dirty limit.  The only tricky part is that there are
> really two limits (foreground/throttling limit, and a background limit)
> that need to be considered when finding the memcg that most deserves
> inspection by balance_dirty_pages().
>
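
A rough sketch of that parent walk, purely illustrative -- the helpers
below (hypothetical_parent(), dirty_headroom()) are made up and only
show the shape of the loop:

/* Illustrative only: starting from the task's memcg, walk toward the
 * root and return the ancestor with the least headroom before one of
 * its dirty limits, so balance_dirty_pages() can act on that one. */
static struct mem_cgroup *most_constrained_memcg(struct mem_cgroup *memcg)
{
	struct mem_cgroup *worst = memcg;
	struct mem_cgroup *iter;

	for (iter = memcg; iter; iter = hypothetical_parent(iter))
		if (dirty_headroom(iter) < dirty_headroom(worst))
			worst = iter;
	return worst;
}

dirty_headroom() would have to consider both the background and the
foreground (throttling) limit, which is the tricky part mentioned above.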
>   
>>> Performance data:
>>> - A page fault microbenchmark workload was used to measure performance, which
>>>   can be called in read or write mode:
>>>         f = open(foo. $cpu)
>>>         truncate(f, 4096)
>>>         alarm(60)
>>>         while (1) {
>>>                 p = mmap(f, 4096)
>>>                 if (write)
>>> 			*p = 1
>>> 		else
>>> 			x = *p
>>>                 munmap(p)
>>>         }
>>>
>>> - The workload was called for several points in the patch series in different
>>>   modes:
>>>   - s_read is a single threaded reader
>>>   - s_write is a single threaded writer
>>>   - p_read is a 16 thread reader, each operating on a different file
>>>   - p_write is a 16 thread writer, each operating on a different file
>>>
>>> - Measurements were collected on a 16 core non-numa system using "perf stat
>>>   --repeat 3".  The -a option was used for parallel (p_*) runs.
>>>
>>> - All numbers are page fault rate (M/sec).  Higher is better.
>>>
>>> - To compare the performance of a kernel without memcg, compare the first and
>>>   last rows; neither has memcg configured.  The first row does not include any
>>>   of these memcg patches.
>>>
>>> - To compare the performance of using memcg dirty limits, compare the baseline
>>>   (2nd row titled "w/ memcg") with the code and memcg enabled (2nd to last
>>>   row titled "all patches").
>>>
>>>                            root_cgroup                    child_cgroup
>>>                  s_read s_write p_read p_write   s_read s_write p_read p_write
>>> mmotm w/o memcg   0.428  0.390   0.429  0.388
>>> mmotm w/ memcg    0.411  0.378   0.391  0.362     0.412  0.377   0.385  0.363
>>> all patches       0.384  0.360   0.370  0.348     0.381  0.363   0.368  0.347
>>> all patches       0.431  0.402   0.427  0.395
>>>   w/o memcg
>>>       
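
For reference, a self-contained approximation of the page-fault
microbenchmark pseudocode quoted above (error handling and fault
counting are filled in here; the file is keyed on the pid rather than
the cpu, and one instance per CPU would be launched for the p_* runs):

#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static volatile sig_atomic_t done;
static void stop(int sig) { (void)sig; done = 1; }

int main(int argc, char **argv)
{
	int write_mode = argc > 1 && !strcmp(argv[1], "write");
	char name[64];
	long faults = 0;

	snprintf(name, sizeof(name), "foo.%d", (int)getpid());
	int fd = open(name, O_CREAT | O_RDWR, 0644);
	if (fd < 0 || ftruncate(fd, 4096))
		return 1;

	signal(SIGALRM, stop);
	alarm(60);
	while (!done) {
		char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);
		if (p == MAP_FAILED)
			return 1;
		if (write_mode)
			*p = 1;				/* dirtying fault */
		else
			(void)*(volatile char *)p;	/* read-only fault */
		munmap(p, 4096);
		faults++;
	}
	printf("%ld page faults in 60s\n", faults);
	return 0;
}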
>> afaict this benchmark has demonstrated that the changes do not cause an
>> appreciable performance regression in terms of CPU loading, yes?
>>     
>
> Using the mmap() workload, which is a fault heavy workload...
>
> When memcg is not configured, there is no significant performance
> change.  Depending on the workload the performance is between 0%..3%
> faster.  This is likely workload noise.
>
> When memcg is configured, the performance drops between 4% and 8%.  Some
> of this might be noise, but it is expected that memcg faults will get
> slower because there's more code in the fault path.
>
>   
>> Can we come up with any tests which demonstrate the _benefits_ of the
>> feature?
>>     
>
> Here is a test script that shows a situation where memcg dirty limits
> are beneficial.  The script runs two programs: a dirty page background
> antagonist (dd) and an interactive foreground process (tar).  If the
> scripts argument is false, then both processes are run together in the
> root cgroup sharing system-wide dirty memory in classic fashion.  If the
> script is given a true argument, then a cgroup is used to contain dd
> dirty page consumption.
>
> ---[start]---
> #!/bin/bash
> # dirty.sh - dirty limit performance test script
> echo use_cgroup: $1
>
> # start antagonist
> if $1; then    # if using cgroup to contain 'dd'...
>   mkdir /dev/cgroup/A
>   echo 400M > /dev/cgroup/A/memory.dirty_limit_in_bytes
>   (echo $BASHPID > /dev/cgroup/A/tasks; dd if=/dev/zero of=big.file \
>      count=10k bs=1M) &
> else
>   dd if=/dev/zero of=big.file count=10k bs=1M &
> fi
>
> sleep 10
>
> time tar -xzf linux-2.6.36.tar.gz
> wait
> $1 && rmdir /dev/cgroup/A
> ---[end]---
>
> dirty.sh false : dd 59.7MB/s stddev 7.442%, tar 12.2s stddev 25.720%
>   # both in root_cgroup
> dirty.sh true  : dd 55.4MB/s stddev 0.958%, tar  3.8s stddev  0.250%
>   # tar in root_cgroup, dd in cgroup
>   
Reviewed-by: Ciju Rajan K <ciju@linux.vnet.ibm.com>

Tested-by: Ciju Rajan K <ciju@linux.vnet.ibm.com>

> The cgroup reserved dirty memory resources for the rest of the system
> processes (tar in this case).  The tar process had faster and more
> predictable performance.  memcg dirty ratios might be useful to serve
> different task classes (interactive vs batch).  A past discussion
> touched on this: http://lkml.org/lkml/2010/5/20/136
>   


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v4 00/11] memcg: per cgroup dirty page accounting
@ 2010-11-02 19:33       ` Ciju Rajan K
  0 siblings, 0 replies; 75+ messages in thread
From: Ciju Rajan K @ 2010-11-02 19:33 UTC (permalink / raw)
  To: Greg Thelen, Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim,
	David Rientjes, Wu Fengguang, Ciju Rajan K

Greg Thelen wrote:
> Andrew Morton <akpm@linux-foundation.org> writes:
>
>   
>> On Fri, 29 Oct 2010 00:09:03 -0700
>> Greg Thelen <gthelen@google.com> wrote:
>>
>> This is cool stuff - it's been a long haul.  One day we'll be
>> nearly-finished and someone will write a book telling people how to use
>> it all and lots of people will go "holy crap".  I hope.
>>
>>     
>>> Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim)
>>> page cache used by a cgroup.  So, in case of multiple cgroup writers, they will
>>> not be able to consume more than their designated share of dirty pages and will
>>> be forced to perform write-out if they cross that limit.
>>>
>>> The patches are based on a series proposed by Andrea Righi in Mar 2010.
>>>
>>> Overview:
>>> - Add page_cgroup flags to record when pages are dirty, in writeback, or nfs
>>>   unstable.
>>>
>>> - Extend mem_cgroup to record the total number of pages in each of the 
>>>   interesting dirty states (dirty, writeback, unstable_nfs).  
>>>
>>> - Add dirty parameters similar to the system-wide  /proc/sys/vm/dirty_*
>>>   limits to mem_cgroup.  The mem_cgroup dirty parameters are accessible
>>>   via cgroupfs control files.
>>>       
>> Curious minds will want to know what the default values are set to and
>> how they were determined.
>>     
>
> When a memcg is created, its dirty limits are set to a copy of the
> parent's limits.  If the new cgroup is a top level cgroup, then it
> inherits from the system parameters (/proc/sys/vm/dirty_*).
>
>   
>>> - Consider both system and per-memcg dirty limits in page writeback when
>>>   deciding to queue background writeback or block for foreground writeback.
>>>
>>> Known shortcomings:
>>> - When a cgroup dirty limit is exceeded, then bdi writeback is employed to
>>>   writeback dirty inodes.  Bdi writeback considers inodes from any cgroup, not
>>>   just inodes contributing dirty pages to the cgroup exceeding its limit.  
>>>       
>> yup.  Some broader discussion of the implications of this shortcoming
>> is needed.  I'm not sure where it would be placed, though. 
>> Documentation/ for now, until you write that book.
>>     
>
> Fair enough.  I can add more text to Documentation/ describing the
> behavior and issue in more detail.
>
>   
>>> - When memory.use_hierarchy is set, then dirty limits are disabled.  This is a
>>>   implementation detail.
>>>       
>> So this is unintentional, and forced upon us my the present implementation?
>>     
>
> Yes, this is not ideal.  I chose not to address this particular issue in
> this series to keep the series smaller.
>
>   
>>>  An enhanced implementation is needed to check the
>>>   chain of parents to ensure that no dirty limit is exceeded.
>>>       
>> How important is it that this be fixed?
>>     
>
> I am not sure if there is interest in hierarchical per-memcg dirty
> limits.  So I don't think that this is very important to be fixed
> immediately.  But the fact that it doesn't work is unexpected.  It would
> be nice if it just worked.  I'll look into making it work.
>
>   
>> And how feasible would that fix be?  A linear walk up the hierarchy
>> list?  More than that?
>>     
>
> I think it should be a simple matter of enhancing
> mem_cgroup_dirty_info() to walk up the hierarchy looking for the cgroup
> closest to its dirty limit.  The only tricky part is that there are
> really two limits (foreground/throttling limit, and a background limit)
> that need to be considered when finding the memcg that most deserves
> inspection by balance_dirty_pages().
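>
> A rough sketch of such a walk (illustrative only; the helper and the
> dirty_info fields used here are assumptions, not code from this
> series):
>
> /* Sketch: return the memcg in the hierarchy whose dirty usage is
>  * closest to a limit.  A real implementation would keep the
>  * background and foreground (throttling) limits separate. */
> static struct mem_cgroup *memcg_nearest_dirty_limit(struct mem_cgroup *memcg)
> {
> 	struct mem_cgroup *closest = memcg;
> 	long best_margin = LONG_MAX;
>
> 	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> 		struct dirty_info info;
> 		long margin;
>
> 		if (!mem_cgroup_dirty_info(memcg, &info))
> 			continue;
> 		/* background_thresh <= dirty_thresh, so it is the
> 		 * nearer of the two limits */
> 		margin = (long)info.background_thresh -
> 			 (long)info.nr_reclaimable;
> 		if (margin < best_margin) {
> 			best_margin = margin;
> 			closest = memcg;
> 		}
> 	}
> 	return closest;
> }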
>
>   
>>> Performance data:
>>> - A page fault microbenchmark workload was used to measure performance, which
>>>   can be called in read or write mode:
>>>         f = open(foo.$cpu)
>>>         truncate(f, 4096)
>>>         alarm(60)
>>>         while (1) {
>>>                 p = mmap(f, 4096)
>>>                 if (write)
>>>                         *p = 1
>>>                 else
>>>                         x = *p
>>>                 munmap(p)
>>>         }
>>>
>>> - The workload was called for several points in the patch series in different
>>>   modes:
>>>   - s_read is a single threaded reader
>>>   - s_write is a single threaded writer
>>>   - p_read is a 16 thread reader, each operating on a different file
>>>   - p_write is a 16 thread writer, each operating on a different file
>>>
>>> - Measurements were collected on a 16 core non-numa system using "perf stat
>>>   --repeat 3".  The -a option was used for parallel (p_*) runs.
>>>
>>> - All numbers are page fault rate (M/sec).  Higher is better.
>>>
>>> - To compare the performance of a kernel without memcg, compare the first and
>>>   last rows; neither has memcg configured.  The first row does not include any
>>>   of these memcg patches.
>>>
>>> - To compare the performance of using memcg dirty limits, compare the baseline
>>>   (2nd row titled "w/ memcg") with the code and memcg enabled (2nd to last
>>>   row titled "all patches").
>>>
>>>                            root_cgroup                    child_cgroup
>>>                  s_read s_write p_read p_write   s_read s_write p_read p_write
>>> mmotm w/o memcg   0.428  0.390   0.429  0.388
>>> mmotm w/ memcg    0.411  0.378   0.391  0.362     0.412  0.377   0.385  0.363
>>> all patches       0.384  0.360   0.370  0.348     0.381  0.363   0.368  0.347
>>> all patches       0.431  0.402   0.427  0.395
>>>   w/o memcg
>>>       
>> afaict this benchmark has demonstrated that the changes do not cause an
>> appreciable performance regression in terms of CPU loading, yes?
>>     
>
> Using the mmap() workload, which is a fault-heavy workload...
>
> When memcg is not configured, there is no significant performance
> change.  Depending on the workload, performance is between 0% and 3%
> faster, which is likely noise.
>
> When memcg is configured, the performance drops between 4% and 8%.  Some
> of this might be noise, but it is expected that memcg faults will get
> slower because there's more code in the fault path.
>
>   
>> Can we come up with any tests which demonstrate the _benefits_ of the
>> feature?
>>     
>
> Here is a test script that shows a situation where memcg dirty limits
> are beneficial.  The script runs two programs: a dirty page background
> antagonist (dd) and an interactive foreground process (tar).  If the
> script's argument is false, then both processes run together in the
> root cgroup, sharing system-wide dirty memory in classic fashion.  If the
> script is given a true argument, then a cgroup is used to contain dd's
> dirty page consumption.
>
> ---[start]---
> #!/bin/bash
> # dirty.sh - dirty limit performance test script
> echo use_cgroup: $1
>
> # start antagonist
> if $1; then    # if using cgroup to contain 'dd'...
>   mkdir /dev/cgroup/A
>   echo 400M > /dev/cgroup/A/memory.dirty_limit_in_bytes
>   (echo $BASHPID > /dev/cgroup/A/tasks; \
>    dd if=/dev/zero of=big.file count=10k bs=1M) &
> else
>   dd if=/dev/zero of=big.file count=10k bs=1M &
> fi
>
> sleep 10
>
> time tar -xzf linux-2.6.36.tar.gz
> wait
> $1 && rmdir /dev/cgroup/A
> ---[end]---
>
> dirty.sh false : dd 59.7MB/s stddev 7.442%, tar 12.2s stddev 25.720%
>   # both in root_cgroup
> dirty.sh true  : dd 55.4MB/s stddev 0.958%, tar  3.8s stddev  0.250%
>   # tar in root_cgroup, dd in cgroup
>   
Reviewed-by: Ciju Rajan K <ciju@linux.vnet.ibm.com>

Tested-by: Ciju Rajan K <ciju@linux.vnet.ibm.com>

> The cgroup reserved dirty memory resources for the rest of the system
> processes (tar in this case).  The tar process had faster and more
> predictable performance.  memcg dirty ratios might be useful to serve
> different task classes (interactive vs batch).  A past discussion
> touched on this: http://lkml.org/lkml/2010/5/20/136
>   


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v4 03/11] memcg: create extensible page stat update routines
  2010-10-31 20:11       ` Greg Thelen
@ 2010-11-02 19:35         ` Ciju Rajan K
  -1 siblings, 0 replies; 75+ messages in thread
From: Ciju Rajan K @ 2010-11-02 19:35 UTC (permalink / raw)
  To: Greg Thelen, Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim,
	David Rientjes, Wu Fengguang

Greg Thelen wrote:
> Ciju Rajan K <ciju@linux.vnet.ibm.com> writes:
>
>   
>> Greg Thelen wrote:
>>     
>>> Replace usage of the mem_cgroup_update_file_mapped() memcg
>>> statistic update routine with two new routines:
>>> * mem_cgroup_inc_page_stat()
>>> * mem_cgroup_dec_page_stat()
>>>
>>> As before, only the file_mapped statistic is managed.  However,
>>> these more general interfaces allow for new statistics to be
>>> more easily added.  New statistics are added with memcg dirty
>>> page accounting.
>>>
>>> Signed-off-by: Greg Thelen <gthelen@google.com>
>>> Signed-off-by: Andrea Righi <arighi@develer.com>
>>> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
>>> ---
>>> Changelog since v1:
>>> - Rename (for clarity):
>>>   - mem_cgroup_write_page_stat_item -> mem_cgroup_page_stat_item
>>>   - mem_cgroup_read_page_stat_item -> mem_cgroup_nr_pages_item
>>>
>>>  include/linux/memcontrol.h |   31 ++++++++++++++++++++++++++++---
>>>  mm/memcontrol.c            |   16 +++++++---------
>>>  mm/rmap.c                  |    4 ++--
>>>  3 files changed, 37 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>>> index 159a076..067115c 100644
>>> --- a/include/linux/memcontrol.h
>>> +++ b/include/linux/memcontrol.h
>>> @@ -25,6 +25,11 @@ struct page_cgroup;
>>>  struct page;
>>>  struct mm_struct;
>>>
>>> +/* Stats that can be updated by kernel. */
>>> +enum mem_cgroup_page_stat_item {
>>> +	MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
>>> +};
>>> +
>>>  extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
>>>  					struct list_head *dst,
>>>  					unsigned long *scanned, int order,
>>> @@ -121,7 +126,22 @@ static inline bool mem_cgroup_disabled(void)
>>>  	return false;
>>>  }
>>>
>>> -void mem_cgroup_update_file_mapped(struct page *page, int val);
>>> +void mem_cgroup_update_page_stat(struct page *page,
>>> +				 enum mem_cgroup_page_stat_item idx,
>>> +				 int val);
>>> +
>>> +static inline void mem_cgroup_inc_page_stat(struct page *page,
>>> +					    enum mem_cgroup_page_stat_item idx)
>>> +{
>>> +	mem_cgroup_update_page_stat(page, idx, 1);
>>> +}
>>> +
>>> +static inline void mem_cgroup_dec_page_stat(struct page *page,
>>> +					    enum mem_cgroup_page_stat_item idx)
>>> +{
>>> +	mem_cgroup_update_page_stat(page, idx, -1);
>>> +}
>>> +
>>>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>>>  						gfp_t gfp_mask);
>>>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
>>> @@ -293,8 +313,13 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
>>>  {
>>>  }
>>>
>>> -static inline void mem_cgroup_update_file_mapped(struct page *page,
>>> -							int val)
>>> +static inline void mem_cgroup_inc_page_stat(struct page *page,
>>> +					    enum mem_cgroup_page_stat_item idx)
>>> +{
>>> +}
>>> +
>>> +static inline void mem_cgroup_dec_page_stat(struct page *page,
>>> +					    enum mem_cgroup_page_stat_item idx)
>>>  {
>>>  }
>>>
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index 9a99cfa..4fd00c4 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -1592,7 +1592,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
>>>   * possibility of race condition. If there is, we take a lock.
>>>   */
>>>
>>>   
>>>       
>> Greg,
>>
>> I am not seeing the function mem_cgroup_update_file_stat() in the latest mmotm
>> 2010-10-22-16-36.
>> So I am not able to apply this patch. I tried cloning the entire mmotm
>> git repository a couple of times, but no luck.
>> I also tried the web interface at http://git.zen-kernel.org/mmotm/tree/mm/memcontrol.c,
>> and it is not there either.
>> Surprisingly, git log doesn't show any recent changes to mm/memcontrol.c. Am I
>> missing something?
>> I could see this function in the mainline linux 2.6 git tree.
>>
>> -Ciju
>>     
>
> mem_cgroup_update_file_mapped() was renamed to
> mem_cgroup_update_file_stat() in
> http://userweb.kernel.org/~akpm/mmotm/broken-out/memcg-generic-filestat-update-interface.patch
>
> I also do not see this in the mmotm git repo.  However, if I manually
> apply the mmotm patches to v2.6.36 using quilt, then I see the expected
> patched memcontrol.c.  I am not sure why the zen-kernel.org mmotm git
> repo differs from an mmotm-patched mainline 2.6.36.
>
> Here is my procedure using quilt to patch mainline:
>
> # Checkout 2.6.36 mainline
> $ git checkout v2.6.36
>
> # Confirm mainline 2.6.36 does not have mem_cgroup_update_file_stat()
> $ grep mem_cgroup_update_file_stat -r mm
>
> # Apply patches
> $ curl http://userweb.kernel.org/~akpm/mmotm/broken-out.tar.gz | tar -xzf -
> $ export QUILT_PATCHES=broken-out
> $ quilt push -aq
> ...
> Now at patch memblock-add-input-size-checking-to-memblock_find_region-fix.patch
>
> # Now memcontrol.c contains mem_cgroup_update_file_stat()
> $ grep mem_cgroup_update_file_stat -r mm
> mm/memcontrol.c:static void mem_cgroup_update_file_stat(struct page *page, int idx, int val)
> mm/memcontrol.c:        mem_cgroup_update_file_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, val);
>
>   
Reviewed-by: Ciju Rajan K <ciju@linux.vnet.ibm.com>
>>> -static void mem_cgroup_update_file_stat(struct page *page, int idx, int val)
>>> +void mem_cgroup_update_page_stat(struct page *page,
>>> +				 enum mem_cgroup_page_stat_item idx, int val)
>>>  {
>>>  	struct mem_cgroup *mem;
>>>  	struct page_cgroup *pc = lookup_page_cgroup(page);
>>> @@ -1615,30 +1616,27 @@ static void mem_cgroup_update_file_stat(struct page *page, int idx, int val)
>>>  			goto out;
>>>  	}
>>>
>>> -	this_cpu_add(mem->stat->count[idx], val);
>>> -
>>>  	switch (idx) {
>>> -	case MEM_CGROUP_STAT_FILE_MAPPED:
>>> +	case MEMCG_NR_FILE_MAPPED:
>>>  		if (val > 0)
>>>  			SetPageCgroupFileMapped(pc);
>>>  		else if (!page_mapped(page))
>>>  			ClearPageCgroupFileMapped(pc);
>>> +		idx = MEM_CGROUP_STAT_FILE_MAPPED;
>>>  		break;
>>>  	default:
>>>  		BUG();
>>>  	}
>>>
>>> +	this_cpu_add(mem->stat->count[idx], val);
>>> +
>>>  out:
>>>  	if (unlikely(need_unlock))
>>>  		unlock_page_cgroup(pc);
>>>  	rcu_read_unlock();
>>>  	return;
>>>  }
>>> -
>>> -void mem_cgroup_update_file_mapped(struct page *page, int val)
>>> -{
>>> -	mem_cgroup_update_file_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, val);
>>> -}
>>> +EXPORT_SYMBOL(mem_cgroup_update_page_stat);
>>>
>>>  /*
>>>   * size of first charge trial. "32" comes from vmscan.c's magic value.
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 1a8bf76..a66ab76 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -911,7 +911,7 @@ void page_add_file_rmap(struct page *page)
>>>  {
>>>  	if (atomic_inc_and_test(&page->_mapcount)) {
>>>  		__inc_zone_page_state(page, NR_FILE_MAPPED);
>>> -		mem_cgroup_update_file_mapped(page, 1);
>>> +		mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
>>>  	}
>>>  }
>>>
>>> @@ -949,7 +949,7 @@ void page_remove_rmap(struct page *page)
>>>  		__dec_zone_page_state(page, NR_ANON_PAGES);
>>>  	} else {
>>>  		__dec_zone_page_state(page, NR_FILE_MAPPED);
>>> -		mem_cgroup_update_file_mapped(page, -1);
>>> +		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
>>>  	}
>>>  	/*
>>>  	 * It would be tidy to reset the PageAnon mapping here,
>>>   
>>>       


^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v4 05/11] writeback: create dirty_info structure
  2010-10-29  7:09   ` Greg Thelen
@ 2010-11-18  0:49     ` Andrew Morton
  -1 siblings, 0 replies; 75+ messages in thread
From: Andrew Morton @ 2010-11-18  0:49 UTC (permalink / raw)
  To: Greg Thelen
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes, Wu Fengguang

On Fri, 29 Oct 2010 00:09:08 -0700
Greg Thelen <gthelen@google.com> wrote:

> Bundle dirty limits and dirty memory usage metrics into a dirty_info
> structure to simplify interfaces of routines that need all.

Problems...

These patches interact pretty badly with Fengguang's "IO-less dirty
throttling v2" patches.  I fixed up
writeback-create-dirty_info-structure.patch pretty mechanically but
when it got to memcg-check-memcg-dirty-limits-in-page-writeback.patch
things got sticky and I gave up.

As your stuff was merged first, I'd normally send the bad news to
Fengguang, but the memcg code is logically built upon the core
writeback code so I do think these patches should be staged after the
changes to core writeback.

Also, while I was there it seemed that the chosen members of the
dirty_info structure were a bit random.  Perhaps we should be putting
nr_dirty in there as well, perhaps other things.  Please have a think
about that.
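
For concreteness, one possible fuller shape, purely as a sketch (the
usage-counter fields are illustrative, not necessarily what the patches
define):

	struct dirty_info {
		unsigned long dirty_thresh;		/* foreground/throttle limit */
		unsigned long background_thresh;	/* background writeback limit */
		unsigned long nr_dirty;			/* NR_FILE_DIRTY */
		unsigned long nr_writeback;		/* NR_WRITEBACK */
		unsigned long nr_unstable_nfs;		/* NR_UNSTABLE_NFS */
	};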

Also, in ratelimit_pages() we call global_dirty_info() to return four
items, but that caller only actually uses two of them.  Wasted effort?
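
If that matters, a sketch of a threshold-only caller (this assumes the
pre-existing global_dirty_limits() helper is kept alongside the new
global_dirty_info() for callers that do not need the usage counters):

	/* Hypothetical: size the per-CPU ratelimit from the dirty
	 * threshold alone, without summing dirty/writeback counters. */
	static void sketch_set_ratelimit(void)
	{
		unsigned long background_thresh, dirty_thresh;

		global_dirty_limits(&background_thresh, &dirty_thresh);
		/* ... derive ratelimit_pages from dirty_thresh ... */
	}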


So I'm afraid I'm going to have to request that you redo and retest
these patches:

writeback-create-dirty_info-structure.patch
memcg-add-dirty-page-accounting-infrastructure.patch
memcg-add-kernel-calls-for-memcg-dirty-page-stats.patch
memcg-add-dirty-limits-to-mem_cgroup.patch
memcg-add-dirty-limits-to-mem_cgroup-use-native-word-to-represent-dirtyable-pages.patch
memcg-add-dirty-limits-to-mem_cgroup-catch-negative-per-cpu-sums-in-dirty-info.patch
memcg-add-dirty-limits-to-mem_cgroup-avoid-overflow-in-memcg_hierarchical_free_pages.patch
memcg-add-dirty-limits-to-mem_cgroup-correct-memcg_hierarchical_free_pages-return-type.patch
memcg-add-dirty-limits-to-mem_cgroup-avoid-free-overflow-in-memcg_hierarchical_free_pages.patch
memcg-cpu-hotplug-lockdep-warning-fix.patch
memcg-add-cgroupfs-interface-to-memcg-dirty-limits.patch
memcg-break-out-event-counters-from-other-stats.patch
memcg-check-memcg-dirty-limits-in-page-writeback.patch
memcg-use-native-word-page-statistics-counters.patch
memcg-use-native-word-page-statistics-counters-fix.patch
#
memcg-add-mem_cgroup-parameter-to-mem_cgroup_page_stat.patch
memcg-pass-mem_cgroup-to-mem_cgroup_dirty_info.patch
#memcg-make-throttle_vm_writeout-memcg-aware.patch: "troublesome": Kamezawa
memcg-make-throttle_vm_writeout-memcg-aware.patch
memcg-make-throttle_vm_writeout-memcg-aware-fix.patch
memcg-simplify-mem_cgroup_page_stat.patch
memcg-simplify-mem_cgroup_dirty_info.patch
memcg-make-mem_cgroup_page_stat-return-value-unsigned.patch

against the http://userweb.kernel.org/~akpm/mmotm/ which I just
uploaded, sorry.  I've uploaded my copy of all the above to
http://userweb.kernel.org/~akpm/stuff/gthelen.tar.gz.  I think only the
two patches need fixing and retesting.

Also, while wrangling the above patches, I stumbled across rejects such
as:


***************
*** 99,106 ****
                   "state:            %8lx\n",
                   (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
                   (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
-                  K(bdi_thresh), K(dirty_thresh),
-                  K(background_thresh), nr_dirty, nr_io, nr_more_io,
                   !list_empty(&bdi->bdi_list), bdi->state);
  #undef K
  
--- 98,106 ----
                   "state:            %8lx\n",
                   (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
                   (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
+                  K(bdi_thresh), K(dirty_info.dirty_thresh),
+                  K(dirty_info.background_thresh),
+                  nr_dirty, nr_io, nr_more_io,
                   !list_empty(&bdi->bdi_list), bdi->state);

Please, if you discover crud like this, just fix it up.  One item per
line:

                   "state:            %8lx\n",
                   (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
                   (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
		   K(bdi_thresh),
		   K(dirty_info.dirty_thresh),
		   K(dirty_info.background_thresh),
		   nr_dirty,
		   nr_io,
		   nr_more_io,
                   !list_empty(&bdi->bdi_list), bdi->state);

all very simple.  And while you're there, fix up the
tab-tab-space-space-space indenting - just use tabs.


The other area where code maintenance is harder than it needs to be is
in definitions of locals:

        long nr_reclaimable;
        long nr_dirty, bdi_dirty;  /* = file_dirty + writeback + unstable_nfs */
        long bdi_prev_dirty = 0;

again, that's just dopey.  Change it to

        long nr_reclaimable;
        long nr_dirty;
	long bdi_dirty;		/* = file_dirty + writeback + unstable_nfs */
        long bdi_prev_dirty = 0;

All very simple.

Thanks.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v4 05/11] writeback: create dirty_info structure
  2010-11-18  0:49     ` Andrew Morton
  (?)
@ 2010-11-18  0:50       ` Andrew Morton
  -1 siblings, 0 replies; 75+ messages in thread
From: Andrew Morton @ 2010-11-18  0:50 UTC (permalink / raw)
  To: Greg Thelen, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh

On Wed, 17 Nov 2010 16:49:24 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> against the http://userweb.kernel.org/~akpm/mmotm/ which I just
> uploaded

err, will upload Real Soon Now.

^ permalink raw reply	[flat|nested] 75+ messages in thread

* Re: [PATCH v4 05/11] writeback: create dirty_info structure
  2010-11-18  0:49     ` Andrew Morton
@ 2010-11-18  2:02       ` Greg Thelen
  -1 siblings, 0 replies; 75+ messages in thread
From: Greg Thelen @ 2010-11-18  2:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes, Wu Fengguang

Andrew Morton <akpm@linux-foundation.org> writes:

> On Fri, 29 Oct 2010 00:09:08 -0700
> Greg Thelen <gthelen@google.com> wrote:
>
>> Bundle dirty limits and dirty memory usage metrics into a dirty_info
>> structure to simplify interfaces of routines that need all.
>
> Problems...
>
> These patches interact pretty badly with Fengguang's "IO-less dirty
> throttling v2" patches.  I fixed up
> writeback-create-dirty_info-structure.patch pretty mechanically but
> when it got to memcg-check-memcg-dirty-limits-in-page-writeback.patch
> things got sticky and I gave up.
>
> As your stuff was merged first, I'd normally send the bad news to
> Fengguang, but the memcg code is logically built upon the core
> writeback code so I do think these patches should be staged after the
> changes to core writeback.
>
> Also, while I was there it seemed that the chosen members of the
> dirty_info structure were a bit random.  Perhaps we should be putting
> nr_dirty in there as well, perhaps other things.  Please have a think
> about that.
>
> Also, in ratelimit_pages() we call global_dirty_info() to return four
> items, but that caller only actually uses two of them.  Wasted effort?
>
>
> So I'm afraid I'm going to have to request that you redo and retest
> these patches:
>
> writeback-create-dirty_info-structure.patch
> memcg-add-dirty-page-accounting-infrastructure.patch
> memcg-add-kernel-calls-for-memcg-dirty-page-stats.patch
> memcg-add-dirty-limits-to-mem_cgroup.patch
> memcg-add-dirty-limits-to-mem_cgroup-use-native-word-to-represent-dirtyable-pages.patch
> memcg-add-dirty-limits-to-mem_cgroup-catch-negative-per-cpu-sums-in-dirty-info.patch
> memcg-add-dirty-limits-to-mem_cgroup-avoid-overflow-in-memcg_hierarchical_free_pages.patch
> memcg-add-dirty-limits-to-mem_cgroup-correct-memcg_hierarchical_free_pages-return-type.patch
> memcg-add-dirty-limits-to-mem_cgroup-avoid-free-overflow-in-memcg_hierarchical_free_pages.patch
> memcg-cpu-hotplug-lockdep-warning-fix.patch
> memcg-add-cgroupfs-interface-to-memcg-dirty-limits.patch
> memcg-break-out-event-counters-from-other-stats.patch
> memcg-check-memcg-dirty-limits-in-page-writeback.patch
> memcg-use-native-word-page-statistics-counters.patch
> memcg-use-native-word-page-statistics-counters-fix.patch
> #
> memcg-add-mem_cgroup-parameter-to-mem_cgroup_page_stat.patch
> memcg-pass-mem_cgroup-to-mem_cgroup_dirty_info.patch
> #memcg-make-throttle_vm_writeout-memcg-aware.patch: "troublesome": Kamezawa
> memcg-make-throttle_vm_writeout-memcg-aware.patch
> memcg-make-throttle_vm_writeout-memcg-aware-fix.patch
> memcg-simplify-mem_cgroup_page_stat.patch
> memcg-simplify-mem_cgroup_dirty_info.patch
> memcg-make-mem_cgroup_page_stat-return-value-unsigned.patch
>
> against the http://userweb.kernel.org/~akpm/mmotm/ which I just
> uploaded, sorry.  I've uploaded my copy of all the above to
> http://userweb.kernel.org/~akpm/stuff/gthelen.tar.gz.  I think only the
> two patches need fixing and retesting.
>
> Also, while wrangling the above patches, I stumbled across rejects such
> as:
>
>
> ***************
> *** 99,106 ****
>                    "state:            %8lx\n",
>                    (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
>                    (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
> -                  K(bdi_thresh), K(dirty_thresh),
> -                  K(background_thresh), nr_dirty, nr_io, nr_more_io,
>                    !list_empty(&bdi->bdi_list), bdi->state);
>   #undef K
>   
> --- 98,106 ----
>                    "state:            %8lx\n",
>                    (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
>                    (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
> +                  K(bdi_thresh), K(dirty_info.dirty_thresh),
> +                  K(dirty_info.background_thresh),
> +                  nr_dirty, nr_io, nr_more_io,
>                    !list_empty(&bdi->bdi_list), bdi->state);
>
> Please, if you discover crud like this, just fix it up.  One item per
> line:
>
>                    "state:            %8lx\n",
>                    (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
>                    (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
> 		   K(bdi_thresh),
> 		   K(dirty_info.dirty_thresh),
> 		   K(dirty_info.background_thresh),
> 		   nr_dirty,
> 		   nr_io,
> 		   nr_more_io,
>                    !list_empty(&bdi->bdi_list), bdi->state);
>
> all very simple.  And while you're there, fix up the
> tab-tab-space-space-space indenting - just use tabs.
>
>
> The other area where code maintenance is harder than it needs to be is
> in definitions of locals:
>
>         long nr_reclaimable;
>         long nr_dirty, bdi_dirty;  /* = file_dirty + writeback + unstable_nfs */
>         long bdi_prev_dirty = 0;
>
> again, that's just dopey.  Change it to
>
>         long nr_reclaimable;
>         long nr_dirty;
> 	long bdi_dirty;		/* = file_dirty + writeback + unstable_nfs */
>         long bdi_prev_dirty = 0;
>
> All very simple.
>
> Thanks.

I am leaving on vacation until Monday.  Once I return, this will be one
of the first things I get to.  I will resubmit the patches based on
whatever the latest mmotm is on Monday.

^ permalink raw reply	[flat|nested] 75+ messages in thread

end of thread, other threads:[~2010-11-18  2:03 UTC | newest]

Thread overview: 75+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-10-29  7:09 [PATCH v4 00/11] memcg: per cgroup dirty page accounting Greg Thelen
2010-10-29  7:09 ` Greg Thelen
2010-10-29  7:09 ` [PATCH v4 01/11] memcg: add page_cgroup flags for dirty page tracking Greg Thelen
2010-10-29  7:09   ` Greg Thelen
2010-10-29  7:09 ` [PATCH v4 02/11] memcg: document cgroup dirty memory interfaces Greg Thelen
2010-10-29  7:09   ` Greg Thelen
2010-10-29 11:03   ` Wu Fengguang
2010-10-29 11:03     ` Wu Fengguang
2010-10-29 21:35     ` Greg Thelen
2010-10-29 21:35       ` Greg Thelen
2010-10-30  3:02       ` Wu Fengguang
2010-10-30  3:02         ` Wu Fengguang
2010-10-29 20:19   ` Andrew Morton
2010-10-29 20:19     ` Andrew Morton
2010-10-29 21:37     ` Greg Thelen
2010-10-29 21:37       ` Greg Thelen
2010-10-29  7:09 ` [PATCH v4 03/11] memcg: create extensible page stat update routines Greg Thelen
2010-10-29  7:09   ` Greg Thelen
2010-10-31 14:48   ` Ciju Rajan K
2010-10-31 14:48     ` Ciju Rajan K
2010-10-31 20:11     ` Greg Thelen
2010-10-31 20:11       ` Greg Thelen
2010-11-01 20:16       ` Ciju Rajan K
2010-11-01 20:16         ` Ciju Rajan K
2010-11-02 19:35       ` Ciju Rajan K
2010-11-02 19:35         ` Ciju Rajan K
2010-10-29  7:09 ` [PATCH v4 04/11] memcg: add lock to synchronize page accounting and migration Greg Thelen
2010-10-29  7:09   ` Greg Thelen
2010-10-29  7:09 ` [PATCH v4 05/11] writeback: create dirty_info structure Greg Thelen
2010-10-29  7:09   ` Greg Thelen
2010-10-29  7:50   ` KAMEZAWA Hiroyuki
2010-10-29  7:50     ` KAMEZAWA Hiroyuki
2010-11-18  0:49   ` Andrew Morton
2010-11-18  0:49     ` Andrew Morton
2010-11-18  0:50     ` Andrew Morton
2010-11-18  0:50       ` Andrew Morton
2010-11-18  0:50       ` Andrew Morton
2010-11-18  2:02     ` Greg Thelen
2010-11-18  2:02       ` Greg Thelen
2010-10-29  7:09 ` [PATCH v4 06/11] memcg: add dirty page accounting infrastructure Greg Thelen
2010-10-29  7:09   ` Greg Thelen
2010-10-29 11:13   ` Wu Fengguang
2010-10-29 11:13     ` Wu Fengguang
2010-10-29 11:17     ` KAMEZAWA Hiroyuki
2010-10-29 11:17       ` KAMEZAWA Hiroyuki
2010-10-29  7:09 ` [PATCH v4 07/11] memcg: add kernel calls for memcg dirty page stats Greg Thelen
2010-10-29  7:09   ` Greg Thelen
2010-10-29  7:09 ` [PATCH v4 08/11] memcg: add dirty limits to mem_cgroup Greg Thelen
2010-10-29  7:09   ` Greg Thelen
2010-10-29  7:41   ` KAMEZAWA Hiroyuki
2010-10-29  7:41     ` KAMEZAWA Hiroyuki
2010-10-29 16:00     ` Greg Thelen
2010-10-29 16:00       ` Greg Thelen
2010-10-29  7:09 ` [PATCH v4 09/11] memcg: CPU hotplug lockdep warning fix Greg Thelen
2010-10-29  7:09   ` Greg Thelen
2010-10-29 20:19   ` Andrew Morton
2010-10-29 20:19     ` Andrew Morton
2010-10-29  7:09 ` [PATCH v4 10/11] memcg: add cgroupfs interface to memcg dirty limits Greg Thelen
2010-10-29  7:09   ` Greg Thelen
2010-10-29  7:43   ` KAMEZAWA Hiroyuki
2010-10-29  7:43     ` KAMEZAWA Hiroyuki
2010-10-29  7:09 ` [PATCH v4 11/11] memcg: check memcg dirty limits in page writeback Greg Thelen
2010-10-29  7:09   ` Greg Thelen
2010-10-29  7:48   ` KAMEZAWA Hiroyuki
2010-10-29  7:48     ` KAMEZAWA Hiroyuki
2010-10-29 16:06     ` Greg Thelen
2010-10-29 16:06       ` Greg Thelen
2010-10-31 20:03       ` Wu Fengguang
2010-10-31 20:03         ` Wu Fengguang
2010-10-29 20:19 ` [PATCH v4 00/11] memcg: per cgroup dirty page accounting Andrew Morton
2010-10-29 20:19   ` Andrew Morton
2010-10-30 21:46   ` Greg Thelen
2010-10-30 21:46     ` Greg Thelen
2010-11-02 19:33     ` Ciju Rajan K
2010-11-02 19:33       ` Ciju Rajan K
