linux-kernel.vger.kernel.org archive mirror
* [PATCH v3 00/11] memcg: per cgroup dirty page accounting
@ 2010-10-19  0:39 Greg Thelen
  2010-10-19  0:39 ` [PATCH v3 01/11] memcg: add page_cgroup flags for dirty page tracking Greg Thelen
                   ` (11 more replies)
  0 siblings, 12 replies; 64+ messages in thread
From: Greg Thelen @ 2010-10-19  0:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes, Greg Thelen

Changes since v2:
- Rather than disabling softirq in lock_page_cgroup(), introduce a separate lock
  to synchronize between memcg page accounting and migration.  This only affects
  patch 4 of the series.  Patch 4 used to disable softirq, now it introduces the
  new lock.

Changes since v1:
- Renamed "nfs"/"total_nfs" to "nfs_unstable"/"total_nfs_unstable" in per cgroup
  memory.stat to match /proc/meminfo.
- Avoid lockdep warnings by using rcu_read_[un]lock() in
  mem_cgroup_has_dirty_limit().
- Fixed lockdep issue in mem_cgroup_read_stat() which is exposed by these
  patches.
- Remove redundant comments.
- Rename (for clarity):
  - mem_cgroup_write_page_stat_item -> mem_cgroup_page_stat_item
  - mem_cgroup_read_page_stat_item -> mem_cgroup_nr_pages_item
- Renamed newly created proc files:
  - memory.dirty_bytes -> memory.dirty_limit_in_bytes
  - memory.dirty_background_bytes -> memory.dirty_background_limit_in_bytes
- Removed unnecessary get_ prefix from get_xxx() functions.
- Allow [kKmMgG] suffixes for newly created dirty limit value cgroupfs files.
- Disable softirq rather than hardirq in lock_page_cgroup()
- Made mem_cgroup_move_account_page_stat() inline.
- Ported patches to mmotm-2010-10-13-17-13.

This patch set provides the ability for each cgroup to have independent dirty
page limits.

Limiting dirty memory caps the maximum amount of dirty (hard to reclaim) page
cache used by a cgroup.  So, when there are multiple cgroup writers, no cgroup
is able to consume more than its designated share of dirty pages, and a cgroup
is forced to perform write-out once it crosses that limit.

The patches are based on a series proposed by Andrea Righi in Mar 2010.


Overview:
- Add page_cgroup flags to record when pages are dirty, in writeback, or nfs
  unstable.

- Extend mem_cgroup to record the total number of pages in each of the 
  interesting dirty states (dirty, writeback, unstable_nfs).  

- Add dirty parameters similar to the system-wide  /proc/sys/vm/dirty_*
  limits to mem_cgroup.  The mem_cgroup dirty parameters are accessible
  via cgroupfs control files.

- Consider both system and per-memcg dirty limits in page writeback when
  deciding to queue background writeback or block for foreground writeback;
  a rough sketch of this decision follows.
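
As an illustrative sketch only (the helper name below is made up, and the
interfaces mem_cgroup_has_dirty_limit() and mem_cgroup_page_stat() are only
introduced by later patches in this series), the writeback path is expected to
use per-memcg dirty counts when a memcg limit applies and fall back to the
global counters otherwise:

	static unsigned long nr_reclaimable_pages(void)
	{
		s64 nr;

		if (mem_cgroup_has_dirty_limit()) {
			nr = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
			if (nr >= 0)
				return nr;	/* dirty + unstable NFS in this memcg */
		}
		return global_page_state(NR_FILE_DIRTY) +
		       global_page_state(NR_UNSTABLE_NFS);
	}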


Known shortcomings:
- When a cgroup dirty limit is exceeded, bdi writeback is employed to write
  back dirty inodes.  Bdi writeback considers inodes from any cgroup, not
  just inodes contributing dirty pages to the cgroup exceeding its limit.


Performance data:
- A page fault microbenchmark workload, which can be run in read or write mode,
  was used to measure performance (a self-contained C version is sketched after
  the results table below):
        f = open(foo.$cpu)
        truncate(f, 4096)
        alarm(60)
        while (1) {
                p = mmap(f, 4096)
                if (write)
			*p = 1
		else
			x = *p
                munmap(p)
        }

- The workload was called for several points in the patch series in different
  modes:
  - s_read is a single threaded reader
  - s_write is a single threaded writer
  - p_read is a 16 thread reader, each operating on a different file
  - p_write is a 16 thread writer, each operating on a different file

- Measurements were collected on a 16 core non-NUMA system using "perf stat
  --repeat 3".  The -a option was used for parallel (p_*) runs.

- All numbers are page fault rate (M/sec).  Higher is better.

- To compare the performance impact on a kernel without memcg, compare the first
  and last rows; neither has memcg configured.  The first row does not include
  any of these memcg patches.

- To compare the performance of using memcg dirty limits, compare the baseline
  (2nd row titled "w/ memcg") with the code and memcg enabled (2nd to last
  row titled "all patches").

                           root_cgroup                     child_cgroup
                 s_read s_write p_read p_write    s_read s_write p_read p_write
mmotm w/o memcg   0.424  0.400   0.420  0.395
w/ memcg          0.419  0.390   0.395  0.371      0.413  0.385   0.385  0.361
all patches       0.421  0.384   0.395  0.362      0.418  0.380   0.396  0.360
all patches       0.425  0.396   0.423  0.388
  w/o memcg
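
For reference, a self-contained C version of the microbenchmark loop might look
as follows.  The per-CPU file naming, the write-mode selection via an extra
argument, and relying on the default SIGALRM action to stop after 60 seconds
are assumptions; the original harness is not shown.

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <unistd.h>

        int main(int argc, char **argv)
        {
                int do_write = argc > 2;  /* any extra argument selects write mode */
                volatile char x;
                char name[64];

                snprintf(name, sizeof(name), "foo.%s", argv[1]); /* e.g. "foo.3" */
                int fd = open(name, O_CREAT | O_RDWR, 0644);

                ftruncate(fd, 4096);
                alarm(60);              /* default SIGALRM action terminates us */
                for (;;) {
                        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, 0);
                        if (do_write)
                                *p = 1;         /* fault in and dirty the page */
                        else
                                x = *p;         /* read-only fault */
                        munmap(p, 4096);
                }
                return 0;
        }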


Balbir Singh (1):
  memcg: CPU hotplug lockdep warning fix

Greg Thelen (9):
  memcg: add page_cgroup flags for dirty page tracking
  memcg: document cgroup dirty memory interfaces
  memcg: create extensible page stat update routines
  memcg: add dirty page accounting infrastructure
  memcg: add kernel calls for memcg dirty page stats
  memcg: add dirty limits to mem_cgroup
  memcg: add cgroupfs interface to memcg dirty limits
  writeback: make determine_dirtyable_memory() static.
  memcg: check memcg dirty limits in page writeback

KAMEZAWA Hiroyuki (1):
  memcg: add lock to synchronize page accounting and migration

 Documentation/cgroups/memory.txt |   60 ++++++
 fs/nfs/write.c                   |    4 +
 include/linux/memcontrol.h       |   78 +++++++-
 include/linux/page_cgroup.h      |   54 +++++-
 include/linux/writeback.h        |    2 -
 mm/filemap.c                     |    1 +
 mm/memcontrol.c                  |  417 ++++++++++++++++++++++++++++++++++++--
 mm/page-writeback.c              |  213 +++++++++++++-------
 mm/rmap.c                        |    4 +-
 mm/truncate.c                    |    1 +
 10 files changed, 726 insertions(+), 108 deletions(-)

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH v3 01/11] memcg: add page_cgroup flags for dirty page tracking
  2010-10-19  0:39 [PATCH v3 00/11] memcg: per cgroup dirty page accounting Greg Thelen
@ 2010-10-19  0:39 ` Greg Thelen
  2010-10-19  4:31   ` Daisuke Nishimura
  2010-10-19  0:39 ` [PATCH v3 02/11] memcg: document cgroup dirty memory interfaces Greg Thelen
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 64+ messages in thread
From: Greg Thelen @ 2010-10-19  0:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes, Greg Thelen

Add additional flags to page_cgroup to track dirty pages
within a mem_cgroup.
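
Each new flag gets the usual page_cgroup accessor helpers.  As a minimal sketch
(this is just the macro expansion implied by the patch, not new code, and the
SETPCGFLAG body is assumed from the existing header), the FILE_DIRTY flag ends
up with helpers along the lines of:

	static inline void SetPageCgroupFileDirty(struct page_cgroup *pc)
		{ set_bit(PCG_FILE_DIRTY, &pc->flags); }
	static inline int TestSetPageCgroupFileDirty(struct page_cgroup *pc)
		{ return test_and_set_bit(PCG_FILE_DIRTY, &pc->flags); }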

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrea Righi <arighi@develer.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
---
 include/linux/page_cgroup.h |   23 +++++++++++++++++++++++
 1 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 5bb13b3..b59c298 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -40,6 +40,9 @@ enum {
 	PCG_USED, /* this object is in use. */
 	PCG_ACCT_LRU, /* page has been accounted for */
 	PCG_FILE_MAPPED, /* page is accounted as "mapped" */
+	PCG_FILE_DIRTY, /* page is dirty */
+	PCG_FILE_WRITEBACK, /* page is under writeback */
+	PCG_FILE_UNSTABLE_NFS, /* page is NFS unstable */
 	PCG_MIGRATION, /* under page migration */
 };
 
@@ -59,6 +62,10 @@ static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
 static inline int TestClearPageCgroup##uname(struct page_cgroup *pc)	\
 	{ return test_and_clear_bit(PCG_##lname, &pc->flags);  }
 
+#define TESTSETPCGFLAG(uname, lname)			\
+static inline int TestSetPageCgroup##uname(struct page_cgroup *pc)	\
+	{ return test_and_set_bit(PCG_##lname, &pc->flags);  }
+
 TESTPCGFLAG(Locked, LOCK)
 
 /* Cache flag is set only once (at allocation) */
@@ -80,6 +87,22 @@ SETPCGFLAG(FileMapped, FILE_MAPPED)
 CLEARPCGFLAG(FileMapped, FILE_MAPPED)
 TESTPCGFLAG(FileMapped, FILE_MAPPED)
 
+SETPCGFLAG(FileDirty, FILE_DIRTY)
+CLEARPCGFLAG(FileDirty, FILE_DIRTY)
+TESTPCGFLAG(FileDirty, FILE_DIRTY)
+TESTCLEARPCGFLAG(FileDirty, FILE_DIRTY)
+TESTSETPCGFLAG(FileDirty, FILE_DIRTY)
+
+SETPCGFLAG(FileWriteback, FILE_WRITEBACK)
+CLEARPCGFLAG(FileWriteback, FILE_WRITEBACK)
+TESTPCGFLAG(FileWriteback, FILE_WRITEBACK)
+
+SETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+CLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+TESTPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+TESTCLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+TESTSETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+
 SETPCGFLAG(Migration, MIGRATION)
 CLEARPCGFLAG(Migration, MIGRATION)
 TESTPCGFLAG(Migration, MIGRATION)
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v3 02/11] memcg: document cgroup dirty memory interfaces
  2010-10-19  0:39 [PATCH v3 00/11] memcg: per cgroup dirty page accounting Greg Thelen
  2010-10-19  0:39 ` [PATCH v3 01/11] memcg: add page_cgroup flags for dirty page tracking Greg Thelen
@ 2010-10-19  0:39 ` Greg Thelen
  2010-10-19  0:46   ` KAMEZAWA Hiroyuki
  2010-10-19  8:27   ` Daisuke Nishimura
  2010-10-19  0:39 ` [PATCH v3 03/11] memcg: create extensible page stat update routines Greg Thelen
                   ` (9 subsequent siblings)
  11 siblings, 2 replies; 64+ messages in thread
From: Greg Thelen @ 2010-10-19  0:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes, Greg Thelen

Document cgroup dirty memory interfaces and statistics.
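
For illustration, a minimal userspace sketch of setting and reading one of the
new control files; the cgroup mount point, group name, and limit value are
assumptions and not part of this patch:

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		/* Assumed mount point and cgroup name. */
		const char *path =
			"/cgroups/memory/grp0/memory.dirty_limit_in_bytes";
		char buf[32];
		int fd = open(path, O_RDWR);

		if (fd < 0)
			return 1;
		write(fd, "200M", 4);	/* [kKmMgG] suffixes are accepted */
		lseek(fd, 0, SEEK_SET);
		ssize_t n = read(fd, buf, sizeof(buf) - 1);
		if (n > 0) {
			buf[n] = '\0';
			printf("dirty_limit_in_bytes: %s", buf);
		}
		close(fd);
		return 0;
	}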

Signed-off-by: Andrea Righi <arighi@develer.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
---

Changelog since v1:
- Renamed "nfs"/"total_nfs" to "nfs_unstable"/"total_nfs_unstable" in per cgroup
  memory.stat to match /proc/meminfo.

- Allow [kKmMgG] suffixes for newly created dirty limit value cgroupfs files.

- Describe a situation where a cgroup can exceed its dirty limit.

 Documentation/cgroups/memory.txt |   60 ++++++++++++++++++++++++++++++++++++++
 1 files changed, 60 insertions(+), 0 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 7781857..02bbd6f 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -385,6 +385,10 @@ mapped_file	- # of bytes of mapped file (includes tmpfs/shmem)
 pgpgin		- # of pages paged in (equivalent to # of charging events).
 pgpgout		- # of pages paged out (equivalent to # of uncharging events).
 swap		- # of bytes of swap usage
+dirty		- # of bytes that are waiting to get written back to the disk.
+writeback	- # of bytes that are actively being written back to the disk.
+nfs_unstable	- # of bytes sent to the NFS server, but not yet committed to
+		the actual storage.
 inactive_anon	- # of bytes of anonymous memory and swap cache memory on
 		LRU list.
 active_anon	- # of bytes of anonymous and swap cache memory on active
@@ -453,6 +457,62 @@ memory under it will be reclaimed.
 You can reset failcnt by writing 0 to failcnt file.
 # echo 0 > .../memory.failcnt
 
+5.5 dirty memory
+
+Control the maximum amount of dirty pages a cgroup can have at any given time.
+
+Limiting dirty memory caps the maximum amount of dirty (hard to reclaim) page
+cache used by a cgroup.  So, when there are multiple cgroup writers, no cgroup
+is able to consume more than its designated share of dirty pages, and a cgroup
+is forced to perform write-out once it crosses that limit.
+
+The interface is equivalent to the procfs interface: /proc/sys/vm/dirty_*.  It
+is possible to configure a limit to trigger either direct writeback or
+background writeback performed by per-bdi flusher threads.  The root cgroup
+memory.dirty_* control files are read-only and match the contents of
+the /proc/sys/vm/dirty_* files.
+
+Per-cgroup dirty limits can be set using the following files in the cgroupfs:
+
+- memory.dirty_ratio: the amount of dirty memory (expressed as a percentage of
+  cgroup memory) at which a process generating dirty pages will itself start
+  writing out dirty data.
+
+- memory.dirty_limit_in_bytes: the amount of dirty memory (expressed in bytes)
+  in the cgroup at which a process generating dirty pages will itself start
+  writing out dirty data.  A suffix (k, K, m, M, g, or G) can be used to
+  indicate that the value is in kilobytes, megabytes, or gigabytes.
+
+  Note: memory.dirty_limit_in_bytes is the counterpart of memory.dirty_ratio.
+  Only one of them may be specified at a time.  When one is written it is
+  immediately taken into account to evaluate the dirty memory limits and the
+  other appears as 0 when read.
+
+- memory.dirty_background_ratio: the amount of dirty memory of the cgroup
+  (expressed as a percentage of cgroup memory) at which background writeback
+  kernel threads will start writing out dirty data.
+
+- memory.dirty_background_limit_in_bytes: the amount of dirty memory (expressed
+  in bytes) in the cgroup at which background writeback kernel threads will
+  start writing out dirty data.  A suffix (k, K, m, M, g, or G) can be used to
+  indicate that the value is in kilobytes, megabytes, or gigabytes.
+
+  Note: memory.dirty_background_limit_in_bytes is the counterpart of
+  memory.dirty_background_ratio.  Only one of them may be specified at a time.
+  When one is written it is immediately taken into account to evaluate the dirty
+  memory limits and the other appears as 0 when read.
+
+A cgroup may contain more dirty memory than its dirty limit.  This is possible
+because of the principle that the first cgroup to touch a page is charged for
+it.  Subsequent page accounting events (dirty, writeback, nfs_unstable) are
+also attributed to the originally charged cgroup.
+
+Example: If page is allocated by a cgroup A task, then the page is charged to
+cgroup A.  If the page is later dirtied by a task in cgroup B, then the cgroup A
+dirty count will be incremented.  If cgroup A is over its dirty limit but cgroup
+B is not, then dirtying a cgroup A page from a cgroup B task may push cgroup A
+over its dirty limit without throttling the dirtying cgroup B task.
+
 6. Hierarchy support
 
 The memory controller supports a deep hierarchy and hierarchical accounting.
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v3 03/11] memcg: create extensible page stat update routines
  2010-10-19  0:39 [PATCH v3 00/11] memcg: per cgroup dirty page accounting Greg Thelen
  2010-10-19  0:39 ` [PATCH v3 01/11] memcg: add page_cgroup flags for dirty page tracking Greg Thelen
  2010-10-19  0:39 ` [PATCH v3 02/11] memcg: document cgroup dirty memory interfaces Greg Thelen
@ 2010-10-19  0:39 ` Greg Thelen
  2010-10-19  0:47   ` KAMEZAWA Hiroyuki
  2010-10-19  4:52   ` Daisuke Nishimura
  2010-10-19  0:39 ` [PATCH v3 04/11] memcg: add lock to synchronize page accounting and migration Greg Thelen
                   ` (8 subsequent siblings)
  11 siblings, 2 replies; 64+ messages in thread
From: Greg Thelen @ 2010-10-19  0:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes, Greg Thelen

Replace usage of the mem_cgroup_update_file_mapped() memcg
statistic update routine with two new routines:
* mem_cgroup_inc_page_stat()
* mem_cgroup_dec_page_stat()

As before, only the file_mapped statistic is managed.  However,
these more general interfaces allow for new statistics to be
more easily added.  New statistics are added with memcg dirty
page accounting.
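
As a minimal sketch of the resulting call pattern (mirroring the rmap.c hunk
below), callers adjust a statistic at the point where the corresponding page
state changes:

	if (atomic_inc_and_test(&page->_mapcount)) {
		__inc_zone_page_state(page, NR_FILE_MAPPED);
		mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
	}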

Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrea Righi <arighi@develer.com>
---

Changelog since v1:
- Rename (for clarity):
  - mem_cgroup_write_page_stat_item -> mem_cgroup_page_stat_item
  - mem_cgroup_read_page_stat_item -> mem_cgroup_nr_pages_item

 include/linux/memcontrol.h |   31 ++++++++++++++++++++++++++++---
 mm/memcontrol.c            |   16 +++++++---------
 mm/rmap.c                  |    4 ++--
 3 files changed, 37 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 159a076..067115c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -25,6 +25,11 @@ struct page_cgroup;
 struct page;
 struct mm_struct;
 
+/* Stats that can be updated by kernel. */
+enum mem_cgroup_page_stat_item {
+	MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
+};
+
 extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
 					struct list_head *dst,
 					unsigned long *scanned, int order,
@@ -121,7 +126,22 @@ static inline bool mem_cgroup_disabled(void)
 	return false;
 }
 
-void mem_cgroup_update_file_mapped(struct page *page, int val);
+void mem_cgroup_update_page_stat(struct page *page,
+				 enum mem_cgroup_page_stat_item idx,
+				 int val);
+
+static inline void mem_cgroup_inc_page_stat(struct page *page,
+					    enum mem_cgroup_page_stat_item idx)
+{
+	mem_cgroup_update_page_stat(page, idx, 1);
+}
+
+static inline void mem_cgroup_dec_page_stat(struct page *page,
+					    enum mem_cgroup_page_stat_item idx)
+{
+	mem_cgroup_update_page_stat(page, idx, -1);
+}
+
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask);
 u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
@@ -293,8 +313,13 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
 {
 }
 
-static inline void mem_cgroup_update_file_mapped(struct page *page,
-							int val)
+static inline void mem_cgroup_inc_page_stat(struct page *page,
+					    enum mem_cgroup_page_stat_item idx)
+{
+}
+
+static inline void mem_cgroup_dec_page_stat(struct page *page,
+					    enum mem_cgroup_page_stat_item idx)
 {
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a4034b6..369879a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1609,7 +1609,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
  * possibility of race condition. If there is, we take a lock.
  */
 
-static void mem_cgroup_update_file_stat(struct page *page, int idx, int val)
+void mem_cgroup_update_page_stat(struct page *page,
+				 enum mem_cgroup_page_stat_item idx, int val)
 {
 	struct mem_cgroup *mem;
 	struct page_cgroup *pc = lookup_page_cgroup(page);
@@ -1632,30 +1633,27 @@ static void mem_cgroup_update_file_stat(struct page *page, int idx, int val)
 			goto out;
 	}
 
-	this_cpu_add(mem->stat->count[idx], val);
-
 	switch (idx) {
-	case MEM_CGROUP_STAT_FILE_MAPPED:
+	case MEMCG_NR_FILE_MAPPED:
 		if (val > 0)
 			SetPageCgroupFileMapped(pc);
 		else if (!page_mapped(page))
 			ClearPageCgroupFileMapped(pc);
+		idx = MEM_CGROUP_STAT_FILE_MAPPED;
 		break;
 	default:
 		BUG();
 	}
 
+	this_cpu_add(mem->stat->count[idx], val);
+
 out:
 	if (unlikely(need_unlock))
 		unlock_page_cgroup(pc);
 	rcu_read_unlock();
 	return;
 }
-
-void mem_cgroup_update_file_mapped(struct page *page, int val)
-{
-	mem_cgroup_update_file_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, val);
-}
+EXPORT_SYMBOL(mem_cgroup_update_page_stat);
 
 /*
  * size of first charge trial. "32" comes from vmscan.c's magic value.
diff --git a/mm/rmap.c b/mm/rmap.c
index 1a8bf76..a66ab76 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -911,7 +911,7 @@ void page_add_file_rmap(struct page *page)
 {
 	if (atomic_inc_and_test(&page->_mapcount)) {
 		__inc_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_update_file_mapped(page, 1);
+		mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
 	}
 }
 
@@ -949,7 +949,7 @@ void page_remove_rmap(struct page *page)
 		__dec_zone_page_state(page, NR_ANON_PAGES);
 	} else {
 		__dec_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_update_file_mapped(page, -1);
+		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
 	}
 	/*
 	 * It would be tidy to reset the PageAnon mapping here,
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v3 04/11] memcg: add lock to synchronize page accounting and migration
  2010-10-19  0:39 [PATCH v3 00/11] memcg: per cgroup dirty page accounting Greg Thelen
                   ` (2 preceding siblings ...)
  2010-10-19  0:39 ` [PATCH v3 03/11] memcg: create extensible page stat update routines Greg Thelen
@ 2010-10-19  0:39 ` Greg Thelen
  2010-10-19  0:45   ` KAMEZAWA Hiroyuki
                     ` (2 more replies)
  2010-10-19  0:39 ` [PATCH v3 05/11] memcg: add dirty page accounting infrastructure Greg Thelen
                   ` (7 subsequent siblings)
  11 siblings, 3 replies; 64+ messages in thread
From: Greg Thelen @ 2010-10-19  0:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes, Greg Thelen

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Introduce a new bit spin lock, PCG_MOVE_LOCK, to synchronize
the page accounting and migration code.  This reworks the
locking scheme of _update_stat() and _move_account() by
adding the new lock bit PCG_MOVE_LOCK, which is always taken
with IRQs disabled.

1. If pages are being migrated from a memcg, then updates to
   that memcg's page statistics are protected by grabbing
   PCG_MOVE_LOCK using move_lock_page_cgroup().  In an
   upcoming commit, memcg dirty page accounting will update
   memcg page statistics (specifically the number of
   writeback pages) from IRQ context (softirq).  Avoid a
   deadlocking nested spin lock attempt by disabling IRQs on
   the local processor when grabbing PCG_MOVE_LOCK.

2. The lock for update_page_stat is used only to avoid racing
   with move_account().  So, IRQ awareness of
   lock_page_cgroup() itself is not a problem.  The problem
   is the race between mem_cgroup_update_page_stat() and
   mem_cgroup_move_account_page().

Trade-off:
  * Changing lock_page_cgroup() to always disable IRQs (or
    local_bh) has some impact on performance, and it is
    undesirable to disable IRQs when it is not necessary.
  * Adding a new lock makes move_account() slower.  The
    measurements are below.

Performance Impact: moving an 8G anon process.

Before:
	real    0m0.792s
	user    0m0.000s
	sys     0m0.780s

After:
	real    0m0.854s
	user    0m0.000s
	sys     0m0.842s

This regression is undesirable, but planned optimization patches
should reduce the impact.
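
For clarity, the resulting lock nesting on the move_account() side (taken from
the mem_cgroup_move_account() hunk below) is:

	lock_page_cgroup(pc);			/* never taken in IRQ context */
	if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
		move_lock_page_cgroup(pc, &flags);	/* disables IRQs */
		__mem_cgroup_move_account(pc, from, to, uncharge);
		move_unlock_page_cgroup(pc, &flags);
		ret = 0;
	}
	unlock_page_cgroup(pc);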

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
---
 include/linux/page_cgroup.h |   31 ++++++++++++++++++++++++++++---
 mm/memcontrol.c             |    9 +++++++--
 2 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index b59c298..509452e 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -35,15 +35,18 @@ struct page_cgroup *lookup_page_cgroup(struct page *page);
 
 enum {
 	/* flags for mem_cgroup */
-	PCG_LOCK,  /* page cgroup is locked */
+	PCG_LOCK,  /* Lock for pc->mem_cgroup and following bits. */
 	PCG_CACHE, /* charged as cache */
 	PCG_USED, /* this object is in use. */
-	PCG_ACCT_LRU, /* page has been accounted for */
+	PCG_MIGRATION, /* under page migration */
+	/* flags for mem_cgroup and file and I/O status */
+	PCG_MOVE_LOCK, /* For race between move_account v.s. following bits */
 	PCG_FILE_MAPPED, /* page is accounted as "mapped" */
 	PCG_FILE_DIRTY, /* page is dirty */
 	PCG_FILE_WRITEBACK, /* page is under writeback */
 	PCG_FILE_UNSTABLE_NFS, /* page is NFS unstable */
-	PCG_MIGRATION, /* under page migration */
+	/* No lock in page_cgroup */
+	PCG_ACCT_LRU, /* page has been accounted for (under lru_lock) */
 };
 
 #define TESTPCGFLAG(uname, lname)			\
@@ -119,6 +122,10 @@ static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
 
 static inline void lock_page_cgroup(struct page_cgroup *pc)
 {
+	/*
+	 * Don't take this lock in IRQ context.
+	 * This lock is for pc->mem_cgroup, USED, CACHE, MIGRATION
+	 */
 	bit_spin_lock(PCG_LOCK, &pc->flags);
 }
 
@@ -127,6 +134,24 @@ static inline void unlock_page_cgroup(struct page_cgroup *pc)
 	bit_spin_unlock(PCG_LOCK, &pc->flags);
 }
 
+static inline void move_lock_page_cgroup(struct page_cgroup *pc,
+	unsigned long *flags)
+{
+	/*
+	 * Updates to the page cache statistics bits in pc->flags can come from
+	 * both process context and IRQ context.  Disable IRQs to avoid deadlock.
+	 */
+	local_irq_save(*flags);
+	bit_spin_lock(PCG_MOVE_LOCK, &pc->flags);
+}
+
+static inline void move_unlock_page_cgroup(struct page_cgroup *pc,
+	unsigned long *flags)
+{
+	bit_spin_unlock(PCG_MOVE_LOCK, &pc->flags);
+	local_irq_restore(*flags);
+}
+
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct page_cgroup;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 369879a..697f7b8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1615,6 +1615,7 @@ void mem_cgroup_update_page_stat(struct page *page,
 	struct mem_cgroup *mem;
 	struct page_cgroup *pc = lookup_page_cgroup(page);
 	bool need_unlock = false;
+	unsigned long uninitialized_var(flags);
 
 	if (unlikely(!pc))
 		return;
@@ -1626,7 +1627,7 @@ void mem_cgroup_update_page_stat(struct page *page,
 	/* pc->mem_cgroup is unstable ? */
 	if (unlikely(mem_cgroup_stealed(mem))) {
 		/* take a lock against to access pc->mem_cgroup */
-		lock_page_cgroup(pc);
+		move_lock_page_cgroup(pc, &flags);
 		need_unlock = true;
 		mem = pc->mem_cgroup;
 		if (!mem || !PageCgroupUsed(pc))
@@ -1649,7 +1650,7 @@ void mem_cgroup_update_page_stat(struct page *page,
 
 out:
 	if (unlikely(need_unlock))
-		unlock_page_cgroup(pc);
+		move_unlock_page_cgroup(pc, &flags);
 	rcu_read_unlock();
 	return;
 }
@@ -2203,9 +2204,13 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
 		struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
 {
 	int ret = -EINVAL;
+	unsigned long flags;
+
 	lock_page_cgroup(pc);
 	if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
+		move_lock_page_cgroup(pc, &flags);
 		__mem_cgroup_move_account(pc, from, to, uncharge);
+		move_unlock_page_cgroup(pc, &flags);
 		ret = 0;
 	}
 	unlock_page_cgroup(pc);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v3 05/11] memcg: add dirty page accounting infrastructure
  2010-10-19  0:39 [PATCH v3 00/11] memcg: per cgroup dirty page accounting Greg Thelen
                   ` (3 preceding siblings ...)
  2010-10-19  0:39 ` [PATCH v3 04/11] memcg: add lock to synchronize page accounting and migration Greg Thelen
@ 2010-10-19  0:39 ` Greg Thelen
  2010-10-19  0:49   ` KAMEZAWA Hiroyuki
  2010-10-20  0:53   ` Daisuke Nishimura
  2010-10-19  0:39 ` [PATCH v3 06/11] memcg: add kernel calls for memcg dirty page stats Greg Thelen
                   ` (6 subsequent siblings)
  11 siblings, 2 replies; 64+ messages in thread
From: Greg Thelen @ 2010-10-19  0:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes, Greg Thelen

Add memcg routines to track dirty, writeback, and unstable_NFS pages.
These routines are not yet used by the kernel to count such pages.
A later change adds kernel calls to these new routines.
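
The dirty and unstable-NFS counters rely on the Test{Set,Clear} page_cgroup
flag helpers so a page is only charged on a clean->dirty transition and only
uncharged on a dirty->clean transition.  A condensed sketch of the idea (not
the literal patch code; see mem_cgroup_update_page_stat() below):

	/* Charge only when the flag was not already set ... */
	if (!TestSetPageCgroupFileDirty(pc))
		this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_DIRTY], 1);

	/* ... and uncharge only when the flag was actually set. */
	if (TestClearPageCgroupFileDirty(pc))
		this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_DIRTY], -1);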

Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrea Righi <arighi@develer.com>
---

Changelog since v1:
- Renamed "nfs"/"total_nfs" to "nfs_unstable"/"total_nfs_unstable" in per cgroup
  memory.stat to match /proc/meminfo.
- Rename (for clarity):
  - mem_cgroup_write_page_stat_item -> mem_cgroup_page_stat_item
  - mem_cgroup_read_page_stat_item -> mem_cgroup_nr_pages_item
- Remove redundant comments.
- Made mem_cgroup_move_account_page_stat() inline.

 include/linux/memcontrol.h |    3 ++
 mm/memcontrol.c            |   86 +++++++++++++++++++++++++++++++++++++++----
 2 files changed, 81 insertions(+), 8 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 067115c..ef2eec7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -28,6 +28,9 @@ struct mm_struct;
 /* Stats that can be updated by kernel. */
 enum mem_cgroup_page_stat_item {
 	MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
+	MEMCG_NR_FILE_DIRTY, /* # of dirty pages in page cache */
+	MEMCG_NR_FILE_WRITEBACK, /* # of pages under writeback */
+	MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
 };
 
 extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 697f7b8..3ac2693 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -85,10 +85,13 @@ enum mem_cgroup_stat_index {
 	 */
 	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
 	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
-	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
 	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
 	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
 	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
+	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
+	MEM_CGROUP_STAT_FILE_DIRTY,	/* # of dirty pages in page cache */
+	MEM_CGROUP_STAT_FILE_WRITEBACK,		/* # of pages under writeback */
+	MEM_CGROUP_STAT_FILE_UNSTABLE_NFS,	/* # of NFS unstable pages */
 	MEM_CGROUP_STAT_DATA, /* end of data requires synchronization */
 	/* incremented at every  pagein/pageout */
 	MEM_CGROUP_EVENTS = MEM_CGROUP_STAT_DATA,
@@ -1642,6 +1645,44 @@ void mem_cgroup_update_page_stat(struct page *page,
 			ClearPageCgroupFileMapped(pc);
 		idx = MEM_CGROUP_STAT_FILE_MAPPED;
 		break;
+
+	case MEMCG_NR_FILE_DIRTY:
+		/* Use Test{Set,Clear} to only un/charge the memcg once. */
+		if (val > 0) {
+			if (TestSetPageCgroupFileDirty(pc))
+				val = 0;
+		} else {
+			if (!TestClearPageCgroupFileDirty(pc))
+				val = 0;
+		}
+		idx = MEM_CGROUP_STAT_FILE_DIRTY;
+		break;
+
+	case MEMCG_NR_FILE_WRITEBACK:
+		/*
+		 * This counter is adjusted while holding the mapping's
+		 * tree_lock.  Therefore there is no race between setting and
+		 * clearing of this flag.
+		 */
+		if (val > 0)
+			SetPageCgroupFileWriteback(pc);
+		else
+			ClearPageCgroupFileWriteback(pc);
+		idx = MEM_CGROUP_STAT_FILE_WRITEBACK;
+		break;
+
+	case MEMCG_NR_FILE_UNSTABLE_NFS:
+		/* Use Test{Set,Clear} to only un/charge the memcg once. */
+		if (val > 0) {
+			if (TestSetPageCgroupFileUnstableNFS(pc))
+				val = 0;
+		} else {
+			if (!TestClearPageCgroupFileUnstableNFS(pc))
+				val = 0;
+		}
+		idx = MEM_CGROUP_STAT_FILE_UNSTABLE_NFS;
+		break;
+
 	default:
 		BUG();
 	}
@@ -2146,6 +2187,17 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
 	memcg_check_events(mem, pc->page);
 }
 
+static inline
+void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
+				       struct mem_cgroup *to,
+				       enum mem_cgroup_stat_index idx)
+{
+	preempt_disable();
+	__this_cpu_dec(from->stat->count[idx]);
+	__this_cpu_inc(to->stat->count[idx]);
+	preempt_enable();
+}
+
 /**
  * __mem_cgroup_move_account - move account of the page
  * @pc:	page_cgroup of the page.
@@ -2172,13 +2224,18 @@ static void __mem_cgroup_move_account(struct page_cgroup *pc,
 	VM_BUG_ON(!PageCgroupUsed(pc));
 	VM_BUG_ON(pc->mem_cgroup != from);
 
-	if (PageCgroupFileMapped(pc)) {
-		/* Update mapped_file data for mem_cgroup */
-		preempt_disable();
-		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		preempt_enable();
-	}
+	if (PageCgroupFileMapped(pc))
+		mem_cgroup_move_account_page_stat(from, to,
+					MEM_CGROUP_STAT_FILE_MAPPED);
+	if (PageCgroupFileDirty(pc))
+		mem_cgroup_move_account_page_stat(from, to,
+					MEM_CGROUP_STAT_FILE_DIRTY);
+	if (PageCgroupFileWriteback(pc))
+		mem_cgroup_move_account_page_stat(from, to,
+					MEM_CGROUP_STAT_FILE_WRITEBACK);
+	if (PageCgroupFileUnstableNFS(pc))
+		mem_cgroup_move_account_page_stat(from, to,
+					MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
 	mem_cgroup_charge_statistics(from, pc, false);
 	if (uncharge)
 		/* This is not "cancel", but cancel_charge does all we need. */
@@ -3557,6 +3614,9 @@ enum {
 	MCS_PGPGIN,
 	MCS_PGPGOUT,
 	MCS_SWAP,
+	MCS_FILE_DIRTY,
+	MCS_WRITEBACK,
+	MCS_UNSTABLE_NFS,
 	MCS_INACTIVE_ANON,
 	MCS_ACTIVE_ANON,
 	MCS_INACTIVE_FILE,
@@ -3579,6 +3639,9 @@ struct {
 	{"pgpgin", "total_pgpgin"},
 	{"pgpgout", "total_pgpgout"},
 	{"swap", "total_swap"},
+	{"dirty", "total_dirty"},
+	{"writeback", "total_writeback"},
+	{"nfs_unstable", "total_nfs_unstable"},
 	{"inactive_anon", "total_inactive_anon"},
 	{"active_anon", "total_active_anon"},
 	{"inactive_file", "total_inactive_file"},
@@ -3608,6 +3671,13 @@ mem_cgroup_get_local_stat(struct mem_cgroup *mem, struct mcs_total_stat *s)
 		s->stat[MCS_SWAP] += val * PAGE_SIZE;
 	}
 
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
+	s->stat[MCS_FILE_DIRTY] += val * PAGE_SIZE;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_WRITEBACK);
+	s->stat[MCS_WRITEBACK] += val * PAGE_SIZE;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
+	s->stat[MCS_UNSTABLE_NFS] += val * PAGE_SIZE;
+
 	/* per zone stat */
 	val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
 	s->stat[MCS_INACTIVE_ANON] += val * PAGE_SIZE;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v3 06/11] memcg: add kernel calls for memcg dirty page stats
  2010-10-19  0:39 [PATCH v3 00/11] memcg: per cgroup dirty page accounting Greg Thelen
                   ` (4 preceding siblings ...)
  2010-10-19  0:39 ` [PATCH v3 05/11] memcg: add dirty page accounting infrastructure Greg Thelen
@ 2010-10-19  0:39 ` Greg Thelen
  2010-10-19  0:51   ` KAMEZAWA Hiroyuki
  2010-10-19  7:03   ` Daisuke Nishimura
  2010-10-19  0:39 ` [PATCH v3 07/11] memcg: add dirty limits to mem_cgroup Greg Thelen
                   ` (5 subsequent siblings)
  11 siblings, 2 replies; 64+ messages in thread
From: Greg Thelen @ 2010-10-19  0:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes, Greg Thelen

Add calls into memcg dirty page accounting.  Notify memcg when pages
transition between clean, file dirty, writeback, and unstable nfs.
This allows the memory controller to maintain an accurate view of
the amount of its memory that is dirty.
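
The pattern is the same at every hook point: the memcg counter is adjusted
next to the existing zone and bdi counters.  For example, from the
account_page_dirtied() hunk below (abridged):

	if (mapping_cap_account_dirty(mapping)) {
		mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_DIRTY);
		__inc_zone_page_state(page, NR_FILE_DIRTY);
		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
	}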

Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrea Righi <arighi@develer.com>
---
 fs/nfs/write.c      |    4 ++++
 mm/filemap.c        |    1 +
 mm/page-writeback.c |    4 ++++
 mm/truncate.c       |    1 +
 4 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 4c14c17..a3c39f7 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -450,6 +450,7 @@ nfs_mark_request_commit(struct nfs_page *req)
 			NFS_PAGE_TAG_COMMIT);
 	nfsi->ncommit++;
 	spin_unlock(&inode->i_lock);
+	mem_cgroup_inc_page_stat(req->wb_page, MEMCG_NR_FILE_UNSTABLE_NFS);
 	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
 	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
 	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
@@ -461,6 +462,7 @@ nfs_clear_request_commit(struct nfs_page *req)
 	struct page *page = req->wb_page;
 
 	if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
+		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_UNSTABLE_NFS);
 		dec_zone_page_state(page, NR_UNSTABLE_NFS);
 		dec_bdi_stat(page->mapping->backing_dev_info, BDI_RECLAIMABLE);
 		return 1;
@@ -1316,6 +1318,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
 		req = nfs_list_entry(head->next);
 		nfs_list_remove_request(req);
 		nfs_mark_request_commit(req);
+		mem_cgroup_dec_page_stat(req->wb_page,
+					 MEMCG_NR_FILE_UNSTABLE_NFS);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
 		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
 				BDI_RECLAIMABLE);
diff --git a/mm/filemap.c b/mm/filemap.c
index 49b2d2e..f6bd6f2 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -146,6 +146,7 @@ void __remove_from_page_cache(struct page *page)
 	 * having removed the page entirely.
 	 */
 	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
 		dec_zone_page_state(page, NR_FILE_DIRTY);
 		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
 	}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b840afa..820eb66 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1114,6 +1114,7 @@ int __set_page_dirty_no_writeback(struct page *page)
 void account_page_dirtied(struct page *page, struct address_space *mapping)
 {
 	if (mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_DIRTIED);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
@@ -1303,6 +1304,7 @@ int clear_page_dirty_for_io(struct page *page)
 		 * for more comments.
 		 */
 		if (TestClearPageDirty(page)) {
+			mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
@@ -1333,6 +1335,7 @@ int test_clear_page_writeback(struct page *page)
 				__dec_bdi_stat(bdi, BDI_WRITEBACK);
 				__bdi_writeout_inc(bdi);
 			}
+			mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_WRITEBACK);
 		}
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
@@ -1360,6 +1363,7 @@ int test_set_page_writeback(struct page *page)
 						PAGECACHE_TAG_WRITEBACK);
 			if (bdi_cap_account_writeback(bdi))
 				__inc_bdi_stat(bdi, BDI_WRITEBACK);
+			mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_WRITEBACK);
 		}
 		if (!PageDirty(page))
 			radix_tree_tag_clear(&mapping->page_tree,
diff --git a/mm/truncate.c b/mm/truncate.c
index cd94607..54cca83 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -76,6 +76,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
 	if (TestClearPageDirty(page)) {
 		struct address_space *mapping = page->mapping;
 		if (mapping && mapping_cap_account_dirty(mapping)) {
+			mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v3 07/11] memcg: add dirty limits to mem_cgroup
  2010-10-19  0:39 [PATCH v3 00/11] memcg: per cgroup dirty page accounting Greg Thelen
                   ` (5 preceding siblings ...)
  2010-10-19  0:39 ` [PATCH v3 06/11] memcg: add kernel calls for memcg dirty page stats Greg Thelen
@ 2010-10-19  0:39 ` Greg Thelen
  2010-10-19  0:53   ` KAMEZAWA Hiroyuki
  2010-10-20  0:50   ` Daisuke Nishimura
  2010-10-19  0:39 ` [PATCH v3 08/11] memcg: CPU hotplug lockdep warning fix Greg Thelen
                   ` (4 subsequent siblings)
  11 siblings, 2 replies; 64+ messages in thread
From: Greg Thelen @ 2010-10-19  0:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes, Greg Thelen

Extend mem_cgroup to contain dirty page limits.  Also add routines
allowing the kernel to query the dirty usage of a memcg.

These interfaces are not yet used by the kernel.  A subsequent commit
will add kernel calls that utilize these new routines.
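
A minimal sketch (illustrative only; the real consumer is the page writeback
patch later in this series, and the fallback to determine_dirtyable_memory()
is an assumption for this example) of how a caller is expected to use the new
query routines:

	struct vm_dirty_param param;
	s64 nr_dirtyable = -1;

	/* Per-memcg limits, or the global vm_dirty_* values for root/disabled. */
	vm_dirty_param(&param);
	if (mem_cgroup_has_dirty_limit())
		nr_dirtyable = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
	if (nr_dirtyable < 0)
		/* no memcg limit (or racy memcg move): use the global counters */
		nr_dirtyable = determine_dirtyable_memory();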

Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrea Righi <arighi@develer.com>
---

Changelog since v1:
- Rename (for clarity):
  - mem_cgroup_write_page_stat_item -> mem_cgroup_page_stat_item
  - mem_cgroup_read_page_stat_item -> mem_cgroup_nr_pages_item
- Removed unnecessary get_ prefix from get_xxx() functions.
- Avoid lockdep warnings by using rcu_read_[un]lock() in
  mem_cgroup_has_dirty_limit().

 include/linux/memcontrol.h |   44 ++++++++++
 mm/memcontrol.c            |  186 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 229 insertions(+), 1 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ef2eec7..6f3a136 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -19,6 +19,7 @@
 
 #ifndef _LINUX_MEMCONTROL_H
 #define _LINUX_MEMCONTROL_H
+#include <linux/writeback.h>
 #include <linux/cgroup.h>
 struct mem_cgroup;
 struct page_cgroup;
@@ -33,6 +34,30 @@ enum mem_cgroup_page_stat_item {
 	MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
 };
 
+/* Cgroup memory statistics items exported to the kernel. */
+enum mem_cgroup_nr_pages_item {
+	MEMCG_NR_DIRTYABLE_PAGES,
+	MEMCG_NR_RECLAIM_PAGES,
+	MEMCG_NR_WRITEBACK,
+	MEMCG_NR_DIRTY_WRITEBACK_PAGES,
+};
+
+/* Dirty memory parameters */
+struct vm_dirty_param {
+	int dirty_ratio;
+	int dirty_background_ratio;
+	unsigned long dirty_bytes;
+	unsigned long dirty_background_bytes;
+};
+
+static inline void global_vm_dirty_param(struct vm_dirty_param *param)
+{
+	param->dirty_ratio = vm_dirty_ratio;
+	param->dirty_bytes = vm_dirty_bytes;
+	param->dirty_background_ratio = dirty_background_ratio;
+	param->dirty_background_bytes = dirty_background_bytes;
+}
+
 extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
 					struct list_head *dst,
 					unsigned long *scanned, int order,
@@ -145,6 +170,10 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 	mem_cgroup_update_page_stat(page, idx, -1);
 }
 
+bool mem_cgroup_has_dirty_limit(void);
+void vm_dirty_param(struct vm_dirty_param *param);
+s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item);
+
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask);
 u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
@@ -326,6 +355,21 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 {
 }
 
+static inline bool mem_cgroup_has_dirty_limit(void)
+{
+	return false;
+}
+
+static inline void vm_dirty_param(struct vm_dirty_param *param)
+{
+	global_vm_dirty_param(param);
+}
+
+static inline s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item)
+{
+	return -ENOSYS;
+}
+
 static inline
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 					    gfp_t gfp_mask)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3ac2693..f876919 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -233,6 +233,10 @@ struct mem_cgroup {
 	atomic_t	refcnt;
 
 	unsigned int	swappiness;
+
+	/* control memory cgroup dirty pages */
+	struct vm_dirty_param dirty_param;
+
 	/* OOM-Killer disable */
 	int		oom_kill_disable;
 
@@ -1149,6 +1153,178 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
 	return swappiness;
 }
 
+/*
+ * Returns a snapshot of the current dirty limits which is not synchronized with
+ * the routines that change the dirty limits.  If this routine races with an
+ * update to the dirty bytes/ratio value, then the caller must handle the case
+ * where both dirty_[background_]_ratio and _bytes are set.
+ */
+static void __mem_cgroup_dirty_param(struct vm_dirty_param *param,
+				     struct mem_cgroup *mem)
+{
+	if (mem && !mem_cgroup_is_root(mem)) {
+		param->dirty_ratio = mem->dirty_param.dirty_ratio;
+		param->dirty_bytes = mem->dirty_param.dirty_bytes;
+		param->dirty_background_ratio =
+			mem->dirty_param.dirty_background_ratio;
+		param->dirty_background_bytes =
+			mem->dirty_param.dirty_background_bytes;
+	} else {
+		global_vm_dirty_param(param);
+	}
+}
+
+/*
+ * Get dirty memory parameters of the current memcg or global values (if memory
+ * cgroups are disabled or querying the root cgroup).
+ *
+ * The current task may be moved to another cgroup while we access its cgroup,
+ * changing the task's dirty limits.  But a precise check is meaningless because
+ * the task can be moved after our access and writeback tends to take a long
+ * time.  At least, "memcg" will not be freed while holding rcu_read_lock().
+ */
+void vm_dirty_param(struct vm_dirty_param *param)
+{
+	struct mem_cgroup *memcg;
+
+	if (mem_cgroup_disabled()) {
+		global_vm_dirty_param(param);
+		return;
+	}
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	__mem_cgroup_dirty_param(param, memcg);
+	rcu_read_unlock();
+}
+
+/*
+ * Return true if the current memory cgroup has local dirty memory settings.
+ * The current task is allowed to migrate into or out of the root cgroup while
+ * this routine runs, so the return value may be incorrect if the task is being
+ * migrated at the same time.
+ */
+bool mem_cgroup_has_dirty_limit(void)
+{
+	struct mem_cgroup *mem;
+	bool ret;
+
+	if (mem_cgroup_disabled())
+		return false;
+
+	rcu_read_lock();
+	mem = mem_cgroup_from_task(current);
+	ret = mem && !mem_cgroup_is_root(mem);
+	rcu_read_unlock();
+
+	return ret;
+}
+
+static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
+{
+	if (!do_swap_account)
+		return nr_swap_pages > 0;
+	return !memcg->memsw_is_minimum &&
+		(res_counter_read_u64(&memcg->memsw, RES_LIMIT) > 0);
+}
+
+static s64 mem_cgroup_local_page_stat(struct mem_cgroup *mem,
+				      enum mem_cgroup_nr_pages_item item)
+{
+	s64 ret;
+
+	switch (item) {
+	case MEMCG_NR_DIRTYABLE_PAGES:
+		ret = mem_cgroup_read_stat(mem, LRU_ACTIVE_FILE) +
+			mem_cgroup_read_stat(mem, LRU_INACTIVE_FILE);
+		if (mem_cgroup_can_swap(mem))
+			ret += mem_cgroup_read_stat(mem, LRU_ACTIVE_ANON) +
+				mem_cgroup_read_stat(mem, LRU_INACTIVE_ANON);
+		break;
+	case MEMCG_NR_RECLAIM_PAGES:
+		ret = mem_cgroup_read_stat(mem,	MEM_CGROUP_STAT_FILE_DIRTY) +
+			mem_cgroup_read_stat(mem,
+					     MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
+		break;
+	case MEMCG_NR_WRITEBACK:
+		ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_WRITEBACK);
+		break;
+	case MEMCG_NR_DIRTY_WRITEBACK_PAGES:
+		ret = mem_cgroup_read_stat(mem,
+					   MEM_CGROUP_STAT_FILE_WRITEBACK) +
+			mem_cgroup_read_stat(mem,
+					     MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
+		break;
+	default:
+		BUG();
+		break;
+	}
+	return ret;
+}
+
+static unsigned long long
+memcg_hierarchical_free_pages(struct mem_cgroup *mem)
+{
+	struct cgroup *cgroup;
+	unsigned long long min_free, free;
+
+	min_free = res_counter_read_u64(&mem->res, RES_LIMIT) -
+		res_counter_read_u64(&mem->res, RES_USAGE);
+	cgroup = mem->css.cgroup;
+	if (!mem->use_hierarchy)
+		goto out;
+
+	while (cgroup->parent) {
+		cgroup = cgroup->parent;
+		mem = mem_cgroup_from_cont(cgroup);
+		if (!mem->use_hierarchy)
+			break;
+		free = res_counter_read_u64(&mem->res, RES_LIMIT) -
+			res_counter_read_u64(&mem->res, RES_USAGE);
+		min_free = min(min_free, free);
+	}
+out:
+	/* Translate free memory in pages */
+	return min_free >> PAGE_SHIFT;
+}
+
+/*
+ * mem_cgroup_page_stat() - get memory cgroup file cache statistics
+ * @item:      memory statistic item exported to the kernel
+ *
+ * Return the accounted statistic value.
+ */
+s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item)
+{
+	struct mem_cgroup *mem;
+	struct mem_cgroup *iter;
+	s64 value;
+
+	rcu_read_lock();
+	mem = mem_cgroup_from_task(current);
+	if (mem && !mem_cgroup_is_root(mem)) {
+		/*
+		 * If we're looking for dirtyable pages we need to evaluate
+		 * free pages depending on the limit and usage of the parents
+		 * first of all.
+		 */
+		if (item == MEMCG_NR_DIRTYABLE_PAGES)
+			value = memcg_hierarchical_free_pages(mem);
+		else
+			value = 0;
+		/*
+		 * Recursively evaluate page statistics against all cgroups
+		 * in the hierarchy tree.
+		 */
+		for_each_mem_cgroup_tree(iter, mem)
+			value += mem_cgroup_local_page_stat(iter, item);
+	} else
+		value = -EINVAL;
+	rcu_read_unlock();
+
+	return value;
+}
+
 static void mem_cgroup_start_move(struct mem_cgroup *mem)
 {
 	int cpu;
@@ -4457,8 +4633,16 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	spin_lock_init(&mem->reclaim_param_lock);
 	INIT_LIST_HEAD(&mem->oom_notify);
 
-	if (parent)
+	if (parent) {
 		mem->swappiness = get_swappiness(parent);
+		__mem_cgroup_dirty_param(&mem->dirty_param, parent);
+	} else {
+		/*
+		 * The root cgroup dirty_param field is not used; instead, the
+		 * system-wide dirty limits are used.
+		 */
+	}
+
 	atomic_set(&mem->refcnt, 1);
 	mem->move_charge_at_immigrate = 0;
 	mutex_init(&mem->thresholds_lock);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v3 08/11] memcg: CPU hotplug lockdep warning fix
  2010-10-19  0:39 [PATCH v3 00/11] memcg: per cgroup dirty page accounting Greg Thelen
                   ` (6 preceding siblings ...)
  2010-10-19  0:39 ` [PATCH v3 07/11] memcg: add dirty limits to mem_cgroup Greg Thelen
@ 2010-10-19  0:39 ` Greg Thelen
  2010-10-19  0:54   ` KAMEZAWA Hiroyuki
  2010-10-20  3:47   ` Daisuke Nishimura
  2010-10-19  0:39 ` [PATCH v3 09/11] memcg: add cgroupfs interface to memcg dirty limits Greg Thelen
                   ` (3 subsequent siblings)
  11 siblings, 2 replies; 64+ messages in thread
From: Greg Thelen @ 2010-10-19  0:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes

From: Balbir Singh <balbir@linux.vnet.ibm.com>

memcg has lockdep warnings (sleep inside rcu lock)

A recent move to get_online_cpus() ends up calling get_online_cpus() from
mem_cgroup_read_stat().  However, mem_cgroup_read_stat() is called under the
RCU lock, and get_online_cpus() can sleep.  The dirty limit patches expose
this bug more readily due to their usage of mem_cgroup_page_stat().

This patch addresses the issue identified by lockdep by moving the
hotplug protection to a higher layer.  This might increase the time
required for CPU hotplug, but not by much.

Warning messages

BUG: sleeping function called from invalid context at kernel/cpu.c:62
in_atomic(): 0, irqs_disabled(): 0, pid: 6325, name: pagetest
2 locks held by pagetest/6325:
do_page_fault+0x27d/0x4a0
mem_cgroup_page_stat+0x0/0x23f
Pid: 6325, comm: pagetest Not tainted 2.6.36-rc5-mm1+ #201
Call Trace:
[<ffffffff81041224>] __might_sleep+0x12d/0x131
[<ffffffff8104f4af>] get_online_cpus+0x1c/0x51
[<ffffffff8110eedb>] mem_cgroup_read_stat+0x27/0xa3
[<ffffffff811125d2>] mem_cgroup_page_stat+0x131/0x23f
[<ffffffff811124a1>] ? mem_cgroup_page_stat+0x0/0x23f
[<ffffffff810d57c3>] global_dirty_limits+0x42/0xf8
[<ffffffff810d58b3>] throttle_vm_writeout+0x3a/0xb4
[<ffffffff810dc2f8>] shrink_zone+0x3e6/0x3f8
[<ffffffff81074a35>] ? ktime_get_ts+0xb2/0xbf
[<ffffffff810dd1aa>] do_try_to_free_pages+0x106/0x478
[<ffffffff810dd601>] try_to_free_mem_cgroup_pages+0xe5/0x14c
[<ffffffff8110f947>] mem_cgroup_hierarchical_reclaim+0x314/0x3a2
[<ffffffff81111b31>] __mem_cgroup_try_charge+0x29b/0x593
[<ffffffff8111194a>] ? __mem_cgroup_try_charge+0xb4/0x593
[<ffffffff81071258>] ? local_clock+0x40/0x59
[<ffffffff81009015>] ? sched_clock+0x9/0xd
[<ffffffff810710d5>] ? sched_clock_local+0x1c/0x82
[<ffffffff8111398a>] mem_cgroup_charge_common+0x4b/0x76
[<ffffffff81141469>] ? bio_add_page+0x36/0x38
[<ffffffff81113ba9>] mem_cgroup_cache_charge+0x1f4/0x214
[<ffffffff810cd195>] add_to_page_cache_locked+0x4a/0x148
....

Acked-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---
 mm/memcontrol.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f876919..412ce73 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -588,7 +588,6 @@ static s64 mem_cgroup_read_stat(struct mem_cgroup *mem,
 	int cpu;
 	s64 val = 0;
 
-	get_online_cpus();
 	for_each_online_cpu(cpu)
 		val += per_cpu(mem->stat->count[idx], cpu);
 #ifdef CONFIG_HOTPLUG_CPU
@@ -596,7 +595,6 @@ static s64 mem_cgroup_read_stat(struct mem_cgroup *mem,
 	val += mem->nocpu_base.count[idx];
 	spin_unlock(&mem->pcp_counter_lock);
 #endif
-	put_online_cpus();
 	return val;
 }
 
@@ -1300,6 +1298,7 @@ s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item)
 	struct mem_cgroup *iter;
 	s64 value;
 
+	get_online_cpus();
 	rcu_read_lock();
 	mem = mem_cgroup_from_task(current);
 	if (mem && !mem_cgroup_is_root(mem)) {
@@ -1321,6 +1320,7 @@ s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item)
 	} else
 		value = -EINVAL;
 	rcu_read_unlock();
+	put_online_cpus();
 
 	return value;
 }
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v3 09/11] memcg: add cgroupfs interface to memcg dirty limits
  2010-10-19  0:39 [PATCH v3 00/11] memcg: per cgroup dirty page accounting Greg Thelen
                   ` (7 preceding siblings ...)
  2010-10-19  0:39 ` [PATCH v3 08/11] memcg: CPU hotplug lockdep warning fix Greg Thelen
@ 2010-10-19  0:39 ` Greg Thelen
  2010-10-19  0:56   ` KAMEZAWA Hiroyuki
  2010-10-20  3:31   ` Daisuke Nishimura
  2010-10-19  0:39 ` [PATCH v3 10/11] writeback: make determine_dirtyable_memory() static Greg Thelen
                   ` (2 subsequent siblings)
  11 siblings, 2 replies; 64+ messages in thread
From: Greg Thelen @ 2010-10-19  0:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes, Greg Thelen

Add cgroupfs interface to memcg dirty page limits:
  Direct write-out is controlled with:
  - memory.dirty_ratio
  - memory.dirty_limit_in_bytes

  Background write-out is controlled with:
  - memory.dirty_background_ratio
  - memory.dirty_background_limit_in_bytes

Other memcg cgroupfs files support 'M', 'm', 'k', 'K', 'g'
and 'G' suffixes for byte counts.  This patch provides the
same functionality for memory.dirty_limit_in_bytes and
memory.dirty_background_limit_in_bytes.
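
As the write handlers in this patch show, the ratio and bytes forms of each
limit are mutually exclusive: writing one zeroes the other.  A small userspace
sketch of exercising that behavior (the cgroup path is an assumption):

	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	/* Assumed cgroupfs path. */
	#define GRP "/cgroups/memory/grp0/memory."

	static void put(const char *file, const char *val)
	{
		char p[128];
		int fd;

		snprintf(p, sizeof(p), GRP "%s", file);
		fd = open(p, O_WRONLY);
		if (fd < 0)
			return;
		write(fd, val, strlen(val));
		close(fd);
	}

	int main(void)
	{
		put("dirty_limit_in_bytes", "512M");	/* suffix accepted */
		put("dirty_ratio", "10");	/* zeroes dirty_limit_in_bytes */
		return 0;
	}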

Signed-off-by: Andrea Righi <arighi@develer.com>
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
---

Changelog since v1:
- Renamed newly created proc files:
  - memory.dirty_bytes -> memory.dirty_limit_in_bytes
  - memory.dirty_background_bytes -> memory.dirty_background_limit_in_bytes
- Allow [kKmMgG] suffixes for newly created dirty limit value cgroupfs files.

 mm/memcontrol.c |  116 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 116 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 412ce73..580e665 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -100,6 +100,13 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_NSTATS,
 };
 
+enum {
+	MEM_CGROUP_DIRTY_RATIO,
+	MEM_CGROUP_DIRTY_LIMIT_IN_BYTES,
+	MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
+	MEM_CGROUP_DIRTY_BACKGROUND_LIMIT_IN_BYTES,
+};
+
 struct mem_cgroup_stat_cpu {
 	s64 count[MEM_CGROUP_STAT_NSTATS];
 };
@@ -4311,6 +4318,91 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
 	return 0;
 }
 
+static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+	bool root;
+
+	root = mem_cgroup_is_root(mem);
+
+	switch (cft->private) {
+	case MEM_CGROUP_DIRTY_RATIO:
+		return root ? vm_dirty_ratio : mem->dirty_param.dirty_ratio;
+	case MEM_CGROUP_DIRTY_LIMIT_IN_BYTES:
+		return root ? vm_dirty_bytes : mem->dirty_param.dirty_bytes;
+	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
+		return root ? dirty_background_ratio :
+			mem->dirty_param.dirty_background_ratio;
+	case MEM_CGROUP_DIRTY_BACKGROUND_LIMIT_IN_BYTES:
+		return root ? dirty_background_bytes :
+			mem->dirty_param.dirty_background_bytes;
+	default:
+		BUG();
+	}
+}
+
+static int
+mem_cgroup_dirty_write_string(struct cgroup *cgrp, struct cftype *cft,
+				const char *buffer)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	int type = cft->private;
+	int ret = -EINVAL;
+	unsigned long long val;
+
+	if (cgrp->parent == NULL)
+		return ret;
+
+	switch (type) {
+	case MEM_CGROUP_DIRTY_LIMIT_IN_BYTES:
+		/* This function does all necessary parse...reuse it */
+		ret = res_counter_memparse_write_strategy(buffer, &val);
+		if (ret)
+			break;
+		memcg->dirty_param.dirty_bytes = val;
+		memcg->dirty_param.dirty_ratio  = 0;
+		break;
+	case MEM_CGROUP_DIRTY_BACKGROUND_LIMIT_IN_BYTES:
+		ret = res_counter_memparse_write_strategy(buffer, &val);
+		if (ret)
+			break;
+		memcg->dirty_param.dirty_background_bytes = val;
+		memcg->dirty_param.dirty_background_ratio = 0;
+		break;
+	default:
+		BUG();
+		break;
+	}
+	return ret;
+}
+
+static int
+mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	int type = cft->private;
+
+	if (cgrp->parent == NULL)
+		return -EINVAL;
+	if ((type == MEM_CGROUP_DIRTY_RATIO ||
+	     type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)
+		return -EINVAL;
+	switch (type) {
+	case MEM_CGROUP_DIRTY_RATIO:
+		memcg->dirty_param.dirty_ratio = val;
+		memcg->dirty_param.dirty_bytes = 0;
+		break;
+	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
+		memcg->dirty_param.dirty_background_ratio = val;
+		memcg->dirty_param.dirty_background_bytes = 0;
+		break;
+	default:
+		BUG();
+		break;
+	}
+	return 0;
+}
+
 static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
@@ -4374,6 +4466,30 @@ static struct cftype mem_cgroup_files[] = {
 		.unregister_event = mem_cgroup_oom_unregister_event,
 		.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
 	},
+	{
+		.name = "dirty_ratio",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_u64 = mem_cgroup_dirty_write,
+		.private = MEM_CGROUP_DIRTY_RATIO,
+	},
+	{
+		.name = "dirty_limit_in_bytes",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_string = mem_cgroup_dirty_write_string,
+		.private = MEM_CGROUP_DIRTY_LIMIT_IN_BYTES,
+	},
+	{
+		.name = "dirty_background_ratio",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_u64 = mem_cgroup_dirty_write,
+		.private = MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
+	},
+	{
+		.name = "dirty_background_limit_in_bytes",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_string = mem_cgroup_dirty_write_string,
+		.private = MEM_CGROUP_DIRTY_BACKGROUND_LIMIT_IN_BYTES,
+	},
 };
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v3 10/11] writeback: make determine_dirtyable_memory() static.
  2010-10-19  0:39 [PATCH v3 00/11] memcg: per cgroup dirty page accounting Greg Thelen
                   ` (8 preceding siblings ...)
  2010-10-19  0:39 ` [PATCH v3 09/11] memcg: add cgroupfs interface to memcg dirty limits Greg Thelen
@ 2010-10-19  0:39 ` Greg Thelen
  2010-10-19  0:57   ` KAMEZAWA Hiroyuki
  2010-10-20  3:47   ` Daisuke Nishimura
  2010-10-19  0:39 ` [PATCH v3 11/11] memcg: check memcg dirty limits in page writeback Greg Thelen
  2010-10-20  3:21 ` [PATCH][memcg+dirtylimit] Fix overwriting global vm dirty limit setting by memcg (Re: [PATCH v3 00/11] memcg: per cgroup dirty page accounting KAMEZAWA Hiroyuki
  11 siblings, 2 replies; 64+ messages in thread
From: Greg Thelen @ 2010-10-19  0:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes, Greg Thelen

The determine_dirtyable_memory() function is not used outside of
page writeback.  Make the routine static.  No functional change.
Just a cleanup in preparation for a change that adds memcg dirty
limits consideration into global_dirty_limits().

Signed-off-by: Andrea Righi <arighi@develer.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
---
 include/linux/writeback.h |    2 -
 mm/page-writeback.c       |  122 ++++++++++++++++++++++----------------------
 2 files changed, 61 insertions(+), 63 deletions(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index c7299d2..c18e374 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -105,8 +105,6 @@ extern int vm_highmem_is_dirtyable;
 extern int block_dump;
 extern int laptop_mode;
 
-extern unsigned long determine_dirtyable_memory(void);
-
 extern int dirty_background_ratio_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp,
 		loff_t *ppos);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 820eb66..a0bb3e2 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -132,6 +132,67 @@ static struct prop_descriptor vm_completions;
 static struct prop_descriptor vm_dirties;
 
 /*
+ * Work out the current dirty-memory clamping and background writeout
+ * thresholds.
+ *
+ * The main aim here is to lower them aggressively if there is a lot of mapped
+ * memory around.  To avoid stressing page reclaim with lots of unreclaimable
+ * pages.  It is better to clamp down on writers than to start swapping, and
+ * performing lots of scanning.
+ *
+ * We only allow 1/2 of the currently-unmapped memory to be dirtied.
+ *
+ * We don't permit the clamping level to fall below 5% - that is getting rather
+ * excessive.
+ *
+ * We make sure that the background writeout level is below the adjusted
+ * clamping level.
+ */
+
+static unsigned long highmem_dirtyable_memory(unsigned long total)
+{
+#ifdef CONFIG_HIGHMEM
+	int node;
+	unsigned long x = 0;
+
+	for_each_node_state(node, N_HIGH_MEMORY) {
+		struct zone *z =
+			&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
+
+		x += zone_page_state(z, NR_FREE_PAGES) +
+		     zone_reclaimable_pages(z);
+	}
+	/*
+	 * Make sure that the number of highmem pages is never larger
+	 * than the number of the total dirtyable memory. This can only
+	 * occur in very strange VM situations but we want to make sure
+	 * that this does not occur.
+	 */
+	return min(x, total);
+#else
+	return 0;
+#endif
+}
+
+/**
+ * determine_dirtyable_memory - amount of memory that may be used
+ *
+ * Returns the numebr of pages that can currently be freed and used
+ * by the kernel for direct mappings.
+ */
+static unsigned long determine_dirtyable_memory(void)
+{
+	unsigned long x;
+
+	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
+
+	if (!vm_highmem_is_dirtyable)
+		x -= highmem_dirtyable_memory(x);
+
+	return x + 1;	/* Ensure that we never return 0 */
+}
+
+/*
  * couple the period to the dirty_ratio:
  *
  *   period/2 ~ roundup_pow_of_two(dirty limit)
@@ -337,67 +398,6 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned max_ratio)
 EXPORT_SYMBOL(bdi_set_max_ratio);
 
 /*
- * Work out the current dirty-memory clamping and background writeout
- * thresholds.
- *
- * The main aim here is to lower them aggressively if there is a lot of mapped
- * memory around.  To avoid stressing page reclaim with lots of unreclaimable
- * pages.  It is better to clamp down on writers than to start swapping, and
- * performing lots of scanning.
- *
- * We only allow 1/2 of the currently-unmapped memory to be dirtied.
- *
- * We don't permit the clamping level to fall below 5% - that is getting rather
- * excessive.
- *
- * We make sure that the background writeout level is below the adjusted
- * clamping level.
- */
-
-static unsigned long highmem_dirtyable_memory(unsigned long total)
-{
-#ifdef CONFIG_HIGHMEM
-	int node;
-	unsigned long x = 0;
-
-	for_each_node_state(node, N_HIGH_MEMORY) {
-		struct zone *z =
-			&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
-
-		x += zone_page_state(z, NR_FREE_PAGES) +
-		     zone_reclaimable_pages(z);
-	}
-	/*
-	 * Make sure that the number of highmem pages is never larger
-	 * than the number of the total dirtyable memory. This can only
-	 * occur in very strange VM situations but we want to make sure
-	 * that this does not occur.
-	 */
-	return min(x, total);
-#else
-	return 0;
-#endif
-}
-
-/**
- * determine_dirtyable_memory - amount of memory that may be used
- *
- * Returns the numebr of pages that can currently be freed and used
- * by the kernel for direct mappings.
- */
-unsigned long determine_dirtyable_memory(void)
-{
-	unsigned long x;
-
-	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
-
-	if (!vm_highmem_is_dirtyable)
-		x -= highmem_dirtyable_memory(x);
-
-	return x + 1;	/* Ensure that we never return 0 */
-}
-
-/*
  * global_dirty_limits - background-writeback and dirty-throttling thresholds
  *
  * Calculate the dirty thresholds based on sysctl parameters
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v3 11/11] memcg: check memcg dirty limits in page writeback
  2010-10-19  0:39 [PATCH v3 00/11] memcg: per cgroup dirty page accounting Greg Thelen
                   ` (9 preceding siblings ...)
  2010-10-19  0:39 ` [PATCH v3 10/11] writeback: make determine_dirtyable_memory() static Greg Thelen
@ 2010-10-19  0:39 ` Greg Thelen
  2010-10-19  1:00   ` KAMEZAWA Hiroyuki
  2010-10-20  5:25   ` Daisuke Nishimura
  2010-10-20  3:21 ` [PATCH][memcg+dirtylimit] Fix overwriting global vm dirty limit setting by memcg (Re: [PATCH v3 00/11] memcg: per cgroup dirty page accounting KAMEZAWA Hiroyuki
  11 siblings, 2 replies; 64+ messages in thread
From: Greg Thelen @ 2010-10-19  0:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes, Greg Thelen

If the current process is in a non-root memcg, then
global_dirty_limits() will consider the memcg dirty limit.
This allows different cgroups to have distinct dirty limits
which trigger direct and background writeback at different
levels.
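
For illustration, with hypothetical numbers (not taken from this patch):

  memory.dirty_ratio                  = 10
  memory.dirty_background_ratio       = 5
  dirtyable memory seen by the cgroup ~= 2G

  direct write-out starts     at ~10% of 2G = ~200M of dirty pages
  background write-out starts at  ~5% of 2G = ~100M of dirty pages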

Signed-off-by: Andrea Righi <arighi@develer.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
---

Changelog since v1:
- Removed unnecessary get_ prefix from get_xxx() functions.

 mm/page-writeback.c |   89 +++++++++++++++++++++++++++++++++++++++++---------
 1 files changed, 73 insertions(+), 16 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index a0bb3e2..9b34f01 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -180,7 +180,7 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
  * Returns the numebr of pages that can currently be freed and used
  * by the kernel for direct mappings.
  */
-static unsigned long determine_dirtyable_memory(void)
+static unsigned long global_dirtyable_memory(void)
 {
 	unsigned long x;
 
@@ -192,6 +192,58 @@ static unsigned long determine_dirtyable_memory(void)
 	return x + 1;	/* Ensure that we never return 0 */
 }
 
+static unsigned long dirtyable_memory(void)
+{
+	unsigned long memory;
+	s64 memcg_memory;
+
+	memory = global_dirtyable_memory();
+	if (!mem_cgroup_has_dirty_limit())
+		return memory;
+	memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
+	BUG_ON(memcg_memory < 0);
+
+	return min((unsigned long)memcg_memory, memory);
+}
+
+static long reclaimable_pages(void)
+{
+	s64 ret;
+
+	if (!mem_cgroup_has_dirty_limit())
+		return global_page_state(NR_FILE_DIRTY) +
+			global_page_state(NR_UNSTABLE_NFS);
+	ret = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
+	BUG_ON(ret < 0);
+
+	return ret;
+}
+
+static long writeback_pages(void)
+{
+	s64 ret;
+
+	if (!mem_cgroup_has_dirty_limit())
+		return global_page_state(NR_WRITEBACK);
+	ret = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
+	BUG_ON(ret < 0);
+
+	return ret;
+}
+
+static unsigned long dirty_writeback_pages(void)
+{
+	s64 ret;
+
+	if (!mem_cgroup_has_dirty_limit())
+		return global_page_state(NR_UNSTABLE_NFS) +
+			global_page_state(NR_WRITEBACK);
+	ret = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
+	BUG_ON(ret < 0);
+
+	return ret;
+}
+
 /*
  * couple the period to the dirty_ratio:
  *
@@ -204,8 +256,8 @@ static int calc_period_shift(void)
 	if (vm_dirty_bytes)
 		dirty_total = vm_dirty_bytes / PAGE_SIZE;
 	else
-		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
-				100;
+		dirty_total = (vm_dirty_ratio * global_dirtyable_memory()) /
+			100;
 	return 2 + ilog2(dirty_total - 1);
 }
 
@@ -410,18 +462,23 @@ void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
 {
 	unsigned long background;
 	unsigned long dirty;
-	unsigned long available_memory = determine_dirtyable_memory();
+	unsigned long available_memory = dirtyable_memory();
 	struct task_struct *tsk;
+	struct vm_dirty_param dirty_param;
 
-	if (vm_dirty_bytes)
-		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
+	vm_dirty_param(&dirty_param);
+
+	if (dirty_param.dirty_bytes)
+		dirty = DIV_ROUND_UP(dirty_param.dirty_bytes, PAGE_SIZE);
 	else
-		dirty = (vm_dirty_ratio * available_memory) / 100;
+		dirty = (dirty_param.dirty_ratio * available_memory) / 100;
 
-	if (dirty_background_bytes)
-		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
+	if (dirty_param.dirty_background_bytes)
+		background = DIV_ROUND_UP(dirty_param.dirty_background_bytes,
+					  PAGE_SIZE);
 	else
-		background = (dirty_background_ratio * available_memory) / 100;
+		background = (dirty_param.dirty_background_ratio *
+			      available_memory) / 100;
 
 	if (background >= dirty)
 		background = dirty / 2;
@@ -493,9 +550,8 @@ static void balance_dirty_pages(struct address_space *mapping,
 			.range_cyclic	= 1,
 		};
 
-		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
-					global_page_state(NR_UNSTABLE_NFS);
-		nr_writeback = global_page_state(NR_WRITEBACK);
+		nr_reclaimable = reclaimable_pages();
+		nr_writeback = writeback_pages();
 
 		global_dirty_limits(&background_thresh, &dirty_thresh);
 
@@ -652,6 +708,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
 {
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
+	unsigned long dirty;
 
         for ( ; ; ) {
 		global_dirty_limits(&background_thresh, &dirty_thresh);
@@ -662,9 +719,9 @@ void throttle_vm_writeout(gfp_t gfp_mask)
                  */
                 dirty_thresh += dirty_thresh / 10;      /* wheeee... */
 
-                if (global_page_state(NR_UNSTABLE_NFS) +
-			global_page_state(NR_WRITEBACK) <= dirty_thresh)
-                        	break;
+		dirty = dirty_writeback_pages();
+		if (dirty <= dirty_thresh)
+			break;
                 congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/*
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 04/11] memcg: add lock to synchronize page accounting and migration
  2010-10-19  0:39 ` [PATCH v3 04/11] memcg: add lock to synchronize page accounting and migration Greg Thelen
@ 2010-10-19  0:45   ` KAMEZAWA Hiroyuki
  2010-10-19  4:43     ` [RFC][PATCH 1/2] memcg: move_account optimization by reduct put,get page (Re: " KAMEZAWA Hiroyuki
  2010-10-19  1:17   ` Minchan Kim
  2010-10-19  5:03   ` Daisuke Nishimura
  2 siblings, 1 reply; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-19  0:45 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes

On Mon, 18 Oct 2010 17:39:37 -0700
Greg Thelen <gthelen@google.com> wrote:

> Performance Impact: moving a 8G anon process.
> 
> Before:
> 	real    0m0.792s
> 	user    0m0.000s
> 	sys     0m0.780s
> 
> After:
> 	real    0m0.854s
> 	user    0m0.000s
> 	sys     0m0.842s
> 
> This score is bad but planned patches for optimization can reduce
> this impact.
> 

I'll post optimization patches after this set goes to -mm.
RFC version will be posted soon.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 02/11] memcg: document cgroup dirty memory interfaces
  2010-10-19  0:39 ` [PATCH v3 02/11] memcg: document cgroup dirty memory interfaces Greg Thelen
@ 2010-10-19  0:46   ` KAMEZAWA Hiroyuki
  2010-10-19  8:27   ` Daisuke Nishimura
  1 sibling, 0 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-19  0:46 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes

On Mon, 18 Oct 2010 17:39:35 -0700
Greg Thelen <gthelen@google.com> wrote:

> Document cgroup dirty memory interfaces and statistics.
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>

I think you don't need to drop the Ack if there are no major changes.

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 03/11] memcg: create extensible page stat update routines
  2010-10-19  0:39 ` [PATCH v3 03/11] memcg: create extensible page stat update routines Greg Thelen
@ 2010-10-19  0:47   ` KAMEZAWA Hiroyuki
  2010-10-19  4:52   ` Daisuke Nishimura
  1 sibling, 0 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-19  0:47 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes

On Mon, 18 Oct 2010 17:39:36 -0700
Greg Thelen <gthelen@google.com> wrote:

> Replace usage of the mem_cgroup_update_file_mapped() memcg
> statistic update routine with two new routines:
> * mem_cgroup_inc_page_stat()
> * mem_cgroup_dec_page_stat()
> 
> As before, only the file_mapped statistic is managed.  However,
> these more general interfaces allow for new statistics to be
> more easily added.  New statistics are added with memcg dirty
> page accounting.
> 
> Signed-off-by: Greg Thelen <gthelen@google.com>
> Signed-off-by: Andrea Righi <arighi@develer.com>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 05/11] memcg: add dirty page accounting infrastructure
  2010-10-19  0:39 ` [PATCH v3 05/11] memcg: add dirty page accounting infrastructure Greg Thelen
@ 2010-10-19  0:49   ` KAMEZAWA Hiroyuki
  2010-10-20  0:53   ` Daisuke Nishimura
  1 sibling, 0 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-19  0:49 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes

On Mon, 18 Oct 2010 17:39:38 -0700
Greg Thelen <gthelen@google.com> wrote:

> Add memcg routines to track dirty, writeback, and unstable_NFS pages.
> These routines are not yet used by the kernel to count such pages.
> A later change adds kernel calls to these new routines.
> 
> Signed-off-by: Greg Thelen <gthelen@google.com>
> Signed-off-by: Andrea Righi <arighi@develer.com>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 06/11] memcg: add kernel calls for memcg dirty page stats
  2010-10-19  0:39 ` [PATCH v3 06/11] memcg: add kernel calls for memcg dirty page stats Greg Thelen
@ 2010-10-19  0:51   ` KAMEZAWA Hiroyuki
  2010-10-19  7:03   ` Daisuke Nishimura
  1 sibling, 0 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-19  0:51 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes

On Mon, 18 Oct 2010 17:39:39 -0700
Greg Thelen <gthelen@google.com> wrote:

> Add calls into memcg dirty page accounting.  Notify memcg when pages
> transition between clean, file dirty, writeback, and unstable nfs.
> This allows the memory controller to maintain an accurate view of
> the amount of its memory that is dirty.
> 
> Signed-off-by: Greg Thelen <gthelen@google.com>
> Signed-off-by: Andrea Righi <arighi@develer.com>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 07/11] memcg: add dirty limits to mem_cgroup
  2010-10-19  0:39 ` [PATCH v3 07/11] memcg: add dirty limits to mem_cgroup Greg Thelen
@ 2010-10-19  0:53   ` KAMEZAWA Hiroyuki
  2010-10-20  0:50   ` Daisuke Nishimura
  1 sibling, 0 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-19  0:53 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes

On Mon, 18 Oct 2010 17:39:40 -0700
Greg Thelen <gthelen@google.com> wrote:

> Extend mem_cgroup to contain dirty page limits.  Also add routines
> allowing the kernel to query the dirty usage of a memcg.
> 
> These interfaces are not used by the kernel yet.  A subsequent commit
> will add kernel calls to utilize these new routines.
> 
> Signed-off-by: Greg Thelen <gthelen@google.com>
> Signed-off-by: Andrea Righi <arighi@develer.com>
> ---
> 
> Changelog since v1:
> - Rename (for clarity):
>   - mem_cgroup_write_page_stat_item -> mem_cgroup_page_stat_item
>   - mem_cgroup_read_page_stat_item -> mem_cgroup_nr_pages_item
> - Removed unnecessary get_ prefix from get_xxx() functions.
> - Avoid lockdep warnings by using rcu_read_[un]lock() in
>   mem_cgroup_has_dirty_limit().
> 

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 08/11] memcg: CPU hotplug lockdep warning fix
  2010-10-19  0:39 ` [PATCH v3 08/11] memcg: CPU hotplug lockdep warning fix Greg Thelen
@ 2010-10-19  0:54   ` KAMEZAWA Hiroyuki
  2010-10-20  3:47   ` Daisuke Nishimura
  1 sibling, 0 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-19  0:54 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes

On Mon, 18 Oct 2010 17:39:41 -0700
Greg Thelen <gthelen@google.com> wrote:

> From: Balbir Singh <balbir@linux.vnet.ibm.com>
> 
> memcg has lockdep warnings (sleep inside rcu lock)
> 
> From: Balbir Singh <balbir@linux.vnet.ibm.com>
> 
> Recent move to get_online_cpus() ends up calling get_online_cpus() from
> mem_cgroup_read_stat(). However mem_cgroup_read_stat() is called under rcu
> lock. get_online_cpus() can sleep. The dirty limit patches expose
> this BUG more readily due to their usage of mem_cgroup_page_stat()
> 
> This patch addresses this issue as identified by lockdep and moves the
> hotplug protection to a higher layer. This might increase the time
> required to hotplug, but not by much.
> 
> Warning messages
> 
> BUG: sleeping function called from invalid context at kernel/cpu.c:62
> in_atomic(): 0, irqs_disabled(): 0, pid: 6325, name: pagetest
> 2 locks held by pagetest/6325:
> do_page_fault+0x27d/0x4a0
> mem_cgroup_page_stat+0x0/0x23f
> Pid: 6325, comm: pagetest Not tainted 2.6.36-rc5-mm1+ #201
> Call Trace:
> [<ffffffff81041224>] __might_sleep+0x12d/0x131
> [<ffffffff8104f4af>] get_online_cpus+0x1c/0x51
> [<ffffffff8110eedb>] mem_cgroup_read_stat+0x27/0xa3
> [<ffffffff811125d2>] mem_cgroup_page_stat+0x131/0x23f
> [<ffffffff811124a1>] ? mem_cgroup_page_stat+0x0/0x23f
> [<ffffffff810d57c3>] global_dirty_limits+0x42/0xf8
> [<ffffffff810d58b3>] throttle_vm_writeout+0x3a/0xb4
> [<ffffffff810dc2f8>] shrink_zone+0x3e6/0x3f8
> [<ffffffff81074a35>] ? ktime_get_ts+0xb2/0xbf
> [<ffffffff810dd1aa>] do_try_to_free_pages+0x106/0x478
> [<ffffffff810dd601>] try_to_free_mem_cgroup_pages+0xe5/0x14c
> [<ffffffff8110f947>] mem_cgroup_hierarchical_reclaim+0x314/0x3a2
> [<ffffffff81111b31>] __mem_cgroup_try_charge+0x29b/0x593
> [<ffffffff8111194a>] ? __mem_cgroup_try_charge+0xb4/0x593
> [<ffffffff81071258>] ? local_clock+0x40/0x59
> [<ffffffff81009015>] ? sched_clock+0x9/0xd
> [<ffffffff810710d5>] ? sched_clock_local+0x1c/0x82
> [<ffffffff8111398a>] mem_cgroup_charge_common+0x4b/0x76
> [<ffffffff81141469>] ? bio_add_page+0x36/0x38
> [<ffffffff81113ba9>] mem_cgroup_cache_charge+0x1f4/0x214
> [<ffffffff810cd195>] add_to_page_cache_locked+0x4a/0x148
> ....
> 
> Acked-by: Greg Thelen <gthelen@google.com>
> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 09/11] memcg: add cgroupfs interface to memcg dirty limits
  2010-10-19  0:39 ` [PATCH v3 09/11] memcg: add cgroupfs interface to memcg dirty limits Greg Thelen
@ 2010-10-19  0:56   ` KAMEZAWA Hiroyuki
  2010-10-20  3:31   ` Daisuke Nishimura
  1 sibling, 0 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-19  0:56 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes

On Mon, 18 Oct 2010 17:39:42 -0700
Greg Thelen <gthelen@google.com> wrote:

> Add cgroupfs interface to memcg dirty page limits:
>   Direct write-out is controlled with:
>   - memory.dirty_ratio
>   - memory.dirty_limit_in_bytes
> 
>   Background write-out is controlled with:
>   - memory.dirty_background_ratio
>   - memory.dirty_background_limit_in_bytes
> 
> Other memcg cgroupfs files support 'M', 'm', 'k', 'K', 'g'
> and 'G' suffixes for byte counts.  This patch provides the
> same functionality for memory.dirty_limit_in_bytes and
> memory.dirty_background_limit_in_bytes.
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>
> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 10/11] writeback: make determine_dirtyable_memory() static.
  2010-10-19  0:39 ` [PATCH v3 10/11] writeback: make determine_dirtyable_memory() static Greg Thelen
@ 2010-10-19  0:57   ` KAMEZAWA Hiroyuki
  2010-10-20  3:47   ` Daisuke Nishimura
  1 sibling, 0 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-19  0:57 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes

On Mon, 18 Oct 2010 17:39:43 -0700
Greg Thelen <gthelen@google.com> wrote:

> The determine_dirtyable_memory() function is not used outside of
> page writeback.  Make the routine static.  No functional change.
> Just a cleanup in preparation for a change that adds memcg dirty
> limits consideration into global_dirty_limits().
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 11/11] memcg: check memcg dirty limits in page writeback
  2010-10-19  0:39 ` [PATCH v3 11/11] memcg: check memcg dirty limits in page writeback Greg Thelen
@ 2010-10-19  1:00   ` KAMEZAWA Hiroyuki
  2010-10-20  4:18     ` KAMEZAWA Hiroyuki
  2010-10-20  5:25   ` Daisuke Nishimura
  1 sibling, 1 reply; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-19  1:00 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes

On Mon, 18 Oct 2010 17:39:44 -0700
Greg Thelen <gthelen@google.com> wrote:

> If the current process is in a non-root memcg, then
> global_dirty_limits() will consider the memcg dirty limit.
> This allows different cgroups to have distinct dirty limits
> which trigger direct and background writeback at different
> levels.
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 04/11] memcg: add lock to synchronize page accounting and migration
  2010-10-19  0:39 ` [PATCH v3 04/11] memcg: add lock to synchronize page accounting and migration Greg Thelen
  2010-10-19  0:45   ` KAMEZAWA Hiroyuki
@ 2010-10-19  1:17   ` Minchan Kim
  2010-10-19  5:03   ` Daisuke Nishimura
  2 siblings, 0 replies; 64+ messages in thread
From: Minchan Kim @ 2010-10-19  1:17 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura, Ciju Rajan K,
	David Rientjes

On Tue, Oct 19, 2010 at 9:39 AM, Greg Thelen <gthelen@google.com> wrote:
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> Introduce a new bit spin lock, PCG_MOVE_LOCK, to synchronize
> the page accounting and migration code.  This reworks the
> locking scheme of _update_stat() and _move_account() by
> adding new lock bit PCG_MOVE_LOCK, which is always taken
> under IRQ disable.
>
> 1. If pages are being migrated from a memcg, then updates to
>   that memcg page statistics are protected by grabbing
>   PCG_MOVE_LOCK using move_lock_page_cgroup().  In an
>   upcoming commit, memcg dirty page accounting will be
>   updating memcg page accounting (specifically: num
>   writeback pages) from IRQ context (softirq).  Avoid a
>   deadlocking nested spin lock attempt by disabling irq on
>   the local processor when grabbing the PCG_MOVE_LOCK.
>
> 2. lock for update_page_stat is used only for avoiding race
>   with move_account().  So, IRQ awareness of
>   lock_page_cgroup() itself is not a problem.  The problem
>   is between mem_cgroup_update_page_stat() and
>   mem_cgroup_move_account_page().
>
> Trade-off:
>  * Changing lock_page_cgroup() to always disable IRQ (or
>    local_bh) has some impacts on performance and I think
>    it's bad to disable IRQ when it's not necessary.
>  * adding a new lock makes move_account() slower.  Score is
>    here.
>
> Performance Impact: moving a 8G anon process.
>
> Before:
>        real    0m0.792s
>        user    0m0.000s
>        sys     0m0.780s
>
> After:
>        real    0m0.854s
>        user    0m0.000s
>        sys     0m0.842s
>
> This score is bad but planned patches for optimization can reduce
> this impact.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>


-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 01/11] memcg: add page_cgroup flags for dirty page tracking
  2010-10-19  0:39 ` [PATCH v3 01/11] memcg: add page_cgroup flags for dirty page tracking Greg Thelen
@ 2010-10-19  4:31   ` Daisuke Nishimura
  0 siblings, 0 replies; 64+ messages in thread
From: Daisuke Nishimura @ 2010-10-19  4:31 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Minchan Kim, Ciju Rajan K,
	David Rientjes, Daisuke Nishimura

On Mon, 18 Oct 2010 17:39:34 -0700
Greg Thelen <gthelen@google.com> wrote:

> Add additional flags to page_cgroup to track dirty pages
> within a mem_cgroup.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: Andrea Righi <arighi@develer.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>

Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [RFC][PATCH 1/2] memcg: move_account optimization  by reduct put,get page (Re: [PATCH v3 04/11] memcg: add lock to synchronize page accounting and migration
  2010-10-19  0:45   ` KAMEZAWA Hiroyuki
@ 2010-10-19  4:43     ` KAMEZAWA Hiroyuki
  2010-10-19  4:45       ` [RFC][PATCH 2/2] memcg: move_account optimization by reduce locks " KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-19  4:43 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Greg Thelen, Andrew Morton, linux-kernel, linux-mm, containers,
	Andrea Righi, Balbir Singh, Daisuke Nishimura, Minchan Kim,
	Ciju Rajan K, David Rientjes

On Tue, 19 Oct 2010 09:45:12 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Mon, 18 Oct 2010 17:39:37 -0700
> Greg Thelen <gthelen@google.com> wrote:
> 
> > Performance Impact: moving a 8G anon process.
> > 
> > Before:
> > 	real    0m0.792s
> > 	user    0m0.000s
> > 	sys     0m0.780s
> > 
> > After:
> > 	real    0m0.854s
> > 	user    0m0.000s
> > 	sys     0m0.842s
> > 
> > This score is bad but planned patches for optimization can reduce
> > this impact.
> > 
> 
> I'll post optimization patches after this set goes to -mm.
> RFC version will be posted soon.
> 

This is a RFC and based on dirty-limit v3 patch.
Then, I'll post again this when dirty-limit patches are queued.
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

When moving accounts, a major source of cost is put_page()/get_page().
This patch reduces that cost by exploiting the fact that all operations are
done under pte_lock().
Because move_account() runs under pte_lock, pages present in the page table
can never be freed, so we don't need to do get/put_page() when isolating
pages from the LRU.

Cost of moving an 8G anon process.

[mmotm]
	real    0m0.792s
	user    0m0.000s
	sys     0m0.780s
	
[dirty]
        real    0m0.854s
        user    0m0.000s
        sys     0m0.842s
[get/put]
	real    0m0.757s
	user    0m0.000s
	sys     0m0.746s

seems nice.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   59 +++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 35 insertions(+), 24 deletions(-)

Index: dirty_limit_new/mm/memcontrol.c
===================================================================
--- dirty_limit_new.orig/mm/memcontrol.c
+++ dirty_limit_new/mm/memcontrol.c
@@ -4844,9 +4844,13 @@ one_by_one:
  * Returns
  *   0(MC_TARGET_NONE): if the pte is not a target for move charge.
  *   1(MC_TARGET_PAGE): if the page corresponding to this pte is a target for
- *     move charge. if @target is not NULL, the page is stored in target->page
- *     with extra refcnt got(Callers should handle it).
- *   2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
+ *     move charge and it's mapped. If @target is not NULL, the page is stored,
+ *     in target->ent. We expect pte_lock is held throughout the operation and
+ *     no extra page_get() is done.
+ *   2.(MC_TARGET_UNMAPPED_PAGE): if the page corresponding to this pte is a
+ *     target for move charge and it's not mapped. If @target is not NULL, the
+ *     page is stored in target->ent with extra refcnt got.
+ *   3(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
  *     target for charge migration. if @target is not NULL, the entry is stored
  *     in target->ent.
  *
@@ -4859,7 +4863,8 @@ union mc_target {
 
 enum mc_target_type {
 	MC_TARGET_NONE,	/* not used */
-	MC_TARGET_PAGE,
+	MC_TARGET_PAGE, /* page mapped */
+	MC_TARGET_UNMAPPED_PAGE, /* page not mapped*/
 	MC_TARGET_SWAP,
 };
 
@@ -4877,8 +4882,6 @@ static struct page *mc_handle_present_pt
 	} else if (!move_file())
 		/* we ignore mapcount for file pages */
 		return NULL;
-	if (!get_page_unless_zero(page))
-		return NULL;
 
 	return page;
 }
@@ -4944,13 +4947,17 @@ static int is_target_pte_for_mc(struct v
 	struct page_cgroup *pc;
 	int ret = 0;
 	swp_entry_t ent = { .val = 0 };
+	bool mapped = true;
 
-	if (pte_present(ptent))
+	if (pte_present(ptent)) {
 		page = mc_handle_present_pte(vma, addr, ptent);
-	else if (is_swap_pte(ptent))
-		page = mc_handle_swap_pte(vma, addr, ptent, &ent);
-	else if (pte_none(ptent) || pte_file(ptent))
-		page = mc_handle_file_pte(vma, addr, ptent, &ent);
+	} else {
+		mapped = false;
+		if (is_swap_pte(ptent))
+			page = mc_handle_swap_pte(vma, addr, ptent, &ent);
+		else if (pte_none(ptent) || pte_file(ptent))
+			page = mc_handle_file_pte(vma, addr, ptent, &ent);
+	}
 
 	if (!page && !ent.val)
 		return 0;
@@ -4962,11 +4969,14 @@ static int is_target_pte_for_mc(struct v
 		 * the lock.
 		 */
 		if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
-			ret = MC_TARGET_PAGE;
+			if (mapped)
+				ret = MC_TARGET_PAGE;
+			else
+				ret = MC_TARGET_UNMAPPED_PAGE;
 			if (target)
 				target->page = page;
 		}
-		if (!ret || !target)
+		if (!mapped && (!ret || !target))
 			put_page(page);
 	}
 	/* There is a swap entry and a page doesn't exist or isn't charged */
@@ -5153,19 +5163,20 @@ retry:
 		type = is_target_pte_for_mc(vma, addr, ptent, &target);
 		switch (type) {
 		case MC_TARGET_PAGE:
+		case MC_TARGET_UNMAPPED_PAGE:
 			page = target.page;
-			if (isolate_lru_page(page))
-				goto put;
-			pc = lookup_page_cgroup(page);
-			if (!mem_cgroup_move_account(pc,
-						mc.from, mc.to, false)) {
-				mc.precharge--;
-				/* we uncharge from mc.from later. */
-				mc.moved_charge++;
+			if (!isolate_lru_page(page)) {
+				pc = lookup_page_cgroup(page);
+				if (!mem_cgroup_move_account(pc, mc.from,
+						mc.to, false)) {
+					mc.precharge--;
+					/* we uncharge from mc.from later. */
+					mc.moved_charge++;
+				}
+				putback_lru_page(page);
 			}
-			putback_lru_page(page);
-put:			/* is_target_pte_for_mc() gets the page */
-			put_page(page);
+			if (type == MC_TARGET_UNMAPPED_PAGE)
+				put_page(page);
 			break;
 		case MC_TARGET_SWAP:
 			ent = target.ent;


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [RFC][PATCH 2/2] memcg: move_account optimization  by reduce locks (Re: [PATCH v3 04/11] memcg: add lock to synchronize page accounting and migration
  2010-10-19  4:43     ` [RFC][PATCH 1/2] memcg: move_account optimization by reduct put,get page (Re: " KAMEZAWA Hiroyuki
@ 2010-10-19  4:45       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-19  4:45 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Greg Thelen, Andrew Morton, linux-kernel, linux-mm, containers,
	Andrea Righi, Balbir Singh, Daisuke Nishimura, Minchan Kim,
	Ciju Rajan K, David Rientjes

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Reduce locking when moving accounts.

The patch "memcg: add lock to synchronize page accounting and migration" adds
a new lock and doubles the locking cost. This patch reduces that cost.

When moving charges by scanning a page table, we do all the work under
pte_lock. This means we can never race with "uncharge". Because of that,
we can skip lock_page_cgroup() in some situations.

The cost of moving an 8G anon process
==
[mmotm-1013]
Before:
	real    0m0.792s
	user    0m0.000s
	sys     0m0.780s
	
[dirty-limit v3 patch]
        real    0m0.854s
        user    0m0.000s
        sys     0m0.842s
[get/put optimization ]
	real    0m0.757s
	user    0m0.000s
	sys     0m0.746s

[this patch]
	real    0m0.732s
	user    0m0.000s
	sys     0m0.721s

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 Documentation/cgroups/memory.txt |   23 ++++++++++++++++++++++-
 mm/memcontrol.c                  |   29 ++++++++++++++++++++++-------
 2 files changed, 44 insertions(+), 8 deletions(-)

Index: dirty_limit_new/mm/memcontrol.c
===================================================================
--- dirty_limit_new.orig/mm/memcontrol.c
+++ dirty_limit_new/mm/memcontrol.c
@@ -2386,7 +2386,6 @@ static void __mem_cgroup_move_account(st
 {
 	VM_BUG_ON(from == to);
 	VM_BUG_ON(PageLRU(pc->page));
-	VM_BUG_ON(!PageCgroupLocked(pc));
 	VM_BUG_ON(!PageCgroupUsed(pc));
 	VM_BUG_ON(pc->mem_cgroup != from);
 
@@ -2424,19 +2423,32 @@ static void __mem_cgroup_move_account(st
  * __mem_cgroup_move_account()
  */
 static int mem_cgroup_move_account(struct page_cgroup *pc,
-		struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
+		struct mem_cgroup *from, struct mem_cgroup *to,
+		bool uncharge, bool stable)
 {
 	int ret = -EINVAL;
 	unsigned long flags;
-
-	lock_page_cgroup(pc);
+	/*
+	 * When stable==true, some lock (page_table_lock etc.) prevents
+	 * modification of PCG_USED bit and pc->mem_cgroup never be invalid.
+	 * IOW, there will be no race with charge/uncharge. From another point
+	 * of view, there will be other races with codes which accesses
+	 * pc->mem_cgroup under lock_page_cgroup(). Considering what
+	 * pc->mem_cgroup the codes will see, they'll see old or new value and
+	 * both of values will never be invalid while they holds
+	 * lock_page_cgroup(). There is no probelm to skip lock_page_cgroup
+	 * when we can.
+	 */
+	if (!stable)
+		lock_page_cgroup(pc);
 	if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
 		move_lock_page_cgroup(pc, &flags);
 		__mem_cgroup_move_account(pc, from, to, uncharge);
 		move_unlock_page_cgroup(pc, &flags);
 		ret = 0;
 	}
-	unlock_page_cgroup(pc);
+	if (!stable)
+		unlock_page_cgroup(pc);
 	/*
 	 * check events
 	 */
@@ -2474,7 +2486,7 @@ static int mem_cgroup_move_parent(struct
 	if (ret || !parent)
 		goto put_back;
 
-	ret = mem_cgroup_move_account(pc, child, parent, true);
+	ret = mem_cgroup_move_account(pc, child, parent, true, false);
 	if (ret)
 		mem_cgroup_cancel_charge(parent);
 put_back:
@@ -5156,6 +5168,7 @@ retry:
 		struct page *page;
 		struct page_cgroup *pc;
 		swp_entry_t ent;
+		bool mapped = false;
 
 		if (!mc.precharge)
 			break;
@@ -5163,12 +5176,14 @@ retry:
 		type = is_target_pte_for_mc(vma, addr, ptent, &target);
 		switch (type) {
 		case MC_TARGET_PAGE:
+			mapped = true;
+			/* Fall Through */
 		case MC_TARGET_UNMAPPED_PAGE:
 			page = target.page;
 			if (!isolate_lru_page(page)) {
 				pc = lookup_page_cgroup(page);
 				if (!mem_cgroup_move_account(pc, mc.from,
-						mc.to, false)) {
+						mc.to, false, mapped)) {
 					mc.precharge--;
 					/* we uncharge from mc.from later. */
 					mc.moved_charge++;
Index: dirty_limit_new/Documentation/cgroups/memory.txt
===================================================================
--- dirty_limit_new.orig/Documentation/cgroups/memory.txt
+++ dirty_limit_new/Documentation/cgroups/memory.txt
@@ -637,7 +637,28 @@ memory cgroup.
       | page_mapcount(page) > 1). You must enable Swap Extension(see 2.4) to
       | enable move of swap charges.
 
-8.3 TODO
+8.3 Implemenation Detail
+
+  At moving, we need to take care of races. At first thinking, there are
+  several sources of race when we overwrite pc->mem_cgroup.
+  - charge/uncharge
+  - file stat (dirty, writeback, etc..) accounting
+  - LRU add/remove
+
+  Against charge/uncharge, we do all "move" under pte_lock. So, if we move
+  chareges of a mapped pages, we don't need extra locks. If not mapped,
+  we need to take lock_page_cgroup.
+
+  Against file-stat accouning, we need some locks. Current implementation
+  uses 2 level locking, one is light-weight, another is heavy.
+  A light-weight scheme is to use per-cpu counter. If someone moving a charge
+  from a mem_cgroup, per-cpu "caution" counter is incremented and file-stat
+  update will use heavy lock. This heavy lock is a special lock for move_charge
+  and allow mutual execution of accessing pc->mem_cgroup.
+
+  Against LRU, we do isolate_lru_page() before move_account().
+
+8.4 TODO
 
 - Implement madvise(2) to let users decide the vma to be moved or not to be
   moved.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 03/11] memcg: create extensible page stat update routines
  2010-10-19  0:39 ` [PATCH v3 03/11] memcg: create extensible page stat update routines Greg Thelen
  2010-10-19  0:47   ` KAMEZAWA Hiroyuki
@ 2010-10-19  4:52   ` Daisuke Nishimura
  1 sibling, 0 replies; 64+ messages in thread
From: Daisuke Nishimura @ 2010-10-19  4:52 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Minchan Kim, Ciju Rajan K,
	David Rientjes, Daisuke Nishimura

On Mon, 18 Oct 2010 17:39:36 -0700
Greg Thelen <gthelen@google.com> wrote:

> Replace usage of the mem_cgroup_update_file_mapped() memcg
> statistic update routine with two new routines:
> * mem_cgroup_inc_page_stat()
> * mem_cgroup_dec_page_stat()
> 
> As before, only the file_mapped statistic is managed.  However,
> these more general interfaces allow for new statistics to be
> more easily added.  New statistics are added with memcg dirty
> page accounting.
> 
> Signed-off-by: Greg Thelen <gthelen@google.com>
> Signed-off-by: Andrea Righi <arighi@develer.com>

Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 04/11] memcg: add lock to synchronize page accounting and migration
  2010-10-19  0:39 ` [PATCH v3 04/11] memcg: add lock to synchronize page accounting and migration Greg Thelen
  2010-10-19  0:45   ` KAMEZAWA Hiroyuki
  2010-10-19  1:17   ` Minchan Kim
@ 2010-10-19  5:03   ` Daisuke Nishimura
  2 siblings, 0 replies; 64+ messages in thread
From: Daisuke Nishimura @ 2010-10-19  5:03 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Minchan Kim, Ciju Rajan K,
	David Rientjes, Daisuke Nishimura

On Mon, 18 Oct 2010 17:39:37 -0700
Greg Thelen <gthelen@google.com> wrote:

> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Introduce a new bit spin lock, PCG_MOVE_LOCK, to synchronize
> the page accounting and migration code.  This reworks the
> locking scheme of _update_stat() and _move_account() by
> adding new lock bit PCG_MOVE_LOCK, which is always taken
> under IRQ disable.
> 
> 1. If pages are being migrated from a memcg, then updates to
>    that memcg page statistics are protected by grabbing
>    PCG_MOVE_LOCK using move_lock_page_cgroup().  In an
>    upcoming commit, memcg dirty page accounting will be
>    updating memcg page accounting (specifically: num
>    writeback pages) from IRQ context (softirq).  Avoid a
>    deadlocking nested spin lock attempt by disabling irq on
>    the local processor when grabbing the PCG_MOVE_LOCK.
> 
> 2. lock for update_page_stat is used only for avoiding race
>    with move_account().  So, IRQ awareness of
>    lock_page_cgroup() itself is not a problem.  The problem
>    is between mem_cgroup_update_page_stat() and
>    mem_cgroup_move_account_page().
> 
> Trade-off:
>   * Changing lock_page_cgroup() to always disable IRQ (or
>     local_bh) has some impacts on performance and I think
>     it's bad to disable IRQ when it's not necessary.
>   * adding a new lock makes move_account() slower.  Score is
>     here.
> 
> Performance Impact: moving a 8G anon process.
> 
> Before:
> 	real    0m0.792s
> 	user    0m0.000s
> 	sys     0m0.780s
> 
> After:
> 	real    0m0.854s
> 	user    0m0.000s
> 	sys     0m0.842s
> 
> This score is bad but planned patches for optimization can reduce
> this impact.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>

This approach is more straightforward and easy to understand.

Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 06/11] memcg: add kernel calls for memcg dirty page stats
  2010-10-19  0:39 ` [PATCH v3 06/11] memcg: add kernel calls for memcg dirty page stats Greg Thelen
  2010-10-19  0:51   ` KAMEZAWA Hiroyuki
@ 2010-10-19  7:03   ` Daisuke Nishimura
  1 sibling, 0 replies; 64+ messages in thread
From: Daisuke Nishimura @ 2010-10-19  7:03 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Minchan Kim, Ciju Rajan K,
	David Rientjes, Daisuke Nishimura

On Mon, 18 Oct 2010 17:39:39 -0700
Greg Thelen <gthelen@google.com> wrote:

> Add calls into memcg dirty page accounting.  Notify memcg when pages
> transition between clean, file dirty, writeback, and unstable nfs.
> This allows the memory controller to maintain an accurate view of
> the amount of its memory that is dirty.
> 
> Signed-off-by: Greg Thelen <gthelen@google.com>
> Signed-off-by: Andrea Righi <arighi@develer.com>

Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 02/11] memcg: document cgroup dirty memory interfaces
  2010-10-19  0:39 ` [PATCH v3 02/11] memcg: document cgroup dirty memory interfaces Greg Thelen
  2010-10-19  0:46   ` KAMEZAWA Hiroyuki
@ 2010-10-19  8:27   ` Daisuke Nishimura
  2010-10-19 21:00     ` Greg Thelen
  1 sibling, 1 reply; 64+ messages in thread
From: Daisuke Nishimura @ 2010-10-19  8:27 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Minchan Kim, Ciju Rajan K,
	David Rientjes, Daisuke Nishimura

On Mon, 18 Oct 2010 17:39:35 -0700
Greg Thelen <gthelen@google.com> wrote:

> Document cgroup dirty memory interfaces and statistics.
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>
> ---
> 
> Changelog since v1:
> - Renamed "nfs"/"total_nfs" to "nfs_unstable"/"total_nfs_unstable" in per cgroup
>   memory.stat to match /proc/meminfo.
> 
> - Allow [kKmMgG] suffixes for newly created dirty limit value cgroupfs files.
> 
> - Describe a situation where a cgroup can exceed its dirty limit.
> 
>  Documentation/cgroups/memory.txt |   60 ++++++++++++++++++++++++++++++++++++++
>  1 files changed, 60 insertions(+), 0 deletions(-)
> 
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index 7781857..02bbd6f 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -385,6 +385,10 @@ mapped_file	- # of bytes of mapped file (includes tmpfs/shmem)
>  pgpgin		- # of pages paged in (equivalent to # of charging events).
>  pgpgout		- # of pages paged out (equivalent to # of uncharging events).
>  swap		- # of bytes of swap usage
> +dirty		- # of bytes that are waiting to get written back to the disk.
> +writeback	- # of bytes that are actively being written back to the disk.
> +nfs_unstable	- # of bytes sent to the NFS server, but not yet committed to
> +		the actual storage.
>  inactive_anon	- # of bytes of anonymous memory and swap cache memory on
>  		LRU list.
>  active_anon	- # of bytes of anonymous and swap cache memory on active

Shouldn't we add a description of "total_dirty/writeback/nfs_unstable" too?
Looking at [5/11], they will be shown in memory.stat.

> @@ -453,6 +457,62 @@ memory under it will be reclaimed.
>  You can reset failcnt by writing 0 to failcnt file.
>  # echo 0 > .../memory.failcnt
>  
> +5.5 dirty memory
> +
> +Control the maximum amount of dirty pages a cgroup can have at any given time.
> +
> +Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim)
> +page cache used by a cgroup.  So, in case of multiple cgroup writers, they will
> +not be able to consume more than their designated share of dirty pages and will
> +be forced to perform write-out if they cross that limit.
> +
> +The interface is equivalent to the procfs interface: /proc/sys/vm/dirty_*.  It
> +is possible to configure a limit to trigger both a direct writeback or a
> +background writeback performed by per-bdi flusher threads.  The root cgroup
> +memory.dirty_* control files are read-only and match the contents of
> +the /proc/sys/vm/dirty_* files.
> +
> +Per-cgroup dirty limits can be set using the following files in the cgroupfs:
> +
> +- memory.dirty_ratio: the amount of dirty memory (expressed as a percentage of
> +  cgroup memory) at which a process generating dirty pages will itself start
> +  writing out dirty data.
> +
> +- memory.dirty_limit_in_bytes: the amount of dirty memory (expressed in bytes)
> +  in the cgroup at which a process generating dirty pages will start itself
> +  writing out dirty data.  Suffix (k, K, m, M, g, or G) can be used to indicate
> +  that value is kilo, mega or gigabytes.
> +
> +  Note: memory.dirty_limit_in_bytes is the counterpart of memory.dirty_ratio.
> +  Only one of them may be specified at a time.  When one is written it is
> +  immediately taken into account to evaluate the dirty memory limits and the
> +  other appears as 0 when read.
> +
> +- memory.dirty_background_ratio: the amount of dirty memory of the cgroup
> +  (expressed as a percentage of cgroup memory) at which background writeback
> +  kernel threads will start writing out dirty data.
> +
> +- memory.dirty_background_limit_in_bytes: the amount of dirty memory (expressed
> +  in bytes) in the cgroup at which background writeback kernel threads will
> +  start writing out dirty data.  Suffix (k, K, m, M, g, or G) can be used to
> +  indicate that value is kilo, mega or gigabytes.
> +
> +  Note: memory.dirty_background_limit_in_bytes is the counterpart of
> +  memory.dirty_background_ratio.  Only one of them may be specified at a time.
> +  When one is written it is immediately taken into account to evaluate the dirty
> +  memory limits and the other appears as 0 when read.
> +
> +A cgroup may contain more dirty memory than its dirty limit.  This is possible
> +because of the principle that the first cgroup to touch a page is charged for
> +it.  Subsequent page counting events (dirty, writeback, nfs_unstable) are also
> +counted to the originally charged cgroup.
> +
> +Example: If page is allocated by a cgroup A task, then the page is charged to
> +cgroup A.  If the page is later dirtied by a task in cgroup B, then the cgroup A
> +dirty count will be incremented.  If cgroup A is over its dirty limit but cgroup
> +B is not, then dirtying a cgroup A page from a cgroup B task may push cgroup A
> +over its dirty limit without throttling the dirtying cgroup B task.
> +
>  6. Hierarchy support
>  
>  The memory controller supports a deep hierarchy and hierarchical accounting.
> -- 
> 1.7.1
> 
Can you clarify whether we can limit the "total" dirty pages under a hierarchy
in the use_hierarchy==1 case?
If we can, I think it would be better to note it in this documentation.


Thanks,
Daisuke Nishimura. 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 02/11] memcg: document cgroup dirty memory interfaces
  2010-10-19  8:27   ` Daisuke Nishimura
@ 2010-10-19 21:00     ` Greg Thelen
  2010-10-20  0:11       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 64+ messages in thread
From: Greg Thelen @ 2010-10-19 21:00 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Minchan Kim, Ciju Rajan K,
	David Rientjes

Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> writes:

> On Mon, 18 Oct 2010 17:39:35 -0700
> Greg Thelen <gthelen@google.com> wrote:
>
>> Document cgroup dirty memory interfaces and statistics.
>> 
>> Signed-off-by: Andrea Righi <arighi@develer.com>
>> Signed-off-by: Greg Thelen <gthelen@google.com>
>> ---
>> 
>> Changelog since v1:
>> - Renamed "nfs"/"total_nfs" to "nfs_unstable"/"total_nfs_unstable" in per cgroup
>>   memory.stat to match /proc/meminfo.
>> 
>> - Allow [kKmMgG] suffixes for newly created dirty limit value cgroupfs files.
>> 
>> - Describe a situation where a cgroup can exceed its dirty limit.
>> 
>>  Documentation/cgroups/memory.txt |   60 ++++++++++++++++++++++++++++++++++++++
>>  1 files changed, 60 insertions(+), 0 deletions(-)
>> 
>> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
>> index 7781857..02bbd6f 100644
>> --- a/Documentation/cgroups/memory.txt
>> +++ b/Documentation/cgroups/memory.txt
>> @@ -385,6 +385,10 @@ mapped_file	- # of bytes of mapped file (includes tmpfs/shmem)
>>  pgpgin		- # of pages paged in (equivalent to # of charging events).
>>  pgpgout		- # of pages paged out (equivalent to # of uncharging events).
>>  swap		- # of bytes of swap usage
>> +dirty		- # of bytes that are waiting to get written back to the disk.
>> +writeback	- # of bytes that are actively being written back to the disk.
>> +nfs_unstable	- # of bytes sent to the NFS server, but not yet committed to
>> +		the actual storage.
>>  inactive_anon	- # of bytes of anonymous memory and swap cache memory on
>>  		LRU list.
>>  active_anon	- # of bytes of anonymous and swap cache memory on active
>
> Shouldn't we add description of "total_diryt/writeback/nfs_unstable" too ?
> Seeing [5/11], it will be showed in memory.stat.

Good catch.  See patch (below).

>> @@ -453,6 +457,62 @@ memory under it will be reclaimed.
>>  You can reset failcnt by writing 0 to failcnt file.
>>  # echo 0 > .../memory.failcnt
>>  
>> +5.5 dirty memory
>> +
>> +Control the maximum amount of dirty pages a cgroup can have at any given time.
>> +
>> +Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim)
>> +page cache used by a cgroup.  So, in case of multiple cgroup writers, they will
>> +not be able to consume more than their designated share of dirty pages and will
>> +be forced to perform write-out if they cross that limit.
>> +
>> +The interface is equivalent to the procfs interface: /proc/sys/vm/dirty_*.  It
>> +is possible to configure a limit to trigger both a direct writeback or a
>> +background writeback performed by per-bdi flusher threads.  The root cgroup
>> +memory.dirty_* control files are read-only and match the contents of
>> +the /proc/sys/vm/dirty_* files.
>> +
>> +Per-cgroup dirty limits can be set using the following files in the cgroupfs:
>> +
>> +- memory.dirty_ratio: the amount of dirty memory (expressed as a percentage of
>> +  cgroup memory) at which a process generating dirty pages will itself start
>> +  writing out dirty data.
>> +
>> +- memory.dirty_limit_in_bytes: the amount of dirty memory (expressed in bytes)
>> +  in the cgroup at which a process generating dirty pages will start itself
>> +  writing out dirty data.  Suffix (k, K, m, M, g, or G) can be used to indicate
>> +  that value is kilo, mega or gigabytes.
>> +
>> +  Note: memory.dirty_limit_in_bytes is the counterpart of memory.dirty_ratio.
>> +  Only one of them may be specified at a time.  When one is written it is
>> +  immediately taken into account to evaluate the dirty memory limits and the
>> +  other appears as 0 when read.
>> +
>> +- memory.dirty_background_ratio: the amount of dirty memory of the cgroup
>> +  (expressed as a percentage of cgroup memory) at which background writeback
>> +  kernel threads will start writing out dirty data.
>> +
>> +- memory.dirty_background_limit_in_bytes: the amount of dirty memory (expressed
>> +  in bytes) in the cgroup at which background writeback kernel threads will
>> +  start writing out dirty data.  Suffix (k, K, m, M, g, or G) can be used to
>> +  indicate that value is kilo, mega or gigabytes.
>> +
>> +  Note: memory.dirty_background_limit_in_bytes is the counterpart of
>> +  memory.dirty_background_ratio.  Only one of them may be specified at a time.
>> +  When one is written it is immediately taken into account to evaluate the dirty
>> +  memory limits and the other appears as 0 when read.
>> +
>> +A cgroup may contain more dirty memory than its dirty limit.  This is possible
>> +because of the principle that the first cgroup to touch a page is charged for
>> +it.  Subsequent page counting events (dirty, writeback, nfs_unstable) are also
>> +counted to the originally charged cgroup.
>> +
>> +Example: If page is allocated by a cgroup A task, then the page is charged to
>> +cgroup A.  If the page is later dirtied by a task in cgroup B, then the cgroup A
>> +dirty count will be incremented.  If cgroup A is over its dirty limit but cgroup
>> +B is not, then dirtying a cgroup A page from a cgroup B task may push cgroup A
>> +over its dirty limit without throttling the dirtying cgroup B task.
>> +
>>  6. Hierarchy support
>>  
>>  The memory controller supports a deep hierarchy and hierarchical accounting.
>> -- 
>> 1.7.1
>> 
> Can you clarify whether we can limit the "total" dirty pages under hierarchy
> in use_hierarchy==1 case ?
> If we can, I think it would be better to note it in this documentation.
>
>
> Thanks,
> Daisuke Nishimura.

Here is a second version of this -v3 doc patch:

Author: Greg Thelen <gthelen@google.com>
Date:   Sat Apr 10 15:34:28 2010 -0700

    memcg: document cgroup dirty memory interfaces
    
    Document cgroup dirty memory interfaces and statistics.
    
    Signed-off-by: Andrea Righi <arighi@develer.com>
    Signed-off-by: Greg Thelen <gthelen@google.com>

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 7781857..8bf6d3b 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -385,6 +385,10 @@ mapped_file	- # of bytes of mapped file (includes tmpfs/shmem)
 pgpgin		- # of pages paged in (equivalent to # of charging events).
 pgpgout		- # of pages paged out (equivalent to # of uncharging events).
 swap		- # of bytes of swap usage
+dirty		- # of bytes that are waiting to get written back to the disk.
+writeback	- # of bytes that are actively being written back to the disk.
+nfs_unstable	- # of bytes sent to the NFS server, but not yet committed to
+		the actual storage.
 inactive_anon	- # of bytes of anonymous memory and swap cache memory on
 		LRU list.
 active_anon	- # of bytes of anonymous and swap cache memory on active
@@ -406,6 +410,9 @@ total_mapped_file	- sum of all children's "cache"
 total_pgpgin		- sum of all children's "pgpgin"
 total_pgpgout		- sum of all children's "pgpgout"
 total_swap		- sum of all children's "swap"
+total_dirty		- sum of all children's "dirty"
+total_writeback		- sum of all children's "writeback"
+total_nfs_unstable	- sum of all children's "nfs_unstable"
 total_inactive_anon	- sum of all children's "inactive_anon"
 total_active_anon	- sum of all children's "active_anon"
 total_inactive_file	- sum of all children's "inactive_file"
@@ -453,6 +460,71 @@ memory under it will be reclaimed.
 You can reset failcnt by writing 0 to failcnt file.
 # echo 0 > .../memory.failcnt
 
+5.5 dirty memory
+
+Control the maximum amount of dirty pages a cgroup can have at any given time.
+
+Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim)
+page cache used by a cgroup.  So, in case of multiple cgroup writers, they will
+not be able to consume more than their designated share of dirty pages and will
+be forced to perform write-out if they cross that limit.
+
+The interface is equivalent to the procfs interface: /proc/sys/vm/dirty_*.  It
+is possible to configure a limit to trigger both a direct writeback or a
+background writeback performed by per-bdi flusher threads.  The root cgroup
+memory.dirty_* control files are read-only and match the contents of
+the /proc/sys/vm/dirty_* files.
+
+Per-cgroup dirty limits can be set using the following files in the cgroupfs:
+
+- memory.dirty_ratio: the amount of dirty memory (expressed as a percentage of
+  cgroup memory) at which a process generating dirty pages will itself start
+  writing out dirty data.
+
+- memory.dirty_limit_in_bytes: the amount of dirty memory (expressed in bytes)
+  in the cgroup at which a process generating dirty pages will start itself
+  writing out dirty data.  Suffix (k, K, m, M, g, or G) can be used to indicate
+  that value is kilo, mega or gigabytes.
+
+  Note: memory.dirty_limit_in_bytes is the counterpart of memory.dirty_ratio.
+  Only one of them may be specified at a time.  When one is written it is
+  immediately taken into account to evaluate the dirty memory limits and the
+  other appears as 0 when read.
+
+- memory.dirty_background_ratio: the amount of dirty memory of the cgroup
+  (expressed as a percentage of cgroup memory) at which background writeback
+  kernel threads will start writing out dirty data.
+
+- memory.dirty_background_limit_in_bytes: the amount of dirty memory (expressed
+  in bytes) in the cgroup at which background writeback kernel threads will
+  start writing out dirty data.  Suffix (k, K, m, M, g, or G) can be used to
+  indicate that value is kilo, mega or gigabytes.
+
+  Note: memory.dirty_background_limit_in_bytes is the counterpart of
+  memory.dirty_background_ratio.  Only one of them may be specified at a time.
+  When one is written it is immediately taken into account to evaluate the dirty
+  memory limits and the other appears as 0 when read.
+
+A cgroup may contain more dirty memory than its dirty limit.  This is possible
+because of the principle that the first cgroup to touch a page is charged for
+it.  Subsequent page counting events (dirty, writeback, nfs_unstable) are also
+counted to the originally charged cgroup.
+
+Example: If page is allocated by a cgroup A task, then the page is charged to
+cgroup A.  If the page is later dirtied by a task in cgroup B, then the cgroup A
+dirty count will be incremented.  If cgroup A is over its dirty limit but cgroup
+B is not, then dirtying a cgroup A page from a cgroup B task may push cgroup A
+over its dirty limit without throttling the dirtying cgroup B task.
+
+When use_hierarchy=0, each cgroup has independent dirty memory usage and limits.
+
+When use_hierarchy=1, a parent cgroup increasing its dirty memory usage will
+compare its total_dirty memory (which includes sum of all child cgroup dirty
+memory) to its dirty limits.  This keeps a parent from explicitly exceeding its
+dirty limits.  However, a child cgroup can increase its dirty usage without
+considering the parent's dirty limits.  Thus the parent's total_dirty can exceed
+the parent's dirty limits as a child dirties pages.
+
 6. Hierarchy support
 
 The memory controller supports a deep hierarchy and hierarchical accounting.
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 02/11] memcg: document cgroup dirty memory interfaces
  2010-10-19 21:00     ` Greg Thelen
@ 2010-10-20  0:11       ` KAMEZAWA Hiroyuki
  2010-10-20  0:45         ` Greg Thelen
  2010-10-20  0:48         ` Daisuke Nishimura
  0 siblings, 2 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-20  0:11 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Daisuke Nishimura, Andrew Morton, linux-kernel, linux-mm,
	containers, Andrea Righi, Balbir Singh, Minchan Kim,
	Ciju Rajan K, David Rientjes

On Tue, 19 Oct 2010 14:00:58 -0700
Greg Thelen <gthelen@google.com> wrote:

> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> writes:
> 
> > On Mon, 18 Oct 2010 17:39:35 -0700
> > Greg Thelen <gthelen@google.com> wrote:
> >
> >> Document cgroup dirty memory interfaces and statistics.
> >> 
> >> Signed-off-by: Andrea Righi <arighi@develer.com>
> >> Signed-off-by: Greg Thelen <gthelen@google.com>
> >> ---
> >> 
> >> Changelog since v1:
> >> - Renamed "nfs"/"total_nfs" to "nfs_unstable"/"total_nfs_unstable" in per cgroup
> >>   memory.stat to match /proc/meminfo.
> >> 
> >> - Allow [kKmMgG] suffixes for newly created dirty limit value cgroupfs files.
> >> 
> >> - Describe a situation where a cgroup can exceed its dirty limit.
> >> 
> >>  Documentation/cgroups/memory.txt |   60 ++++++++++++++++++++++++++++++++++++++
> >>  1 files changed, 60 insertions(+), 0 deletions(-)
> >> 
> >> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> >> index 7781857..02bbd6f 100644
> >> --- a/Documentation/cgroups/memory.txt
> >> +++ b/Documentation/cgroups/memory.txt
> >> @@ -385,6 +385,10 @@ mapped_file	- # of bytes of mapped file (includes tmpfs/shmem)
> >>  pgpgin		- # of pages paged in (equivalent to # of charging events).
> >>  pgpgout		- # of pages paged out (equivalent to # of uncharging events).
> >>  swap		- # of bytes of swap usage
> >> +dirty		- # of bytes that are waiting to get written back to the disk.
> >> +writeback	- # of bytes that are actively being written back to the disk.
> >> +nfs_unstable	- # of bytes sent to the NFS server, but not yet committed to
> >> +		the actual storage.
> >>  inactive_anon	- # of bytes of anonymous memory and swap cache memory on
> >>  		LRU list.
> >>  active_anon	- # of bytes of anonymous and swap cache memory on active
> >
> > Shouldn't we add a description of "total_dirty/writeback/nfs_unstable" too?
> > Seeing [5/11], they will be shown in memory.stat.
> 
> Good catch.  See patch (below).
> 
> >> @@ -453,6 +457,62 @@ memory under it will be reclaimed.
> >>  You can reset failcnt by writing 0 to failcnt file.
> >>  # echo 0 > .../memory.failcnt
> >>  
> >> +5.5 dirty memory
> >> +
> >> +Control the maximum amount of dirty pages a cgroup can have at any given time.
> >> +
> >> +Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim)
> >> +page cache used by a cgroup.  So, in case of multiple cgroup writers, they will
> >> +not be able to consume more than their designated share of dirty pages and will
> >> +be forced to perform write-out if they cross that limit.
> >> +
> >> +The interface is equivalent to the procfs interface: /proc/sys/vm/dirty_*.  It
> >> +is possible to configure a limit to trigger both a direct writeback or a
> >> +background writeback performed by per-bdi flusher threads.  The root cgroup
> >> +memory.dirty_* control files are read-only and match the contents of
> >> +the /proc/sys/vm/dirty_* files.
> >> +
> >> +Per-cgroup dirty limits can be set using the following files in the cgroupfs:
> >> +
> >> +- memory.dirty_ratio: the amount of dirty memory (expressed as a percentage of
> >> +  cgroup memory) at which a process generating dirty pages will itself start
> >> +  writing out dirty data.
> >> +
> >> +- memory.dirty_limit_in_bytes: the amount of dirty memory (expressed in bytes)
> >> +  in the cgroup at which a process generating dirty pages will start itself
> >> +  writing out dirty data.  Suffix (k, K, m, M, g, or G) can be used to indicate
> >> +  that value is kilo, mega or gigabytes.
> >> +
> >> +  Note: memory.dirty_limit_in_bytes is the counterpart of memory.dirty_ratio.
> >> +  Only one of them may be specified at a time.  When one is written it is
> >> +  immediately taken into account to evaluate the dirty memory limits and the
> >> +  other appears as 0 when read.
> >> +
> >> +- memory.dirty_background_ratio: the amount of dirty memory of the cgroup
> >> +  (expressed as a percentage of cgroup memory) at which background writeback
> >> +  kernel threads will start writing out dirty data.
> >> +
> >> +- memory.dirty_background_limit_in_bytes: the amount of dirty memory (expressed
> >> +  in bytes) in the cgroup at which background writeback kernel threads will
> >> +  start writing out dirty data.  Suffix (k, K, m, M, g, or G) can be used to
> >> +  indicate that value is kilo, mega or gigabytes.
> >> +
> >> +  Note: memory.dirty_background_limit_in_bytes is the counterpart of
> >> +  memory.dirty_background_ratio.  Only one of them may be specified at a time.
> >> +  When one is written it is immediately taken into account to evaluate the dirty
> >> +  memory limits and the other appears as 0 when read.
> >> +
> >> +A cgroup may contain more dirty memory than its dirty limit.  This is possible
> >> +because of the principle that the first cgroup to touch a page is charged for
> >> +it.  Subsequent page counting events (dirty, writeback, nfs_unstable) are also
> >> +counted to the originally charged cgroup.
> >> +
> >> +Example: If page is allocated by a cgroup A task, then the page is charged to
> >> +cgroup A.  If the page is later dirtied by a task in cgroup B, then the cgroup A
> >> +dirty count will be incremented.  If cgroup A is over its dirty limit but cgroup
> >> +B is not, then dirtying a cgroup A page from a cgroup B task may push cgroup A
> >> +over its dirty limit without throttling the dirtying cgroup B task.
> >> +
> >>  6. Hierarchy support
> >>  
> >>  The memory controller supports a deep hierarchy and hierarchical accounting.
> >> -- 
> >> 1.7.1
> >> 
> > Can you clarify whether we can limit the "total" dirty pages under hierarchy
> > in use_hierarchy==1 case ?
> > If we can, I think it would be better to note it in this documentation.
> >
> >
> > Thanks,
> > Daisuke Nishimura.
> 
> Here is a second version of this -v3 doc patch:
> 
> Author: Greg Thelen <gthelen@google.com>
> Date:   Sat Apr 10 15:34:28 2010 -0700
> 
>     memcg: document cgroup dirty memory interfaces
>     
>     Document cgroup dirty memory interfaces and statistics.
>     
>     Signed-off-by: Andrea Righi <arighi@develer.com>
>     Signed-off-by: Greg Thelen <gthelen@google.com>
> 

Nitpicks below.  And again, why do you always drop Acks?

> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index 7781857..8bf6d3b 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -385,6 +385,10 @@ mapped_file	- # of bytes of mapped file (includes tmpfs/shmem)
>  pgpgin		- # of pages paged in (equivalent to # of charging events).
>  pgpgout		- # of pages paged out (equivalent to # of uncharging events).
>  swap		- # of bytes of swap usage
> +dirty		- # of bytes that are waiting to get written back to the disk.

extra tab ?

> +writeback	- # of bytes that are actively being written back to the disk.
> +nfs_unstable	- # of bytes sent to the NFS server, but not yet committed to
> +		the actual storage.
>  inactive_anon	- # of bytes of anonymous memory and swap cache memory on
>  		LRU list.
>  active_anon	- # of bytes of anonymous and swap cache memory on active
> @@ -406,6 +410,9 @@ total_mapped_file	- sum of all children's "cache"
>  total_pgpgin		- sum of all children's "pgpgin"
>  total_pgpgout		- sum of all children's "pgpgout"
>  total_swap		- sum of all children's "swap"
> +total_dirty		- sum of all children's "dirty"
> +total_writeback		- sum of all children's "writeback"

here, too.

> +total_nfs_unstable	- sum of all children's "nfs_unstable"
>  total_inactive_anon	- sum of all children's "inactive_anon"
>  total_active_anon	- sum of all children's "active_anon"
>  total_inactive_file	- sum of all children's "inactive_file"
> @@ -453,6 +460,71 @@ memory under it will be reclaimed.
>  You can reset failcnt by writing 0 to failcnt file.
>  # echo 0 > .../memory.failcnt
>  
> +5.5 dirty memory
> +
> +Control the maximum amount of dirty pages a cgroup can have at any given time.
> +
> +Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim)
> +page cache used by a cgroup.  So, in case of multiple cgroup writers, they will
> +not be able to consume more than their designated share of dirty pages and will
> +be forced to perform write-out if they cross that limit.
> +
> +The interface is equivalent to the procfs interface: /proc/sys/vm/dirty_*.  It
> +is possible to configure a limit to trigger both a direct writeback or a
> +background writeback performed by per-bdi flusher threads.  The root cgroup
> +memory.dirty_* control files are read-only and match the contents of
> +the /proc/sys/vm/dirty_* files.
> +
> +Per-cgroup dirty limits can be set using the following files in the cgroupfs:
> +
> +- memory.dirty_ratio: the amount of dirty memory (expressed as a percentage of
> +  cgroup memory) at which a process generating dirty pages will itself start
> +  writing out dirty data.
> +
> +- memory.dirty_limit_in_bytes: the amount of dirty memory (expressed in bytes)
> +  in the cgroup at which a process generating dirty pages will start itself
> +  writing out dirty data.  Suffix (k, K, m, M, g, or G) can be used to indicate
> +  that value is kilo, mega or gigabytes.
> +
> +  Note: memory.dirty_limit_in_bytes is the counterpart of memory.dirty_ratio.
> +  Only one of them may be specified at a time.  When one is written it is
> +  immediately taken into account to evaluate the dirty memory limits and the
> +  other appears as 0 when read.
> +
> +- memory.dirty_background_ratio: the amount of dirty memory of the cgroup
> +  (expressed as a percentage of cgroup memory) at which background writeback
> +  kernel threads will start writing out dirty data.
> +
> +- memory.dirty_background_limit_in_bytes: the amount of dirty memory (expressed
> +  in bytes) in the cgroup at which background writeback kernel threads will
> +  start writing out dirty data.  Suffix (k, K, m, M, g, or G) can be used to
> +  indicate that value is kilo, mega or gigabytes.
> +
> +  Note: memory.dirty_background_limit_in_bytes is the counterpart of
> +  memory.dirty_background_ratio.  Only one of them may be specified at a time.
> +  When one is written it is immediately taken into account to evaluate the dirty
> +  memory limits and the other appears as 0 when read.
> +
> +A cgroup may contain more dirty memory than its dirty limit.  This is possible
> +because of the principle that the first cgroup to touch a page is charged for
> +it.  Subsequent page counting events (dirty, writeback, nfs_unstable) are also
> +counted to the originally charged cgroup.
> +
> +Example: If page is allocated by a cgroup A task, then the page is charged to
> +cgroup A.  If the page is later dirtied by a task in cgroup B, then the cgroup A
> +dirty count will be incremented.  If cgroup A is over its dirty limit but cgroup
> +B is not, then dirtying a cgroup A page from a cgroup B task may push cgroup A
> +over its dirty limit without throttling the dirtying cgroup B task.
> +
> +When use_hierarchy=0, each cgroup has independent dirty memory usage and limits.
> +
> +When use_hierarchy=1, a parent cgroup increasing its dirty memory usage will
> +compare its total_dirty memory (which includes sum of all child cgroup dirty
> +memory) to its dirty limits.  This keeps a parent from explicitly exceeding its
> +dirty limits.  However, a child cgroup can increase its dirty usage without
> +considering the parent's dirty limits.  Thus the parent's total_dirty can exceed
> +the parent's dirty limits as a child dirties pages.

Hmm. In short, dirty_ratio in the use_hierarchy=1 case doesn't work as a user expects.
Is this the intended spec or just the current implementation?

I think of the following:
 - add a limitation: "When setting a child's dirty_ratio, it must be below the parent's.
   If it exceeds the parent's dirty_ratio, EINVAL is returned."

Could you modify the memory.dirty_ratio setting code?
Then, the parent's dirty limit will never be exceeded. (If I understand correctly.)

"memory.dirty_limit_in_bytes" will be a bit more complicated, but I think you can.


Thanks,
-Kame


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 02/11] memcg: document cgroup dirty memory interfaces
  2010-10-20  0:11       ` KAMEZAWA Hiroyuki
@ 2010-10-20  0:45         ` Greg Thelen
  2010-10-20  4:06           ` KAMEZAWA Hiroyuki
  2010-10-20  0:48         ` Daisuke Nishimura
  1 sibling, 1 reply; 64+ messages in thread
From: Greg Thelen @ 2010-10-20  0:45 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, Andrew Morton, linux-kernel, linux-mm,
	containers, Andrea Righi, Balbir Singh, Minchan Kim,
	Ciju Rajan K, David Rientjes

KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:

> On Tue, 19 Oct 2010 14:00:58 -0700
> Greg Thelen <gthelen@google.com> wrote:
>
>> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> writes:
>> 
>> > On Mon, 18 Oct 2010 17:39:35 -0700
>> > Greg Thelen <gthelen@google.com> wrote:
>> >
>> >> Document cgroup dirty memory interfaces and statistics.
>> >> 
>> >> Signed-off-by: Andrea Righi <arighi@develer.com>
>> >> Signed-off-by: Greg Thelen <gthelen@google.com>
>> >> ---
>> >> 
>> >> Changelog since v1:
>> >> - Renamed "nfs"/"total_nfs" to "nfs_unstable"/"total_nfs_unstable" in per cgroup
>> >>   memory.stat to match /proc/meminfo.
>> >> 
>> >> - Allow [kKmMgG] suffixes for newly created dirty limit value cgroupfs files.
>> >> 
>> >> - Describe a situation where a cgroup can exceed its dirty limit.
>> >> 
>> >>  Documentation/cgroups/memory.txt |   60 ++++++++++++++++++++++++++++++++++++++
>> >>  1 files changed, 60 insertions(+), 0 deletions(-)
>> >> 
>> >> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
>> >> index 7781857..02bbd6f 100644
>> >> --- a/Documentation/cgroups/memory.txt
>> >> +++ b/Documentation/cgroups/memory.txt
>> >> @@ -385,6 +385,10 @@ mapped_file	- # of bytes of mapped file (includes tmpfs/shmem)
>> >>  pgpgin		- # of pages paged in (equivalent to # of charging events).
>> >>  pgpgout		- # of pages paged out (equivalent to # of uncharging events).
>> >>  swap		- # of bytes of swap usage
>> >> +dirty		- # of bytes that are waiting to get written back to the disk.
>> >> +writeback	- # of bytes that are actively being written back to the disk.
>> >> +nfs_unstable	- # of bytes sent to the NFS server, but not yet committed to
>> >> +		the actual storage.
>> >>  inactive_anon	- # of bytes of anonymous memory and swap cache memory on
>> >>  		LRU list.
>> >>  active_anon	- # of bytes of anonymous and swap cache memory on active
>> >
>> > Shouldn't we add a description of "total_dirty/writeback/nfs_unstable" too?
>> > Seeing [5/11], they will be shown in memory.stat.
>> 
>> Good catch.  See patch (below).
>> 
>> >> @@ -453,6 +457,62 @@ memory under it will be reclaimed.
>> >>  You can reset failcnt by writing 0 to failcnt file.
>> >>  # echo 0 > .../memory.failcnt
>> >>  
>> >> +5.5 dirty memory
>> >> +
>> >> +Control the maximum amount of dirty pages a cgroup can have at any given time.
>> >> +
>> >> +Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim)
>> >> +page cache used by a cgroup.  So, in case of multiple cgroup writers, they will
>> >> +not be able to consume more than their designated share of dirty pages and will
>> >> +be forced to perform write-out if they cross that limit.
>> >> +
>> >> +The interface is equivalent to the procfs interface: /proc/sys/vm/dirty_*.  It
>> >> +is possible to configure a limit to trigger both a direct writeback or a
>> >> +background writeback performed by per-bdi flusher threads.  The root cgroup
>> >> +memory.dirty_* control files are read-only and match the contents of
>> >> +the /proc/sys/vm/dirty_* files.
>> >> +
>> >> +Per-cgroup dirty limits can be set using the following files in the cgroupfs:
>> >> +
>> >> +- memory.dirty_ratio: the amount of dirty memory (expressed as a percentage of
>> >> +  cgroup memory) at which a process generating dirty pages will itself start
>> >> +  writing out dirty data.
>> >> +
>> >> +- memory.dirty_limit_in_bytes: the amount of dirty memory (expressed in bytes)
>> >> +  in the cgroup at which a process generating dirty pages will start itself
>> >> +  writing out dirty data.  Suffix (k, K, m, M, g, or G) can be used to indicate
>> >> +  that value is kilo, mega or gigabytes.
>> >> +
>> >> +  Note: memory.dirty_limit_in_bytes is the counterpart of memory.dirty_ratio.
>> >> +  Only one of them may be specified at a time.  When one is written it is
>> >> +  immediately taken into account to evaluate the dirty memory limits and the
>> >> +  other appears as 0 when read.
>> >> +
>> >> +- memory.dirty_background_ratio: the amount of dirty memory of the cgroup
>> >> +  (expressed as a percentage of cgroup memory) at which background writeback
>> >> +  kernel threads will start writing out dirty data.
>> >> +
>> >> +- memory.dirty_background_limit_in_bytes: the amount of dirty memory (expressed
>> >> +  in bytes) in the cgroup at which background writeback kernel threads will
>> >> +  start writing out dirty data.  Suffix (k, K, m, M, g, or G) can be used to
>> >> +  indicate that value is kilo, mega or gigabytes.
>> >> +
>> >> +  Note: memory.dirty_background_limit_in_bytes is the counterpart of
>> >> +  memory.dirty_background_ratio.  Only one of them may be specified at a time.
>> >> +  When one is written it is immediately taken into account to evaluate the dirty
>> >> +  memory limits and the other appears as 0 when read.
>> >> +
>> >> +A cgroup may contain more dirty memory than its dirty limit.  This is possible
>> >> +because of the principle that the first cgroup to touch a page is charged for
>> >> +it.  Subsequent page counting events (dirty, writeback, nfs_unstable) are also
>> >> +counted to the originally charged cgroup.
>> >> +
>> >> +Example: If page is allocated by a cgroup A task, then the page is charged to
>> >> +cgroup A.  If the page is later dirtied by a task in cgroup B, then the cgroup A
>> >> +dirty count will be incremented.  If cgroup A is over its dirty limit but cgroup
>> >> +B is not, then dirtying a cgroup A page from a cgroup B task may push cgroup A
>> >> +over its dirty limit without throttling the dirtying cgroup B task.
>> >> +
>> >>  6. Hierarchy support
>> >>  
>> >>  The memory controller supports a deep hierarchy and hierarchical accounting.
>> >> -- 
>> >> 1.7.1
>> >> 
>> > Can you clarify whether we can limit the "total" dirty pages under hierarchy
>> > in use_hierarchy==1 case ?
>> > If we can, I think it would be better to note it in this documentation.
>> >
>> >
>> > Thanks,
>> > Daisuke Nishimura.
>> 
>> Here is a second version of this -v3 doc patch:
>> 
>> Author: Greg Thelen <gthelen@google.com>
>> Date:   Sat Apr 10 15:34:28 2010 -0700
>> 
>>     memcg: document cgroup dirty memory interfaces
>>     
>>     Document cgroup dirty memory interfaces and statistics.
>>     
>>     Signed-off-by: Andrea Righi <arighi@develer.com>
>>     Signed-off-by: Greg Thelen <gthelen@google.com>
>> 
>
> nitpicks. and again, why you always drop Acks ?

I dropped acks because the patch changed and I did not want to assume
that it was still acceptable.  Is this incorrect protocol?

>> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
>> index 7781857..8bf6d3b 100644
>> --- a/Documentation/cgroups/memory.txt
>> +++ b/Documentation/cgroups/memory.txt
>> @@ -385,6 +385,10 @@ mapped_file	- # of bytes of mapped file (includes tmpfs/shmem)
>>  pgpgin		- # of pages paged in (equivalent to # of charging events).
>>  pgpgout		- # of pages paged out (equivalent to # of uncharging events).
>>  swap		- # of bytes of swap usage
>> +dirty		- # of bytes that are waiting to get written back to the disk.
>
> extra tab ?

There is no extra tab here.  It's a display artifact.  When the patch is
applied the columns line up.

>> +writeback	- # of bytes that are actively being written back to the disk.
>> +nfs_unstable	- # of bytes sent to the NFS server, but not yet committed to
>> +		the actual storage.
>>  inactive_anon	- # of bytes of anonymous memory and swap cache memory on
>>  		LRU list.
>>  active_anon	- # of bytes of anonymous and swap cache memory on active
>> @@ -406,6 +410,9 @@ total_mapped_file	- sum of all children's "cache"
>>  total_pgpgin		- sum of all children's "pgpgin"
>>  total_pgpgout		- sum of all children's "pgpgout"
>>  total_swap		- sum of all children's "swap"
>> +total_dirty		- sum of all children's "dirty"
>> +total_writeback		- sum of all children's "writeback"
>
> here, too.

There is no extra tab here.  It's a display artifact.  When the patch is
applied the columns line up.

>> +total_nfs_unstable	- sum of all children's "nfs_unstable"
>>  total_inactive_anon	- sum of all children's "inactive_anon"
>>  total_active_anon	- sum of all children's "active_anon"
>>  total_inactive_file	- sum of all children's "inactive_file"
>> @@ -453,6 +460,71 @@ memory under it will be reclaimed.
>>  You can reset failcnt by writing 0 to failcnt file.
>>  # echo 0 > .../memory.failcnt
>>  
>> +5.5 dirty memory
>> +
>> +Control the maximum amount of dirty pages a cgroup can have at any given time.
>> +
>> +Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim)
>> +page cache used by a cgroup.  So, in case of multiple cgroup writers, they will
>> +not be able to consume more than their designated share of dirty pages and will
>> +be forced to perform write-out if they cross that limit.
>> +
>> +The interface is equivalent to the procfs interface: /proc/sys/vm/dirty_*.  It
>> +is possible to configure a limit to trigger both a direct writeback or a
>> +background writeback performed by per-bdi flusher threads.  The root cgroup
>> +memory.dirty_* control files are read-only and match the contents of
>> +the /proc/sys/vm/dirty_* files.
>> +
>> +Per-cgroup dirty limits can be set using the following files in the cgroupfs:
>> +
>> +- memory.dirty_ratio: the amount of dirty memory (expressed as a percentage of
>> +  cgroup memory) at which a process generating dirty pages will itself start
>> +  writing out dirty data.
>> +
>> +- memory.dirty_limit_in_bytes: the amount of dirty memory (expressed in bytes)
>> +  in the cgroup at which a process generating dirty pages will start itself
>> +  writing out dirty data.  Suffix (k, K, m, M, g, or G) can be used to indicate
>> +  that value is kilo, mega or gigabytes.
>> +
>> +  Note: memory.dirty_limit_in_bytes is the counterpart of memory.dirty_ratio.
>> +  Only one of them may be specified at a time.  When one is written it is
>> +  immediately taken into account to evaluate the dirty memory limits and the
>> +  other appears as 0 when read.
>> +
>> +- memory.dirty_background_ratio: the amount of dirty memory of the cgroup
>> +  (expressed as a percentage of cgroup memory) at which background writeback
>> +  kernel threads will start writing out dirty data.
>> +
>> +- memory.dirty_background_limit_in_bytes: the amount of dirty memory (expressed
>> +  in bytes) in the cgroup at which background writeback kernel threads will
>> +  start writing out dirty data.  Suffix (k, K, m, M, g, or G) can be used to
>> +  indicate that value is kilo, mega or gigabytes.
>> +
>> +  Note: memory.dirty_background_limit_in_bytes is the counterpart of
>> +  memory.dirty_background_ratio.  Only one of them may be specified at a time.
>> +  When one is written it is immediately taken into account to evaluate the dirty
>> +  memory limits and the other appears as 0 when read.
>> +
>> +A cgroup may contain more dirty memory than its dirty limit.  This is possible
>> +because of the principle that the first cgroup to touch a page is charged for
>> +it.  Subsequent page counting events (dirty, writeback, nfs_unstable) are also
>> +counted to the originally charged cgroup.
>> +
>> +Example: If page is allocated by a cgroup A task, then the page is charged to
>> +cgroup A.  If the page is later dirtied by a task in cgroup B, then the cgroup A
>> +dirty count will be incremented.  If cgroup A is over its dirty limit but cgroup
>> +B is not, then dirtying a cgroup A page from a cgroup B task may push cgroup A
>> +over its dirty limit without throttling the dirtying cgroup B task.
>> +
>> +When use_hierarchy=0, each cgroup has independent dirty memory usage and limits.
>> +
>> +When use_hierarchy=1, a parent cgroup increasing its dirty memory usage will
>> +compare its total_dirty memory (which includes sum of all child cgroup dirty
>> +memory) to its dirty limits.  This keeps a parent from explicitly exceeding its
>> +dirty limits.  However, a child cgroup can increase its dirty usage without
>> +considering the parent's dirty limits.  Thus the parent's total_dirty can exceed
>> +the parent's dirty limits as a child dirties pages.
>
> Hmm. in short, dirty_ratio in use_hierarchy=1 doesn't work as an user
> expects.  Is this a spec. or a current implementation ?

This limitation is due to the current implementation.  I agree that it
is not perfect.  We could extend the page-writeback.c changes, PATCH
11/11 ( http://marc.info/?l=linux-mm&m=128744907030215 ), to also check
the dirty limit of each parent in the memcg hierarchy.  This would walk
up the tree until root or a cgroup with use_hierarchy=0 is found.
Alternatively, we could provide this functionality in a later patch
series.  The changes to page-writeback.c may be significant.
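
To make that concrete, the per-memcg check in the writeback path might look
roughly like the sketch below.  This is illustration only, not code from the
posted patches; memcg_dirty_pages() and memcg_dirty_limit() are hypothetical
stand-ins for the accounting and limit lookups that the series provides.

static bool mem_cgroup_hierarchy_over_dirty_limit(struct mem_cgroup *mem)
{
	struct cgroup *cgroup = mem->css.cgroup;

	/* check the cgroup that was charged for the pages first */
	if (memcg_dirty_pages(mem) > memcg_dirty_limit(mem))
		return true;
	if (!mem->use_hierarchy)
		return false;

	/* then walk toward the root, stopping at a non-hierarchical parent */
	while (cgroup->parent) {
		cgroup = cgroup->parent;
		mem = mem_cgroup_from_cont(cgroup);
		if (!mem->use_hierarchy)
			break;
		if (memcg_dirty_pages(mem) > memcg_dirty_limit(mem))
			return true;
	}
	return false;
}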

> I think as following.
>  - add a limitation as "At setting chidlren's dirty_ratio, it must be
>    below parent's.  If it exceeds parent's dirty_ratio, EINVAL is
>    returned."
>
> Could you modify setting memory.dirty_ratio code ?

I assume we are only talking about the use_hierarchy=1 case.  What if
the parent ratio is changed?  If we want to ensure that child ratios are
never larger than parent, then the code must check every child cgroup to
ensure that each child ratio is <= the new parent ratio.  Correct?

Even if we manage to prevent all child ratios from exceeding parent
ratios, we still have the problem that the sum of the child ratios may
exceed the parent's.  Example:
         A (10%)
   B (10%)   C (10%)

There would be nothing to prevent the A, B, and C dirty ratios from all being
set to 10% as shown.  The current implementation would allow B and C to each
reach 10%, thereby pushing A to 20%.  We could require that each child dirty
limit fit within the parent dirty limit, so (B+C<=A).
This would allow for:

        A (10%)
   B (7%)   C (3%)

If we had this 10/7/3 limiting code, which statically partitions dirty
memory usage, then we would not need to walk up the memcg tree checking
each parent.  This is nice because it only complicates the setting of
dirty limits, which is not part of the performance path.  However,
static partitioning has limitations.  If the system has a dirty ratio
of 50% and we create 100 cgroups with equal dirty limits, the dirty
limit for each memcg would be 0.5%.
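
For illustration, a set-time check along those lines might look like the sketch
below.  This is not part of the series; mem_cgroup_dirty_ratio() and
sum_child_dirty_ratio() are hypothetical helpers (the latter would sum the
dirty_ratio of the parent's other children), while parent_mem_cgroup() already
exists in memcontrol.c.

static int mem_cgroup_validate_dirty_ratio(struct mem_cgroup *mem,
					   unsigned int new_ratio)
{
	struct mem_cgroup *parent = parent_mem_cgroup(mem);

	if (!parent || !parent->use_hierarchy)
		return 0;	/* no hierarchical constraint to enforce */

	/* the children's shares, including the new one, must fit in the parent */
	if (sum_child_dirty_ratio(parent, mem) + new_ratio >
	    mem_cgroup_dirty_ratio(parent))
		return -EINVAL;

	return 0;
}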

> Then, parent's dirty_ratio will never exceeds its own. (If I
> understand correctly.)
>
> "memory.dirty_limit_in_bytes" will be a bit more complecated, but I
> think you can.
>
>
> Thanks,
> -Kame

KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:
> I'd like to consider a patch.  Please mention that "use_hierarchy=1
> case depends on implemenation." for now.

I will clarify the current implementation behavior in the documentation.
A later patch series can change the use_hierarchy=1 behavior.


KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:
> BTW, how about supporing dirty_limit_in_bytes when use_hierarchy=0 or
> leave it as broken when use_hierarchy=1 ?  It seems we can only
> support dirty_ratio when hierarchy is used.

I am not sure what you mean here.  Are you suggesting that we prohibit
usage of dirty limits/ratios when use_hierarchy=1?  This is appealing
because it does not expose the user to unexpected behavior.  Only the
well supported case would be configurable.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 02/11] memcg: document cgroup dirty memory interfaces
  2010-10-20  0:11       ` KAMEZAWA Hiroyuki
  2010-10-20  0:45         ` Greg Thelen
@ 2010-10-20  0:48         ` Daisuke Nishimura
  2010-10-20  1:14           ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 64+ messages in thread
From: Daisuke Nishimura @ 2010-10-20  0:48 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Greg Thelen, Andrew Morton, linux-kernel, linux-mm, containers,
	Andrea Righi, Balbir Singh, Minchan Kim, Ciju Rajan K,
	David Rientjes, Daisuke Nishimura

On Wed, 20 Oct 2010 09:11:09 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Tue, 19 Oct 2010 14:00:58 -0700
> Greg Thelen <gthelen@google.com> wrote:
> 
(snip)
> > +When use_hierarchy=0, each cgroup has independent dirty memory usage and limits.
> > +
> > +When use_hierarchy=1, a parent cgroup increasing its dirty memory usage will
> > +compare its total_dirty memory (which includes sum of all child cgroup dirty
> > +memory) to its dirty limits.  This keeps a parent from explicitly exceeding its
> > +dirty limits.  However, a child cgroup can increase its dirty usage without
> > +considering the parent's dirty limits.  Thus the parent's total_dirty can exceed
> > +the parent's dirty limits as a child dirties pages.
> 
> Hmm. in short, dirty_ratio in use_hierarchy=1 doesn't work as an user expects.
> Is this a spec. or a current implementation ?
> 
> I think as following.
>  - add a limitation as "At setting chidlren's dirty_ratio, it must be below parent's.
>    If it exceeds parent's dirty_ratio, EINVAL is returned."
> 
> Could you modify setting memory.dirty_ratio code ?
> Then, parent's dirty_ratio will never exceeds its own. (If I understand correctly.)
> 
> "memory.dirty_limit_in_bytes" will be a bit more complecated, but I think you can.
> 
I agree.

At first impression, this limitation seems a bit of an overkill to me, because
we currently allow a child's memory.limit_in_bytes to be bigger than its parent's.
But on further consideration, the situation is different, because usage_in_bytes
never exceeds limit_in_bytes.


Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 07/11] memcg: add dirty limits to mem_cgroup
  2010-10-19  0:39 ` [PATCH v3 07/11] memcg: add dirty limits to mem_cgroup Greg Thelen
  2010-10-19  0:53   ` KAMEZAWA Hiroyuki
@ 2010-10-20  0:50   ` Daisuke Nishimura
  2010-10-20  4:08     ` Greg Thelen
  1 sibling, 1 reply; 64+ messages in thread
From: Daisuke Nishimura @ 2010-10-20  0:50 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Minchan Kim, Ciju Rajan K,
	David Rientjes, Daisuke Nishimura

> +static unsigned long long
> +memcg_hierarchical_free_pages(struct mem_cgroup *mem)
> +{
> +	struct cgroup *cgroup;
> +	unsigned long long min_free, free;
> +
> +	min_free = res_counter_read_u64(&mem->res, RES_LIMIT) -
> +		res_counter_read_u64(&mem->res, RES_USAGE);
> +	cgroup = mem->css.cgroup;
> +	if (!mem->use_hierarchy)
> +		goto out;
> +
> +	while (cgroup->parent) {
> +		cgroup = cgroup->parent;
> +		mem = mem_cgroup_from_cont(cgroup);
> +		if (!mem->use_hierarchy)
> +			break;
> +		free = res_counter_read_u64(&mem->res, RES_LIMIT) -
> +			res_counter_read_u64(&mem->res, RES_USAGE);
> +		min_free = min(min_free, free);
> +	}
> +out:
> +	/* Translate free memory in pages */
> +	return min_free >> PAGE_SHIFT;
> +}
> +
I think you can simplify this function using parent_mem_cgroup().

	unsigned long long free, min_free = ULLONG_MAX;

	while (mem) {
		free = res_counter_read_u64(&mem->res, RES_LIMIT) -
			res_counter_read_u64(&mem->res, RES_USAGE);
		min_free = min(min_free, free);
		mem = parent_mem_cgroup(mem);
	}

	/* Translate free memory in pages */
	return min_free >> PAGE_SHIFT;

And, IMHO, we should return min(global_page_state(NR_FREE_PAGES), min_free >> PAGE_SHIFT),
because we are allowed to set no limit (or a very big limit) in a memcg, so min_free
can be very big if no limit is set on any memcg in the hierarchy.


Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 05/11] memcg: add dirty page accounting infrastructure
  2010-10-19  0:39 ` [PATCH v3 05/11] memcg: add dirty page accounting infrastructure Greg Thelen
  2010-10-19  0:49   ` KAMEZAWA Hiroyuki
@ 2010-10-20  0:53   ` Daisuke Nishimura
  1 sibling, 0 replies; 64+ messages in thread
From: Daisuke Nishimura @ 2010-10-20  0:53 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Minchan Kim, Ciju Rajan K,
	David Rientjes, Daisuke Nishimura

On Mon, 18 Oct 2010 17:39:38 -0700
Greg Thelen <gthelen@google.com> wrote:

> Add memcg routines to track dirty, writeback, and unstable_NFS pages.
> These routines are not yet used by the kernel to count such pages.
> A later change adds kernel calls to these new routines.
> 
> Signed-off-by: Greg Thelen <gthelen@google.com>
> Signed-off-by: Andrea Righi <arighi@develer.com>

Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 02/11] memcg: document cgroup dirty memory interfaces
  2010-10-20  0:48         ` Daisuke Nishimura
@ 2010-10-20  1:14           ` KAMEZAWA Hiroyuki
  2010-10-20  2:24             ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-20  1:14 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: Greg Thelen, Andrew Morton, linux-kernel, linux-mm, containers,
	Andrea Righi, Balbir Singh, Minchan Kim, Ciju Rajan K,
	David Rientjes

On Wed, 20 Oct 2010 09:48:21 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Wed, 20 Oct 2010 09:11:09 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Tue, 19 Oct 2010 14:00:58 -0700
> > Greg Thelen <gthelen@google.com> wrote:
> > 
> (snip)
> > > +When use_hierarchy=0, each cgroup has independent dirty memory usage and limits.
> > > +
> > > +When use_hierarchy=1, a parent cgroup increasing its dirty memory usage will
> > > +compare its total_dirty memory (which includes sum of all child cgroup dirty
> > > +memory) to its dirty limits.  This keeps a parent from explicitly exceeding its
> > > +dirty limits.  However, a child cgroup can increase its dirty usage without
> > > +considering the parent's dirty limits.  Thus the parent's total_dirty can exceed
> > > +the parent's dirty limits as a child dirties pages.
> > 
> > Hmm. in short, dirty_ratio in use_hierarchy=1 doesn't work as an user expects.
> > Is this a spec. or a current implementation ?
> > 
> > I think as following.
> >  - add a limitation as "At setting chidlren's dirty_ratio, it must be below parent's.
> >    If it exceeds parent's dirty_ratio, EINVAL is returned."
> > 
> > Could you modify setting memory.dirty_ratio code ?
> > Then, parent's dirty_ratio will never exceeds its own. (If I understand correctly.)
> > 
> > "memory.dirty_limit_in_bytes" will be a bit more complecated, but I think you can.
> > 
> I agree.
> 
> At the first impression, this limitation seems a bit overkill for me, because
> we allow memory.limit_in_bytes of a child bigger than that of parent now.
> But considering more, the situation is different, because usage_in_bytes never
> exceeds limit_in_bytes.
> 

I'd like to consider a patch.
Please mention that "the use_hierarchy=1 case depends on the implementation." for now.

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 02/11] memcg: document cgroup dirty memory interfaces
  2010-10-20  1:14           ` KAMEZAWA Hiroyuki
@ 2010-10-20  2:24             ` KAMEZAWA Hiroyuki
  2010-10-20  3:47               ` Daisuke Nishimura
  0 siblings, 1 reply; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-20  2:24 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, Greg Thelen, Andrew Morton, linux-kernel,
	linux-mm, containers, Andrea Righi, Balbir Singh, Minchan Kim,
	Ciju Rajan K, David Rientjes

On Wed, 20 Oct 2010 10:14:21 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Wed, 20 Oct 2010 09:48:21 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > On Wed, 20 Oct 2010 09:11:09 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > 
> > > On Tue, 19 Oct 2010 14:00:58 -0700
> > > Greg Thelen <gthelen@google.com> wrote:
> > > 
> > (snip)
> > > > +When use_hierarchy=0, each cgroup has independent dirty memory usage and limits.
> > > > +
> > > > +When use_hierarchy=1, a parent cgroup increasing its dirty memory usage will
> > > > +compare its total_dirty memory (which includes sum of all child cgroup dirty
> > > > +memory) to its dirty limits.  This keeps a parent from explicitly exceeding its
> > > > +dirty limits.  However, a child cgroup can increase its dirty usage without
> > > > +considering the parent's dirty limits.  Thus the parent's total_dirty can exceed
> > > > +the parent's dirty limits as a child dirties pages.
> > > 
> > > Hmm. in short, dirty_ratio in use_hierarchy=1 doesn't work as an user expects.
> > > Is this a spec. or a current implementation ?
> > > 
> > > I think as following.
> > >  - add a limitation as "At setting chidlren's dirty_ratio, it must be below parent's.
> > >    If it exceeds parent's dirty_ratio, EINVAL is returned."
> > > 
> > > Could you modify setting memory.dirty_ratio code ?
> > > Then, parent's dirty_ratio will never exceeds its own. (If I understand correctly.)
> > > 
> > > "memory.dirty_limit_in_bytes" will be a bit more complecated, but I think you can.
> > > 
> > I agree.
> > 
> > At the first impression, this limitation seems a bit overkill for me, because
> > we allow memory.limit_in_bytes of a child bigger than that of parent now.
> > But considering more, the situation is different, because usage_in_bytes never
> > exceeds limit_in_bytes.
> > 
> 
> I'd like to consider a patch.
> Please mention that "use_hierarchy=1 case depends on implemenation." for now.
> 

BTW, how about supporting dirty_limit_in_bytes when use_hierarchy=0 and leaving it
as broken when use_hierarchy=1?
It seems we can only support dirty_ratio when hierarchy is used.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH][memcg+dirtylimit] Fix  overwriting global vm dirty limit setting by memcg (Re: [PATCH v3 00/11] memcg: per cgroup dirty page accounting
  2010-10-19  0:39 [PATCH v3 00/11] memcg: per cgroup dirty page accounting Greg Thelen
                   ` (10 preceding siblings ...)
  2010-10-19  0:39 ` [PATCH v3 11/11] memcg: check memcg dirty limits in page writeback Greg Thelen
@ 2010-10-20  3:21 ` KAMEZAWA Hiroyuki
  2010-10-20  4:14   ` KAMEZAWA Hiroyuki
  2010-10-20  5:02   ` [PATCH v2][memcg+dirtylimit] " KAMEZAWA Hiroyuki
  11 siblings, 2 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-20  3:21 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes


One bug fix here.
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Now, when calculating the dirty limit, vm_dirty_param() is called.
This function returns dirty-limit related parameters considering
memory cgroup settings.

Now, assume that vm_dirty_bytes=100M (global dirty limit), the memory
cgroup has 1G of pages with dirty_ratio=40, and the dirtyable memory is
500MB.

In this case, global_dirty_limits will consider the dirty limit to be
500MB * 0.4 = 200MB, which is above the 100MB global limit.  This is
bad... the memory cgroup must not be a back door around the global setting.

This patch limits the return value of vm_dirty_param() considering the
global settings.


Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h |    4 ++--
 mm/memcontrol.c            |   29 ++++++++++++++++++++++++++++-
 mm/page-writeback.c        |    2 +-
 3 files changed, 31 insertions(+), 4 deletions(-)

Index: dirty_limit_new/mm/memcontrol.c
===================================================================
--- dirty_limit_new.orig/mm/memcontrol.c
+++ dirty_limit_new/mm/memcontrol.c
@@ -1171,9 +1171,10 @@ static void __mem_cgroup_dirty_param(str
  * can be moved after our access and writeback tends to take long time.  At
  * least, "memcg" will not be freed while holding rcu_read_lock().
  */
-void vm_dirty_param(struct vm_dirty_param *param)
+void vm_dirty_param(struct vm_dirty_param *param, u64 mem)
 {
 	struct mem_cgroup *memcg;
+	u64 limit, bglimit;
 
 	if (mem_cgroup_disabled()) {
 		global_vm_dirty_param(param);
@@ -1183,6 +1184,32 @@ void vm_dirty_param(struct vm_dirty_para
 	rcu_read_lock();
 	memcg = mem_cgroup_from_task(current);
 	__mem_cgroup_dirty_param(param, memcg);
+	/*
+	 * A limit inside a memory cgroup must also honor the global vm limits.
+	 */
+	if (vm_dirty_ratio)
+		limit = mem * vm_dirty_ratio / 100;
+	else
+		limit = vm_dirty_bytes;
+	if (param->dirty_ratio) {
+		param->dirty_bytes = mem * param->dirty_ratio / 100;
+		param->dirty_ratio = 0;
+	}
+	if (param->dirty_bytes > limit)
+		param->dirty_bytes = limit;
+
+	if (dirty_background_ratio)
+		bglimit = mem * dirty_background_ratio / 100;
+	else
+		bglimit = dirty_background_bytes;
+
+	if (param->dirty_background_ratio) {
+		param->dirty_background_bytes =
+			mem * param->dirty_background_ratio /100;
+		param->dirty_background_ratio = 0;
+	}
+	if (param->dirty_background_bytes > bglimit)
+		param->dirty_background_bytes = bglimit;
 	rcu_read_unlock();
 }
 
Index: dirty_limit_new/include/linux/memcontrol.h
===================================================================
--- dirty_limit_new.orig/include/linux/memcontrol.h
+++ dirty_limit_new/include/linux/memcontrol.h
@@ -171,7 +171,7 @@ static inline void mem_cgroup_dec_page_s
 }
 
 bool mem_cgroup_has_dirty_limit(void);
-void vm_dirty_param(struct vm_dirty_param *param);
+void vm_dirty_param(struct vm_dirty_param *param, u64 mem);
 s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item);
 
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
@@ -360,7 +360,7 @@ static inline bool mem_cgroup_has_dirty_
 	return false;
 }
 
-static inline void vm_dirty_param(struct vm_dirty_param *param)
+static inline void vm_dirty_param(struct vm_dirty_param *param, u64 mem)
 {
 	global_vm_dirty_param(param);
 }
Index: dirty_limit_new/mm/page-writeback.c
===================================================================
--- dirty_limit_new.orig/mm/page-writeback.c
+++ dirty_limit_new/mm/page-writeback.c
@@ -466,7 +466,7 @@ void global_dirty_limits(unsigned long *
 	struct task_struct *tsk;
 	struct vm_dirty_param dirty_param;
 
-	vm_dirty_param(&dirty_param);
+	vm_dirty_param(&dirty_param, available_memory);
 
 	if (dirty_param.dirty_bytes)
 		dirty = DIV_ROUND_UP(dirty_param.dirty_bytes, PAGE_SIZE);


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 09/11] memcg: add cgroupfs interface to memcg dirty limits
  2010-10-19  0:39 ` [PATCH v3 09/11] memcg: add cgroupfs interface to memcg dirty limits Greg Thelen
  2010-10-19  0:56   ` KAMEZAWA Hiroyuki
@ 2010-10-20  3:31   ` Daisuke Nishimura
  2010-10-20  3:44     ` KAMEZAWA Hiroyuki
  2010-10-20  3:46     ` Daisuke Nishimura
  1 sibling, 2 replies; 64+ messages in thread
From: Daisuke Nishimura @ 2010-10-20  3:31 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Minchan Kim, Ciju Rajan K,
	David Rientjes, Daisuke Nishimura

On Mon, 18 Oct 2010 17:39:42 -0700
Greg Thelen <gthelen@google.com> wrote:

> Add cgroupfs interface to memcg dirty page limits:
>   Direct write-out is controlled with:
>   - memory.dirty_ratio
>   - memory.dirty_limit_in_bytes
> 
>   Background write-out is controlled with:
>   - memory.dirty_background_ratio
>   - memory.dirty_background_limit_bytes
> 
> Other memcg cgroupfs files support 'M', 'm', 'k', 'K', 'g'
> and 'G' suffixes for byte counts.  This patch provides the
> same functionality for memory.dirty_limit_in_bytes and
> memory.dirty_background_limit_bytes.
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>
> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>

Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

One question: shouldn't we return -EINVAL when a value bigger than the global one
(if any) is written to dirty(_background)_limit_bytes?  Or do you intentionally
set the input value without comparing it with the global value?
But, hmm..., IMHO we should check it in __mem_cgroup_dirty_param() or somewhere,
so as not to allow more dirty pages than the global limit.


Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 09/11] memcg: add cgroupfs interface to memcg dirty limits
  2010-10-20  3:31   ` Daisuke Nishimura
@ 2010-10-20  3:44     ` KAMEZAWA Hiroyuki
  2010-10-20  3:46     ` Daisuke Nishimura
  1 sibling, 0 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-20  3:44 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: Greg Thelen, Andrew Morton, linux-kernel, linux-mm, containers,
	Andrea Righi, Balbir Singh, Minchan Kim, Ciju Rajan K,
	David Rientjes

On Wed, 20 Oct 2010 12:31:10 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Mon, 18 Oct 2010 17:39:42 -0700
> Greg Thelen <gthelen@google.com> wrote:
> 
> > Add cgroupfs interface to memcg dirty page limits:
> >   Direct write-out is controlled with:
> >   - memory.dirty_ratio
> >   - memory.dirty_limit_in_bytes
> > 
> >   Background write-out is controlled with:
> >   - memory.dirty_background_ratio
> >   - memory.dirty_background_limit_bytes
> > 
> > Other memcg cgroupfs files support 'M', 'm', 'k', 'K', 'g'
> > and 'G' suffixes for byte counts.  This patch provides the
> > same functionality for memory.dirty_limit_in_bytes and
> > memory.dirty_background_limit_bytes.
> > 
> > Signed-off-by: Andrea Righi <arighi@develer.com>
> > Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> > Signed-off-by: Greg Thelen <gthelen@google.com>
> 
> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> 
> One question: shouldn't we return -EINVAL when writing to dirty(_background)_limit_bytes
> a bigger value than that of global one(if any) 

This should be checked. I'm now writing an add-on for it.

> ? Or do you intentionally
> set the input value without comparing it with the global value ?

Please see the patch I sent ("[memcg+dirtylimit] Fix overwriting global vm dirty limit setting by memcg").

IMHO, a check when the value is set is not helpful, because the global value can be
changed after we set this.  My patch checks it when calculating the dirtyable bytes.


> But, hmm..., IMHO we should check it in __mem_cgroup_dirty_param() or something
> not to allow dirty pages more than global limit.
> 
yes.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 09/11] memcg: add cgroupfs interface to memcg dirty limits
  2010-10-20  3:31   ` Daisuke Nishimura
  2010-10-20  3:44     ` KAMEZAWA Hiroyuki
@ 2010-10-20  3:46     ` Daisuke Nishimura
  1 sibling, 0 replies; 64+ messages in thread
From: Daisuke Nishimura @ 2010-10-20  3:46 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Minchan Kim, Ciju Rajan K,
	David Rientjes, Daisuke Nishimura

On Wed, 20 Oct 2010 12:31:10 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Mon, 18 Oct 2010 17:39:42 -0700
> Greg Thelen <gthelen@google.com> wrote:
> 
> > Add cgroupfs interface to memcg dirty page limits:
> >   Direct write-out is controlled with:
> >   - memory.dirty_ratio
> >   - memory.dirty_limit_in_bytes
> > 
> >   Background write-out is controlled with:
> >   - memory.dirty_background_ratio
> >   - memory.dirty_background_limit_bytes
> > 
> > Other memcg cgroupfs files support 'M', 'm', 'k', 'K', 'g'
> > and 'G' suffixes for byte counts.  This patch provides the
> > same functionality for memory.dirty_limit_in_bytes and
> > memory.dirty_background_limit_bytes.
> > 
> > Signed-off-by: Andrea Righi <arighi@develer.com>
> > Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> > Signed-off-by: Greg Thelen <gthelen@google.com>
> 
> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> 
> One question: shouldn't we return -EINVAL when writing to dirty(_background)_limit_bytes
> a bigger value than that of global one(if any) ? Or do you intentionally
> set the input value without comparing it with the global value ?
> But, hmm..., IMHO we should check it in __mem_cgroup_dirty_param() or something
> not to allow dirty pages more than global limit.
> 
Oh, Kamezawa-san has just sent a fix for this problem :)
Please ignore this comment.

Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 02/11] memcg: document cgroup dirty memory interfaces
  2010-10-20  2:24             ` KAMEZAWA Hiroyuki
@ 2010-10-20  3:47               ` Daisuke Nishimura
  0 siblings, 0 replies; 64+ messages in thread
From: Daisuke Nishimura @ 2010-10-20  3:47 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Greg Thelen, Andrew Morton, linux-kernel, linux-mm, containers,
	Andrea Righi, Balbir Singh, Minchan Kim, Ciju Rajan K,
	David Rientjes, Daisuke Nishimura

On Wed, 20 Oct 2010 11:24:31 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Wed, 20 Oct 2010 10:14:21 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Wed, 20 Oct 2010 09:48:21 +0900
> > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > 
> > > On Wed, 20 Oct 2010 09:11:09 +0900
> > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > 
> > > > On Tue, 19 Oct 2010 14:00:58 -0700
> > > > Greg Thelen <gthelen@google.com> wrote:
> > > > 
> > > (snip)
> > > > > +When use_hierarchy=0, each cgroup has independent dirty memory usage and limits.
> > > > > +
> > > > > +When use_hierarchy=1, a parent cgroup increasing its dirty memory usage will
> > > > > +compare its total_dirty memory (which includes sum of all child cgroup dirty
> > > > > +memory) to its dirty limits.  This keeps a parent from explicitly exceeding its
> > > > > +dirty limits.  However, a child cgroup can increase its dirty usage without
> > > > > +considering the parent's dirty limits.  Thus the parent's total_dirty can exceed
> > > > > +the parent's dirty limits as a child dirties pages.
> > > > 
> > > > Hmm. in short, dirty_ratio in use_hierarchy=1 doesn't work as an user expects.
> > > > Is this a spec. or a current implementation ?
> > > > 
> > > > I think as following.
> > > >  - add a limitation as "At setting chidlren's dirty_ratio, it must be below parent's.
> > > >    If it exceeds parent's dirty_ratio, EINVAL is returned."
> > > > 
> > > > Could you modify setting memory.dirty_ratio code ?
> > > > Then, parent's dirty_ratio will never exceeds its own. (If I understand correctly.)
> > > > 
> > > > "memory.dirty_limit_in_bytes" will be a bit more complecated, but I think you can.
> > > > 
> > > I agree.
> > > 
> > > At the first impression, this limitation seems a bit overkill for me, because
> > > we allow memory.limit_in_bytes of a child bigger than that of parent now.
> > > But considering more, the situation is different, because usage_in_bytes never
> > > exceeds limit_in_bytes.
> > > 
> > 
> > I'd like to consider a patch.
> > Please mention that "use_hierarchy=1 case depends on implemenation." for now.
> > 
> 
> BTW, how about supporing dirty_limit_in_bytes when use_hierarchy=0 or leave it as
> broken when use_hierarchy=1 ?
> It seems we can only support dirty_ratio when hierarchy is used.
> 
It's all right for me.
This feature would be useful even w/o hierarchy support.

Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 10/11] writeback: make determine_dirtyable_memory() static.
  2010-10-19  0:39 ` [PATCH v3 10/11] writeback: make determine_dirtyable_memory() static Greg Thelen
  2010-10-19  0:57   ` KAMEZAWA Hiroyuki
@ 2010-10-20  3:47   ` Daisuke Nishimura
  1 sibling, 0 replies; 64+ messages in thread
From: Daisuke Nishimura @ 2010-10-20  3:47 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Daisuke Nishimura, Andrew Morton, linux-kernel, linux-mm,
	containers, Andrea Righi, Balbir Singh, KAMEZAWA Hiroyuki,
	Minchan Kim, Ciju Rajan K, David Rientjes

On Mon, 18 Oct 2010 17:39:43 -0700
Greg Thelen <gthelen@google.com> wrote:

> The determine_dirtyable_memory() function is not used outside of
> page writeback.  Make the routine static.  No functional change.
> Just a cleanup in preparation for a change that adds memcg dirty
> limits consideration into global_dirty_limits().
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>

Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 08/11] memcg: CPU hotplug lockdep warning fix
  2010-10-19  0:39 ` [PATCH v3 08/11] memcg: CPU hotplug lockdep warning fix Greg Thelen
  2010-10-19  0:54   ` KAMEZAWA Hiroyuki
@ 2010-10-20  3:47   ` Daisuke Nishimura
  1 sibling, 0 replies; 64+ messages in thread
From: Daisuke Nishimura @ 2010-10-20  3:47 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Minchan Kim, Ciju Rajan K,
	David Rientjes, Daisuke Nishimura

On Mon, 18 Oct 2010 17:39:41 -0700
Greg Thelen <gthelen@google.com> wrote:

> From: Balbir Singh <balbir@linux.vnet.ibm.com>
> 
> memcg has lockdep warnings (sleep inside rcu lock)
> 
> From: Balbir Singh <balbir@linux.vnet.ibm.com>
> 
> Recent move to get_online_cpus() ends up calling get_online_cpus() from
> mem_cgroup_read_stat(). However mem_cgroup_read_stat() is called under rcu
> lock. get_online_cpus() can sleep. The dirty limit patches expose
> this BUG more readily due to their usage of mem_cgroup_page_stat()
> 
> This patch address this issue as identified by lockdep and moves the
> hotplug protection to a higher layer. This might increase the time
> required to hotplug, but not by much.
> 
> Warning messages
> 
> BUG: sleeping function called from invalid context at kernel/cpu.c:62
> in_atomic(): 0, irqs_disabled(): 0, pid: 6325, name: pagetest
> 2 locks held by pagetest/6325:
> do_page_fault+0x27d/0x4a0
> mem_cgroup_page_stat+0x0/0x23f
> Pid: 6325, comm: pagetest Not tainted 2.6.36-rc5-mm1+ #201
> Call Trace:
> [<ffffffff81041224>] __might_sleep+0x12d/0x131
> [<ffffffff8104f4af>] get_online_cpus+0x1c/0x51
> [<ffffffff8110eedb>] mem_cgroup_read_stat+0x27/0xa3
> [<ffffffff811125d2>] mem_cgroup_page_stat+0x131/0x23f
> [<ffffffff811124a1>] ? mem_cgroup_page_stat+0x0/0x23f
> [<ffffffff810d57c3>] global_dirty_limits+0x42/0xf8
> [<ffffffff810d58b3>] throttle_vm_writeout+0x3a/0xb4
> [<ffffffff810dc2f8>] shrink_zone+0x3e6/0x3f8
> [<ffffffff81074a35>] ? ktime_get_ts+0xb2/0xbf
> [<ffffffff810dd1aa>] do_try_to_free_pages+0x106/0x478
> [<ffffffff810dd601>] try_to_free_mem_cgroup_pages+0xe5/0x14c
> [<ffffffff8110f947>] mem_cgroup_hierarchical_reclaim+0x314/0x3a2
> [<ffffffff81111b31>] __mem_cgroup_try_charge+0x29b/0x593
> [<ffffffff8111194a>] ? __mem_cgroup_try_charge+0xb4/0x593
> [<ffffffff81071258>] ? local_clock+0x40/0x59
> [<ffffffff81009015>] ? sched_clock+0x9/0xd
> [<ffffffff810710d5>] ? sched_clock_local+0x1c/0x82
> [<ffffffff8111398a>] mem_cgroup_charge_common+0x4b/0x76
> [<ffffffff81141469>] ? bio_add_page+0x36/0x38
> [<ffffffff81113ba9>] mem_cgroup_cache_charge+0x1f4/0x214
> [<ffffffff810cd195>] add_to_page_cache_locked+0x4a/0x148
> ....
> 
> Acked-by: Greg Thelen <gthelen@google.com>
> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>

Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 02/11] memcg: document cgroup dirty memory interfaces
  2010-10-20  0:45         ` Greg Thelen
@ 2010-10-20  4:06           ` KAMEZAWA Hiroyuki
  2010-10-20  4:25             ` Greg Thelen
  0 siblings, 1 reply; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-20  4:06 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Daisuke Nishimura, Andrew Morton, linux-kernel, linux-mm,
	containers, Andrea Righi, Balbir Singh, Minchan Kim,
	Ciju Rajan K, David Rientjes

On Tue, 19 Oct 2010 17:45:08 -0700
Greg Thelen <gthelen@google.com> wrote:

> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:
> > BTW, how about supporing dirty_limit_in_bytes when use_hierarchy=0 or
> > leave it as broken when use_hierarchy=1 ?  It seems we can only
> > support dirty_ratio when hierarchy is used.
> 
> I am not sure what you mean here.

When using dirty_ratio, we can check the value of dirty_ratio at the time it is
set and guarantee that a child's dirty_ratio never exceeds its parent's.

If we guarantee that, we can keep supporting dirty_ratio even under hierarchy.
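
For illustration, a set-time check along those lines could look like the
sketch below (the dirty_ratio field and the handler name are assumptions,
not code from this series; lowering a parent's ratio would also need a
symmetric walk over its children):

static int mem_cgroup_dirty_ratio_write(struct cgroup *cgrp,
					struct cftype *cft, u64 val)
{
	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
	struct mem_cgroup *parent = parent_mem_cgroup(memcg);

	if (val > 100)
		return -EINVAL;
	/* a child may never be given a larger ratio than its parent */
	if (parent && memcg->use_hierarchy && val > parent->dirty_ratio)
		return -EINVAL;

	memcg->dirty_ratio = val;
	return 0;
}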

When it comes to dirty_limit_in_bytes, we are never able to do that kind of
control. So it will be broken and will behave differently from
dirty_ratio.

So, not supporting dirty_bytes when use_hierarchy==1 for now sounds reasonable to me.

Thanks,
-Kame






^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 07/11] memcg: add dirty limits to mem_cgroup
  2010-10-20  0:50   ` Daisuke Nishimura
@ 2010-10-20  4:08     ` Greg Thelen
  0 siblings, 0 replies; 64+ messages in thread
From: Greg Thelen @ 2010-10-20  4:08 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Minchan Kim, Ciju Rajan K,
	David Rientjes

Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> writes:

>> +static unsigned long long
>> +memcg_hierarchical_free_pages(struct mem_cgroup *mem)
>> +{
>> +	struct cgroup *cgroup;
>> +	unsigned long long min_free, free;
>> +
>> +	min_free = res_counter_read_u64(&mem->res, RES_LIMIT) -
>> +		res_counter_read_u64(&mem->res, RES_USAGE);
>> +	cgroup = mem->css.cgroup;
>> +	if (!mem->use_hierarchy)
>> +		goto out;
>> +
>> +	while (cgroup->parent) {
>> +		cgroup = cgroup->parent;
>> +		mem = mem_cgroup_from_cont(cgroup);
>> +		if (!mem->use_hierarchy)
>> +			break;
>> +		free = res_counter_read_u64(&mem->res, RES_LIMIT) -
>> +			res_counter_read_u64(&mem->res, RES_USAGE);
>> +		min_free = min(min_free, free);
>> +	}
>> +out:
>> +	/* Translate free memory in pages */
>> +	return min_free >> PAGE_SHIFT;
>> +}
>> +
> I think you can simplify this function using parent_mem_cgroup().
>
> 	unsigned long free, min_free = ULLONG_MAX;
>
> 	while (mem) {
> 		free = res_counter_read_u64(&mem->res, RES_LIMIT) -
> 			res_counter_read_u64(&mem->res, RES_USAGE);
> 		min_free = min(min_free, free);
> 		mem = parent_mem_cgroup();
> 	}
>
> 	/* Translate free memory in pages */
> 	return min_free >> PAGE_SHIFT;
>
> And, IMHO, we should return min(global_page_state(NR_FREE_PAGES), min_free >> PAGE_SHIFT).
> Because we are allowed to set no-limit(or a very big limit) in memcg,
> so min_free can be very big if we don't set a limit against all the memcg's in hierarchy.
>
>
> Thanks,
> Dasiuke Nishimura.

Thank you.  This is a good suggestion.  I will update the patch to include this.

--
Greg

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH][memcg+dirtylimit] Fix  overwriting global vm dirty limit setting by memcg (Re: [PATCH v3 00/11] memcg: per cgroup dirty page accounting
  2010-10-20  3:21 ` [PATCH][memcg+dirtylimit] Fix overwriting global vm dirty limit setting by memcg (Re: [PATCH v3 00/11] memcg: per cgroup dirty page accounting KAMEZAWA Hiroyuki
@ 2010-10-20  4:14   ` KAMEZAWA Hiroyuki
  2010-10-20  5:02   ` [PATCH v2][memcg+dirtylimit] " KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-20  4:14 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Greg Thelen, Andrew Morton, linux-kernel, linux-mm, containers,
	Andrea Righi, Balbir Singh, Daisuke Nishimura, Minchan Kim,
	Ciju Rajan K, David Rientjes

On Wed, 20 Oct 2010 12:21:44 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> 
> One bug fix here.
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Now, at calculating dirty limit, vm_dirty_param() is called.
> This function returns dirty-limit related parameters considering
> memory cgroup settings.
> 
> Now, assume that vm_dirty_bytes=100M (global dirty limit) and
> memory cgroup has 1G of pages and 40 dirty_ratio, dirtyable memory is
> 500MB.
> 
> In this case, global_dirty_limits will consider dirty_limt as
> 500 *0.4 = 200MB. This is bad...memory cgroup is not back door.
> 
> This patch limits the return value of vm_dirty_param() considring
> global settings.
> 
> 

Sorry, this one is buggy. I'll post a new one later.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 11/11] memcg: check memcg dirty limits in page writeback
  2010-10-19  1:00   ` KAMEZAWA Hiroyuki
@ 2010-10-20  4:18     ` KAMEZAWA Hiroyuki
  2010-10-20  4:33       ` Greg Thelen
  2010-10-20  4:34       ` Daisuke Nishimura
  0 siblings, 2 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-20  4:18 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Greg Thelen, Andrew Morton, linux-kernel, linux-mm, containers,
	Andrea Righi, Balbir Singh, Daisuke Nishimura, Minchan Kim,
	Ciju Rajan K, David Rientjes

On Tue, 19 Oct 2010 10:00:15 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Mon, 18 Oct 2010 17:39:44 -0700
> Greg Thelen <gthelen@google.com> wrote:
> 
> > If the current process is in a non-root memcg, then
> > global_dirty_limits() will consider the memcg dirty limit.
> > This allows different cgroups to have distinct dirty limits
> > which trigger direct and background writeback at different
> > levels.
> > 
> > Signed-off-by: Andrea Righi <arighi@develer.com>
> > Signed-off-by: Greg Thelen <gthelen@google.com>
> 
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
Why isn't FREEPAGES in a memcg counted as dirtyable?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 02/11] memcg: document cgroup dirty memory interfaces
  2010-10-20  4:06           ` KAMEZAWA Hiroyuki
@ 2010-10-20  4:25             ` Greg Thelen
  2010-10-20  4:26               ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 64+ messages in thread
From: Greg Thelen @ 2010-10-20  4:25 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, Andrew Morton, linux-kernel, linux-mm,
	containers, Andrea Righi, Balbir Singh, Minchan Kim,
	Ciju Rajan K, David Rientjes

KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:

> On Tue, 19 Oct 2010 17:45:08 -0700
> Greg Thelen <gthelen@google.com> wrote:
>
>> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:
>> > BTW, how about supporing dirty_limit_in_bytes when use_hierarchy=0 or
>> > leave it as broken when use_hierarchy=1 ?  It seems we can only
>> > support dirty_ratio when hierarchy is used.
>> 
>> I am not sure what you mean here.
>
> When using dirty_ratio, we can check the value of dirty_ratio at setting it
> and make guarantee that any children's dirty_ratio cannot exceeds it parent's.
>
> If we guarantee that, we can keep dirty_ratio even under hierarchy.
>
> When it comes to dirty_limit_in_bytes, we never able to do such kind of
> controls. So, it will be broken and will do different behavior than
> dirty_ratio.

I think that for use_hierarchy=1, we could support either dirty_ratio or
dirty_limit_in_bytes.  The code that modifies dirty_limit_in_bytes could
ensure that the sum of the dirty_limit_in_bytes of each child does not
exceed the parent's dirty_limit_in_bytes.
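
A rough sketch of that kind of set-time check (the dirty_limit_in_bytes
field, the helper name, and the omitted hierarchy locking are illustrative
assumptions, not code from this series):

static int memcg_child_dirty_bytes_ok(struct mem_cgroup *parent,
				      struct mem_cgroup *child, u64 new_limit)
{
	struct cgroup *cur;
	u64 sum = new_limit;

	/* cgroup hierarchy locking is omitted in this sketch */
	list_for_each_entry(cur, &parent->css.cgroup->children, sibling) {
		struct mem_cgroup *sib = mem_cgroup_from_cont(cur);

		if (sib == child)
			continue;	/* already counted as new_limit */
		sum += sib->dirty_limit_in_bytes;
	}
	return sum <= parent->dirty_limit_in_bytes;
}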

> So, not supporing dirty_bytes when use_hierarchy==1 for now sounds
> reasonable to me.

Ok, I will add the use_hierarchy==1 check and repost the patches.

I will wait to post the -v4 patch series until you post an improved
"[PATCH][memcg+dirtylimit] Fix overwriting global vm dirty limit setting
by memcg (Re: [PATCH v3 00/11] memcg: per cgroup dirty page accounting"
patch.  I think it makes sense to integrate that into -v4 of the series.

> Thanks,
> -Kame

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 02/11] memcg: document cgroup dirty memory interfaces
  2010-10-20  4:25             ` Greg Thelen
@ 2010-10-20  4:26               ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-20  4:26 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Daisuke Nishimura, Andrew Morton, linux-kernel, linux-mm,
	containers, Andrea Righi, Balbir Singh, Minchan Kim,
	Ciju Rajan K, David Rientjes

On Tue, 19 Oct 2010 21:25:53 -0700
Greg Thelen <gthelen@google.com> wrote:

> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:
> 
> > On Tue, 19 Oct 2010 17:45:08 -0700
> > Greg Thelen <gthelen@google.com> wrote:
> >
> >> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:
> >> > BTW, how about supporing dirty_limit_in_bytes when use_hierarchy=0 or
> >> > leave it as broken when use_hierarchy=1 ?  It seems we can only
> >> > support dirty_ratio when hierarchy is used.
> >> 
> >> I am not sure what you mean here.
> >
> > When using dirty_ratio, we can check the value of dirty_ratio at setting it
> > and make guarantee that any children's dirty_ratio cannot exceeds it parent's.
> >
> > If we guarantee that, we can keep dirty_ratio even under hierarchy.
> >
> > When it comes to dirty_limit_in_bytes, we never able to do such kind of
> > controls. So, it will be broken and will do different behavior than
> > dirty_ratio.
> 
> I think that for use_hierarchy=1, we could support either dirty_ratio or
> dirty_limit_in_bytes.  The code that modifies dirty_limit_in_bytes could
> ensure that the sum the dirty_limit_in_bytes of each child does not
> exceed the parent's dirty_limit_in_bytes.
> 
But the sum of all children's dirty_bytes can still exceed the parent's. (Adding
check code would be messy at this stage. Maybe put it on the TODO list.)

> > So, not supporing dirty_bytes when use_hierarchy==1 for now sounds
> > reasonable to me.
> 
> Ok, I will add the use_hierarchy==1 check and repost the patches.
> 
> I will wait to post the -v4 patch series until you post an improved
> "[PATCH][memcg+dirtylimit] Fix overwriting global vm dirty limit setting
> by memcg (Re: [PATCH v3 00/11] memcg: per cgroup dirty page accounting"
> patch.  I think it makes sense to integrate that into -v4 of the series.
> 

Yes...but I'm now wondering how "FreePages" of a memcg should be handled...

Considering only dirtyable pages may make sense, but it is different behavior
from the global one, and the user will see the effect of the limit even when
they use only a small number of pages. I'd like to consider this, too.

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 11/11] memcg: check memcg dirty limits in page writeback
  2010-10-20  4:18     ` KAMEZAWA Hiroyuki
@ 2010-10-20  4:33       ` Greg Thelen
  2010-10-20  4:33         ` KAMEZAWA Hiroyuki
  2010-10-20  4:34       ` Daisuke Nishimura
  1 sibling, 1 reply; 64+ messages in thread
From: Greg Thelen @ 2010-10-20  4:33 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes

KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:

> On Tue, 19 Oct 2010 10:00:15 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
>> On Mon, 18 Oct 2010 17:39:44 -0700
>> Greg Thelen <gthelen@google.com> wrote:
>> 
>> > If the current process is in a non-root memcg, then
>> > global_dirty_limits() will consider the memcg dirty limit.
>> > This allows different cgroups to have distinct dirty limits
>> > which trigger direct and background writeback at different
>> > levels.
>> > 
>> > Signed-off-by: Andrea Righi <arighi@develer.com>
>> > Signed-off-by: Greg Thelen <gthelen@google.com>
>> 
>> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> 
> Why FREEPAGES in memcg is not counted as dirtyable ?
>
> Thanks,
> -Kame

I think that FREEPAGES is considered dirtyable.  Below I include the
latest version of the code, which includes an improved version of
memcg_hierarchical_free_pages().

Notice that mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES) returns the
sum of:
1. compute free pages using memcg_hierarchical_free_pages()
2. mem_cgroup_read_stat(mem, LRU_ACTIVE_FILE) +
	mem_cgroup_read_stat(mem, LRU_INACTIVE_FILE)
3. if (mem_cgroup_can_swap(mem))
   ret += mem_cgroup_read_stat(mem, LRU_ACTIVE_ANON) +
	mem_cgroup_read_stat(mem, LRU_INACTIVE_ANON)

This algorithm is similar to how global (non-memcg) limits are computed
in global_dirtyable_memory().

/*
 * Return the number of pages that the @mem cgroup could allocate.
 * If use_hierarchy is set, then this involves parent mem cgroups to find the
 * cgroup with the smallest free space.
 */
static unsigned long long
memcg_hierarchical_free_pages(struct mem_cgroup *mem)
{
	unsigned long free, min_free;

	min_free = global_page_state(NR_FREE_PAGES) << PAGE_SHIFT;

 	while (mem) {
 		free = res_counter_read_u64(&mem->res, RES_LIMIT) -
 			res_counter_read_u64(&mem->res, RES_USAGE);
 		min_free = min(min_free, free);
 		mem = parent_mem_cgroup(mem);
 	}

 	/* Translate free memory in pages */
 	return min_free >> PAGE_SHIFT;
}

/*
 * mem_cgroup_page_stat() - get memory cgroup file cache statistics
 * @item:      memory statistic item exported to the kernel
 *
 * Return the accounted statistic value.
 */
s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item)
{
	struct mem_cgroup *mem;
	struct mem_cgroup *iter;
	s64 value;

	get_online_cpus();
	rcu_read_lock();
	mem = mem_cgroup_from_task(current);
	if (mem && !mem_cgroup_is_root(mem)) {
		/*
		 * If we're looking for dirtyable pages we need to evaluate
		 * free pages depending on the limit and usage of the parents
		 * first of all.
		 */
		if (item == MEMCG_NR_DIRTYABLE_PAGES)
			value = memcg_hierarchical_free_pages(mem);
		else
			value = 0;
		/*
		 * Recursively evaluate page statistics against all cgroup
		 * under hierarchy tree
		 */
		for_each_mem_cgroup_tree(iter, mem)
			value += mem_cgroup_local_page_stat(iter, item);
	} else
		value = -EINVAL;
	rcu_read_unlock();
	put_online_cpus();

	return value;
}
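
mem_cgroup_local_page_stat() is not shown in this mail; going by items 2. and
3. above, for MEMCG_NR_DIRTYABLE_PAGES it presumably adds the file LRU pages
and, when swap is usable, the anon LRU pages, roughly as in this sketch (the
default branch is a guess):

static s64 mem_cgroup_local_page_stat(struct mem_cgroup *mem,
				      enum mem_cgroup_nr_pages_item item)
{
	s64 ret;

	switch (item) {
	case MEMCG_NR_DIRTYABLE_PAGES:
		/* file pages are always dirtyable ... */
		ret = mem_cgroup_read_stat(mem, LRU_ACTIVE_FILE) +
			mem_cgroup_read_stat(mem, LRU_INACTIVE_FILE);
		/* ... anon pages only if they can be swapped out */
		if (mem_cgroup_can_swap(mem))
			ret += mem_cgroup_read_stat(mem, LRU_ACTIVE_ANON) +
				mem_cgroup_read_stat(mem, LRU_INACTIVE_ANON);
		break;
	default:
		/* dirty/writeback/unstable counters are read directly */
		ret = mem_cgroup_read_stat(mem, item);
		break;
	}
	return ret;
}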

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 11/11] memcg: check memcg dirty limits in page writeback
  2010-10-20  4:33       ` Greg Thelen
@ 2010-10-20  4:33         ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-20  4:33 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes

On Tue, 19 Oct 2010 21:33:21 -0700
Greg Thelen <gthelen@google.com> wrote:

> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:
> 
> > On Tue, 19 Oct 2010 10:00:15 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> >> On Mon, 18 Oct 2010 17:39:44 -0700
> >> Greg Thelen <gthelen@google.com> wrote:
> >> 
> >> > If the current process is in a non-root memcg, then
> >> > global_dirty_limits() will consider the memcg dirty limit.
> >> > This allows different cgroups to have distinct dirty limits
> >> > which trigger direct and background writeback at different
> >> > levels.
> >> > 
> >> > Signed-off-by: Andrea Righi <arighi@develer.com>
> >> > Signed-off-by: Greg Thelen <gthelen@google.com>
> >> 
> >> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >> 
> > Why FREEPAGES in memcg is not counted as dirtyable ?
> >
> > Thanks,
> > -Kame
> 
> I think that FREEPAGES is considered dirtyable.  Below I include the
> latest version of the code, which includes an improved version of
> memcg_hierarchical_free_pages().
> 
> Notice that mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES) returns the
> sum of:
> 1. compute free pages using memcg_hierarchical_free_pages()
> 2. mem_cgroup_read_stat(mem, LRU_ACTIVE_FILE) +
> 	mem_cgroup_read_stat(mem, LRU_INACTIVE_FILE)
> 3. if (mem_cgroup_can_swap(mem))
>    ret += mem_cgroup_read_stat(mem, LRU_ACTIVE_ANON) +
> 	mem_cgroup_read_stat(mem, LRU_INACTIVE_ANON)
> 
> This algorithm is similar to how global (non-memcg) limits are computed
> in global_dirtyable_memory().
> 

seems very nice. Thank you. 

-Kame

> /*
>  * Return the number of the number of pages that the @mem cgroup could allocate.
>  * If use_hierarchy is set, then this involves parent mem cgroups to find the
>  * cgroup with the smallest free space.
>  */
> static unsigned long long
> memcg_hierarchical_free_pages(struct mem_cgroup *mem)
> {
> 	unsigned long free, min_free;
> 
> 	min_free = global_page_state(NR_FREE_PAGES) << PAGE_SHIFT;
> 
>  	while (mem) {
>  		free = res_counter_read_u64(&mem->res, RES_LIMIT) -
>  			res_counter_read_u64(&mem->res, RES_USAGE);
>  		min_free = min(min_free, free);
>  		mem = parent_mem_cgroup(mem);
>  	}
> 
>  	/* Translate free memory in pages */
>  	return min_free >> PAGE_SHIFT;
> }
> 
> /*
>  * mem_cgroup_page_stat() - get memory cgroup file cache statistics
>  * @item:      memory statistic item exported to the kernel
>  *
>  * Return the accounted statistic value.
>  */
> s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item)
> {
> 	struct mem_cgroup *mem;
> 	struct mem_cgroup *iter;
> 	s64 value;
> 
> 	get_online_cpus();
> 	rcu_read_lock();
> 	mem = mem_cgroup_from_task(current);
> 	if (mem && !mem_cgroup_is_root(mem)) {
> 		/*
> 		 * If we're looking for dirtyable pages we need to evaluate
> 		 * free pages depending on the limit and usage of the parents
> 		 * first of all.
> 		 */
> 		if (item == MEMCG_NR_DIRTYABLE_PAGES)
> 			value = memcg_hierarchical_free_pages(mem);
> 		else
> 			value = 0;
> 		/*
> 		 * Recursively evaluate page statistics against all cgroup
> 		 * under hierarchy tree
> 		 */
> 		for_each_mem_cgroup_tree(iter, mem)
> 			value += mem_cgroup_local_page_stat(iter, item);
> 	} else
> 		value = -EINVAL;
> 	rcu_read_unlock();
> 	put_online_cpus();
> 
> 	return value;
> }
> 


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 11/11] memcg: check memcg dirty limits in page writeback
  2010-10-20  4:18     ` KAMEZAWA Hiroyuki
  2010-10-20  4:33       ` Greg Thelen
@ 2010-10-20  4:34       ` Daisuke Nishimura
  1 sibling, 0 replies; 64+ messages in thread
From: Daisuke Nishimura @ 2010-10-20  4:34 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Greg Thelen, Andrew Morton, linux-kernel, linux-mm, containers,
	Andrea Righi, Balbir Singh, Minchan Kim, Ciju Rajan K,
	David Rientjes, Daisuke Nishimura

On Wed, 20 Oct 2010 13:18:57 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Tue, 19 Oct 2010 10:00:15 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Mon, 18 Oct 2010 17:39:44 -0700
> > Greg Thelen <gthelen@google.com> wrote:
> > 
> > > If the current process is in a non-root memcg, then
> > > global_dirty_limits() will consider the memcg dirty limit.
> > > This allows different cgroups to have distinct dirty limits
> > > which trigger direct and background writeback at different
> > > levels.
> > > 
> > > Signed-off-by: Andrea Righi <arighi@develer.com>
> > > Signed-off-by: Greg Thelen <gthelen@google.com>
> > 
> > Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> Why FREEPAGES in memcg is not counted as dirtyable ?
> 
It's counted, via memcg_hierarchical_free_pages(), in mem_cgroup_page_stat().
As for FREEPAGES, we should count ancestors, not children.

mem_cgroup_page_stat():
   1311         if (mem && !mem_cgroup_is_root(mem)) {
   1312                 /*
   1313                  * If we're looking for dirtyable pages we need to evaluate
   1314                  * free pages depending on the limit and usage of the parents
   1315                  * first of all.
   1316                  */
   1317                 if (item == MEMCG_NR_DIRTYABLE_PAGES)
   1318                         value = memcg_hierarchical_free_pages(mem);
   1319                 else
   1320                         value = 0;
   1321                 /*
   1322                  * Recursively evaluate page statistics against all cgroup
   1323                  * under hierarchy tree
   1324                  */
   1325                 for_each_mem_cgroup_tree(iter, mem)
   1326                         value += mem_cgroup_local_page_stat(iter, item);


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH v2][memcg+dirtylimit] Fix  overwriting global vm dirty limit setting by memcg (Re: [PATCH v3 00/11] memcg: per cgroup dirty page accounting
  2010-10-20  3:21 ` [PATCH][memcg+dirtylimit] Fix overwriting global vm dirty limit setting by memcg (Re: [PATCH v3 00/11] memcg: per cgroup dirty page accounting KAMEZAWA Hiroyuki
  2010-10-20  4:14   ` KAMEZAWA Hiroyuki
@ 2010-10-20  5:02   ` KAMEZAWA Hiroyuki
  2010-10-20  6:09     ` Daisuke Nishimura
                       ` (2 more replies)
  1 sibling, 3 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-20  5:02 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Greg Thelen, Andrew Morton, linux-kernel, linux-mm, containers,
	Andrea Righi, Balbir Singh, Daisuke Nishimura, Minchan Kim,
	Ciju Rajan K, David Rientjes

Fixed one here.
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Now, when calculating the dirty limit, vm_dirty_param() is called.
This function returns dirty-limit related parameters considering
memory cgroup settings.

Now, assume that vm_dirty_bytes=100M (global dirty limit) and the
memory cgroup has 1G of pages, a dirty_ratio of 40, and dirtyable memory of
500MB.

In this case, global_dirty_limits will consider the dirty limit to be
500MB * 0.4 = 200MB, exceeding the 100MB global setting. This is bad...
memory cgroup is not a back door.

This patch limits the return value of vm_dirty_param() considering the
global settings.

Changelog:
 - fixed the "mem" argument from int to u64
 - use the global available memory to cap the memcg's value

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h |    5 +++--
 mm/memcontrol.c            |   30 +++++++++++++++++++++++++++++-
 mm/page-writeback.c        |    3 ++-
 3 files changed, 34 insertions(+), 4 deletions(-)

Index: dirty_limit_new/mm/memcontrol.c
===================================================================
--- dirty_limit_new.orig/mm/memcontrol.c
+++ dirty_limit_new/mm/memcontrol.c
@@ -1171,9 +1171,11 @@ static void __mem_cgroup_dirty_param(str
  * can be moved after our access and writeback tends to take long time.  At
  * least, "memcg" will not be freed while holding rcu_read_lock().
  */
-void vm_dirty_param(struct vm_dirty_param *param)
+void vm_dirty_param(struct vm_dirty_param *param,
+	 u64 mem, u64 global)
 {
 	struct mem_cgroup *memcg;
+	u64 limit, bglimit;
 
 	if (mem_cgroup_disabled()) {
 		global_vm_dirty_param(param);
@@ -1183,6 +1185,32 @@ void vm_dirty_param(struct vm_dirty_para
 	rcu_read_lock();
 	memcg = mem_cgroup_from_task(current);
 	__mem_cgroup_dirty_param(param, memcg);
+	/*
+	 * A limitation under memory cgroup is under global vm, too.
+	 */
+	if (vm_dirty_ratio)
+		limit = global * vm_dirty_ratio / 100;
+	else
+		limit = vm_dirty_bytes;
+	if (param->dirty_ratio) {
+		param->dirty_bytes = mem * param->dirty_ratio / 100;
+		param->dirty_ratio = 0;
+	}
+	if (param->dirty_bytes > limit)
+		param->dirty_bytes = limit;
+
+	if (dirty_background_ratio)
+		bglimit = global * dirty_background_ratio / 100;
+	else
+		bglimit = dirty_background_bytes;
+
+	if (param->dirty_background_ratio) {
+		param->dirty_background_bytes =
+			mem * param->dirty_background_ratio / 100;
+		param->dirty_background_ratio = 0;
+	}
+	if (param->dirty_background_bytes > bglimit)
+		param->dirty_background_bytes = bglimit;
 	rcu_read_unlock();
 }
 
Index: dirty_limit_new/include/linux/memcontrol.h
===================================================================
--- dirty_limit_new.orig/include/linux/memcontrol.h
+++ dirty_limit_new/include/linux/memcontrol.h
@@ -171,7 +171,7 @@ static inline void mem_cgroup_dec_page_s
 }
 
 bool mem_cgroup_has_dirty_limit(void);
-void vm_dirty_param(struct vm_dirty_param *param);
+void vm_dirty_param(struct vm_dirty_param *param, u64 mem, u64 global);
 s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item);
 
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
@@ -360,7 +360,8 @@ static inline bool mem_cgroup_has_dirty_
 	return false;
 }
 
-static inline void vm_dirty_param(struct vm_dirty_param *param)
+static inline void vm_dirty_param(struct vm_dirty_param *param,
+		u64 mem, u64 global)
 {
 	global_vm_dirty_param(param);
 }
Index: dirty_limit_new/mm/page-writeback.c
===================================================================
--- dirty_limit_new.orig/mm/page-writeback.c
+++ dirty_limit_new/mm/page-writeback.c
@@ -466,7 +466,8 @@ void global_dirty_limits(unsigned long *
 	struct task_struct *tsk;
 	struct vm_dirty_param dirty_param;
 
-	vm_dirty_param(&dirty_param);
+	vm_dirty_param(&dirty_param,
+		available_memory, global_dirtyable_memory());
 
 	if (dirty_param.dirty_bytes)
 		dirty = DIV_ROUND_UP(dirty_param.dirty_bytes, PAGE_SIZE);


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 11/11] memcg: check memcg dirty limits in page writeback
  2010-10-19  0:39 ` [PATCH v3 11/11] memcg: check memcg dirty limits in page writeback Greg Thelen
  2010-10-19  1:00   ` KAMEZAWA Hiroyuki
@ 2010-10-20  5:25   ` Daisuke Nishimura
  1 sibling, 0 replies; 64+ messages in thread
From: Daisuke Nishimura @ 2010-10-20  5:25 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Minchan Kim, Ciju Rajan K,
	David Rientjes, Daisuke Nishimura

On Mon, 18 Oct 2010 17:39:44 -0700
Greg Thelen <gthelen@google.com> wrote:

> If the current process is in a non-root memcg, then
> global_dirty_limits() will consider the memcg dirty limit.
> This allows different cgroups to have distinct dirty limits
> which trigger direct and background writeback at different
> levels.
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>

Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2][memcg+dirtylimit] Fix  overwriting global vm dirty limit setting by memcg (Re: [PATCH v3 00/11] memcg: per cgroup dirty page accounting
  2010-10-20  5:02   ` [PATCH v2][memcg+dirtylimit] " KAMEZAWA Hiroyuki
@ 2010-10-20  6:09     ` Daisuke Nishimura
  2010-10-20 14:35     ` Minchan Kim
  2010-10-24 18:44     ` Greg Thelen
  2 siblings, 0 replies; 64+ messages in thread
From: Daisuke Nishimura @ 2010-10-20  6:09 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Greg Thelen, Andrew Morton, linux-kernel, linux-mm, containers,
	Andrea Righi, Balbir Singh, Minchan Kim, Ciju Rajan K,
	David Rientjes, Daisuke Nishimura

On Wed, 20 Oct 2010 14:02:55 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> Fixed one here.
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Now, at calculating dirty limit, vm_dirty_param() is called.
> This function returns dirty-limit related parameters considering
> memory cgroup settings.
> 
> Now, assume that vm_dirty_bytes=100M (global dirty limit) and
> memory cgroup has 1G of pages and 40 dirty_ratio, dirtyable memory is
> 500MB.
> 
> In this case, global_dirty_limits will consider dirty_limt as
> 500 *0.4 = 200MB. This is bad...memory cgroup is not back door.
> 
> This patch limits the return value of vm_dirty_param() considring
> global settings.
> 
> Changelog:
>  - fixed an argument "mem" int to u64
>  - fixed to use global available memory to cap memcg's value.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2][memcg+dirtylimit] Fix  overwriting global vm dirty limit setting by memcg (Re: [PATCH v3 00/11] memcg: per cgroup dirty page accounting
  2010-10-20  5:02   ` [PATCH v2][memcg+dirtylimit] " KAMEZAWA Hiroyuki
  2010-10-20  6:09     ` Daisuke Nishimura
@ 2010-10-20 14:35     ` Minchan Kim
  2010-10-21  0:10       ` KAMEZAWA Hiroyuki
  2010-10-24 18:44     ` Greg Thelen
  2 siblings, 1 reply; 64+ messages in thread
From: Minchan Kim @ 2010-10-20 14:35 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Greg Thelen, Andrew Morton, linux-kernel, linux-mm, containers,
	Andrea Righi, Balbir Singh, Daisuke Nishimura, Ciju Rajan K,
	David Rientjes

On Wed, Oct 20, 2010 at 02:02:55PM +0900, KAMEZAWA Hiroyuki wrote:
> Fixed one here.
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Now, at calculating dirty limit, vm_dirty_param() is called.
> This function returns dirty-limit related parameters considering
> memory cgroup settings.
> 
> Now, assume that vm_dirty_bytes=100M (global dirty limit) and
> memory cgroup has 1G of pages and 40 dirty_ratio, dirtyable memory is
> 500MB.
> 
> In this case, global_dirty_limits will consider dirty_limt as
> 500 *0.4 = 200MB. This is bad...memory cgroup is not back door.
> 
> This patch limits the return value of vm_dirty_param() considring
> global settings.
> 
> Changelog:
>  - fixed an argument "mem" int to u64
>  - fixed to use global available memory to cap memcg's value.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

This should be written in the Documentation:
"memcg dirty limit can't exceed global dirty limit"

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2][memcg+dirtylimit] Fix  overwriting global vm dirty limit setting by memcg (Re: [PATCH v3 00/11] memcg: per cgroup dirty page accounting
  2010-10-20 14:35     ` Minchan Kim
@ 2010-10-21  0:10       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-21  0:10 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Greg Thelen, Andrew Morton, linux-kernel, linux-mm, containers,
	Andrea Righi, Balbir Singh, Daisuke Nishimura, Ciju Rajan K,
	David Rientjes

On Wed, 20 Oct 2010 23:35:15 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:

> On Wed, Oct 20, 2010 at 02:02:55PM +0900, KAMEZAWA Hiroyuki wrote:
> > Fixed one here.
> > ==
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > Now, at calculating dirty limit, vm_dirty_param() is called.
> > This function returns dirty-limit related parameters considering
> > memory cgroup settings.
> > 
> > Now, assume that vm_dirty_bytes=100M (global dirty limit) and
> > memory cgroup has 1G of pages and 40 dirty_ratio, dirtyable memory is
> > 500MB.
> > 
> > In this case, global_dirty_limits will consider dirty_limt as
> > 500 *0.4 = 200MB. This is bad...memory cgroup is not back door.
> > 
> > This patch limits the return value of vm_dirty_param() considring
> > global settings.
> > 
> > Changelog:
> >  - fixed an argument "mem" int to u64
> >  - fixed to use global available memory to cap memcg's value.
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> 
> It should have written this on Documentation.
> "memcg dirty limit can't exceed global dirty limit"
> 
Sure. Anyway, we need to review & rewrite the Documentation after the dirty
limit series is merged. (I think Greg will do much of it.)

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2][memcg+dirtylimit] Fix  overwriting global vm dirty limit setting by memcg (Re: [PATCH v3 00/11] memcg: per cgroup dirty page accounting
  2010-10-20  5:02   ` [PATCH v2][memcg+dirtylimit] " KAMEZAWA Hiroyuki
  2010-10-20  6:09     ` Daisuke Nishimura
  2010-10-20 14:35     ` Minchan Kim
@ 2010-10-24 18:44     ` Greg Thelen
  2010-10-25  0:24       ` KAMEZAWA Hiroyuki
                         ` (2 more replies)
  2 siblings, 3 replies; 64+ messages in thread
From: Greg Thelen @ 2010-10-24 18:44 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes

KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:

> Fixed one here.
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> Now, at calculating dirty limit, vm_dirty_param() is called.
> This function returns dirty-limit related parameters considering
> memory cgroup settings.
>
> Now, assume that vm_dirty_bytes=100M (global dirty limit) and
> memory cgroup has 1G of pages and 40 dirty_ratio, dirtyable memory is
> 500MB.
>
> In this case, global_dirty_limits will consider dirty_limt as
> 500 *0.4 = 200MB. This is bad...memory cgroup is not back door.
>
> This patch limits the return value of vm_dirty_param() considring
> global settings.
>
> Changelog:
>  - fixed an argument "mem" int to u64
>  - fixed to use global available memory to cap memcg's value.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/memcontrol.h |    5 +++--
>  mm/memcontrol.c            |   30 +++++++++++++++++++++++++++++-
>  mm/page-writeback.c        |    3 ++-
>  3 files changed, 34 insertions(+), 4 deletions(-)
>
> Index: dirty_limit_new/mm/memcontrol.c
> ===================================================================
> --- dirty_limit_new.orig/mm/memcontrol.c
> +++ dirty_limit_new/mm/memcontrol.c
> @@ -1171,9 +1171,11 @@ static void __mem_cgroup_dirty_param(str
>   * can be moved after our access and writeback tends to take long time.  At
>   * least, "memcg" will not be freed while holding rcu_read_lock().
>   */
> -void vm_dirty_param(struct vm_dirty_param *param)
> +void vm_dirty_param(struct vm_dirty_param *param,
> +	 u64 mem, u64 global)
>  {
>  	struct mem_cgroup *memcg;
> +	u64 limit, bglimit;
>  
>  	if (mem_cgroup_disabled()) {
>  		global_vm_dirty_param(param);
> @@ -1183,6 +1185,32 @@ void vm_dirty_param(struct vm_dirty_para
>  	rcu_read_lock();
>  	memcg = mem_cgroup_from_task(current);
>  	__mem_cgroup_dirty_param(param, memcg);
> +	/*
> +	 * A limitation under memory cgroup is under global vm, too.
> +	 */
> +	if (vm_dirty_ratio)
> +		limit = global * vm_dirty_ratio / 100;
> +	else
> +		limit = vm_dirty_bytes;
> +	if (param->dirty_ratio) {
> +		param->dirty_bytes = mem * param->dirty_ratio / 100;
> +		param->dirty_ratio = 0;
> +	}
> +	if (param->dirty_bytes > limit)
> +		param->dirty_bytes = limit;
> +
> +	if (dirty_background_ratio)
> +		bglimit = global * dirty_background_ratio / 100;
> +	else
> +		bglimit = dirty_background_bytes;
> +
> +	if (param->dirty_background_ratio) {
> +		param->dirty_background_bytes =
> +			mem * param->dirty_background_ratio / 100;
> +		param->dirty_background_ratio = 0;
> +	}
> +	if (param->dirty_background_bytes > bglimit)
> +		param->dirty_background_bytes = bglimit;
>  	rcu_read_unlock();
>  }
>  
> Index: dirty_limit_new/include/linux/memcontrol.h
> ===================================================================
> --- dirty_limit_new.orig/include/linux/memcontrol.h
> +++ dirty_limit_new/include/linux/memcontrol.h
> @@ -171,7 +171,7 @@ static inline void mem_cgroup_dec_page_s
>  }
>  
>  bool mem_cgroup_has_dirty_limit(void);
> -void vm_dirty_param(struct vm_dirty_param *param);
> +void vm_dirty_param(struct vm_dirty_param *param, u64 mem, u64 global);
>  s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item);
>  
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> @@ -360,7 +360,8 @@ static inline bool mem_cgroup_has_dirty_
>  	return false;
>  }
>  
> -static inline void vm_dirty_param(struct vm_dirty_param *param)
> +static inline void vm_dirty_param(struct vm_dirty_param *param,
> +		u64 mem, u64 global)
>  {
>  	global_vm_dirty_param(param);
>  }
> Index: dirty_limit_new/mm/page-writeback.c
> ===================================================================
> --- dirty_limit_new.orig/mm/page-writeback.c
> +++ dirty_limit_new/mm/page-writeback.c
> @@ -466,7 +466,8 @@ void global_dirty_limits(unsigned long *
>  	struct task_struct *tsk;
>  	struct vm_dirty_param dirty_param;
>  
> -	vm_dirty_param(&dirty_param);
> +	vm_dirty_param(&dirty_param,
> +		available_memory, global_dirtyable_memory());
>  
>  	if (dirty_param.dirty_bytes)
>  		dirty = DIV_ROUND_UP(dirty_param.dirty_bytes, PAGE_SIZE);

I think there is a problem with the patch above.  In the patch,
vm_dirty_param() sets param->dirty_[background_]bytes to the smaller of
the memcg and global limits.  Assuming the current
task is in a memcg, the memcg (not system-wide) dirty usage is then
always compared to the selected limits (which may be per-memcg or
system-wide).  The problem is that if:
a) the per-memcg dirty limit is smaller than the system limit, so
   vm_dirty_param() selects the per-memcg dirty limit, and
b) per-memcg dirty usage is well below the memcg dirty limit, and
c) system usage is at the system limit,
then the above patch will not trigger writeback.  Example with two
memcg:
         sys
        B   C
      
      limit  usage
  sys  10     10
   B    7      6
   C    5      4

If B wants to grow, the system will exceed the system limit of 10 and should
be throttled.  However, the smaller limit (7) will be selected and
applied to the memcg usage (6), which indicates no need to throttle, so the
system could get as bad as:

      limit  usage
  sys  10     12
   B    7      7
   C    5      5

In this case the system usage exceeds the system limit because each of
the per-memcg checks sees no per-memcg problem.

To solve this I propose we create a new structure to aggregate both
dirty limit and usage data:
	struct dirty_limits {
	       unsigned long dirty_thresh;
	       unsigned long background_thresh;
	       unsigned long nr_reclaimable;
	       unsigned long nr_writeback;
	};

global_dirty_limits() would then query both the global and memcg limits,
and use the dirty usage of whichever one is closest to its limit.  This
change makes global_dirty_limits() look like:

void global_dirty_limits(struct dirty_limits *limits)
{
	unsigned long background;
	unsigned long dirty;
	unsigned long nr_reclaimable;
	unsigned long nr_writeback;
	unsigned long available_memory = determine_dirtyable_memory();
	struct task_struct *tsk;

	if (vm_dirty_bytes)
		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
	else
		dirty = (vm_dirty_ratio * available_memory) / 100;

	if (dirty_background_bytes)
		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
	else
		background = (dirty_background_ratio * available_memory) / 100;

	nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
				global_page_state(NR_UNSTABLE_NFS);
	nr_writeback = global_page_state(NR_WRITEBACK);

	if (mem_cgroup_dirty_limits(available_memory, limits) &&
	    dirty_available(limits->dirty_thresh, limits->nr_reclaimable,
			    limits->nr_writeback) <
	    dirty_available(dirty, nr_reclaimable, nr_writeback)) {
		dirty = min(dirty, limits->dirty_thresh);
		background = min(background, limits->background_thresh);
	} else {
		limits->nr_reclaimable = nr_reclaimable;
		limits->nr_writeback = nr_writeback;
	}

	if (background >= dirty)
		background = dirty / 2;
	tsk = current;
	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
		background += background / 4;
		dirty += dirty / 4;
	}
	limits->background_thresh = background;
	limits->dirty_thresh = dirty;
}
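
(dirty_available() is not shown here; presumably it is a small helper that
returns the remaining headroom under a threshold, along the lines of the
following sketch, which is an assumption rather than the posted code:)

static unsigned long dirty_available(unsigned long thresh,
				     unsigned long nr_reclaimable,
				     unsigned long nr_writeback)
{
	unsigned long used = nr_reclaimable + nr_writeback;

	/* pages that may still be dirtied before hitting thresh */
	return thresh > used ? thresh - used : 0;
}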

Because this approach considered both memcg and system limits, the
problem described above is avoided.

I have this change integrated into the memcg dirty limit series (-v3 was
the last post; v4 is almost ready with this change).  I will post -v4
with this approach if there is no strong objection.

--
Greg

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2][memcg+dirtylimit] Fix  overwriting global vm dirty limit setting by memcg (Re: [PATCH v3 00/11] memcg: per cgroup dirty page accounting
  2010-10-24 18:44     ` Greg Thelen
@ 2010-10-25  0:24       ` KAMEZAWA Hiroyuki
  2010-10-25  2:00       ` Daisuke Nishimura
  2010-10-25  7:03       ` Ciju Rajan K
  2 siblings, 0 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-25  0:24 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura, Minchan Kim, Ciju Rajan K,
	David Rientjes

On Sun, 24 Oct 2010 11:44:38 -0700
Greg Thelen <gthelen@google.com> wrote:

> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:
> 
> > Fixed one here.
> > ==
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >
> > Now, at calculating dirty limit, vm_dirty_param() is called.
> > This function returns dirty-limit related parameters considering
> > memory cgroup settings.
> >
> > Now, assume that vm_dirty_bytes=100M (global dirty limit) and
> > memory cgroup has 1G of pages and 40 dirty_ratio, dirtyable memory is
> > 500MB.
> >
> > In this case, global_dirty_limits will consider dirty_limt as
> > 500 *0.4 = 200MB. This is bad...memory cgroup is not back door.
> >
> > This patch limits the return value of vm_dirty_param() considring
> > global settings.
> >
> > Changelog:
> >  - fixed an argument "mem" int to u64
> >  - fixed to use global available memory to cap memcg's value.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  include/linux/memcontrol.h |    5 +++--
> >  mm/memcontrol.c            |   30 +++++++++++++++++++++++++++++-
> >  mm/page-writeback.c        |    3 ++-
> >  3 files changed, 34 insertions(+), 4 deletions(-)
> >
> > Index: dirty_limit_new/mm/memcontrol.c
> > ===================================================================
> > --- dirty_limit_new.orig/mm/memcontrol.c
> > +++ dirty_limit_new/mm/memcontrol.c
> > @@ -1171,9 +1171,11 @@ static void __mem_cgroup_dirty_param(str
> >   * can be moved after our access and writeback tends to take long time.  At
> >   * least, "memcg" will not be freed while holding rcu_read_lock().
> >   */
> > -void vm_dirty_param(struct vm_dirty_param *param)
> > +void vm_dirty_param(struct vm_dirty_param *param,
> > +	 u64 mem, u64 global)
> >  {
> >  	struct mem_cgroup *memcg;
> > +	u64 limit, bglimit;
> >  
> >  	if (mem_cgroup_disabled()) {
> >  		global_vm_dirty_param(param);
> > @@ -1183,6 +1185,32 @@ void vm_dirty_param(struct vm_dirty_para
> >  	rcu_read_lock();
> >  	memcg = mem_cgroup_from_task(current);
> >  	__mem_cgroup_dirty_param(param, memcg);
> > +	/*
> > +	 * A limitation under memory cgroup is under global vm, too.
> > +	 */
> > +	if (vm_dirty_ratio)
> > +		limit = global * vm_dirty_ratio / 100;
> > +	else
> > +		limit = vm_dirty_bytes;
> > +	if (param->dirty_ratio) {
> > +		param->dirty_bytes = mem * param->dirty_ratio / 100;
> > +		param->dirty_ratio = 0;
> > +	}
> > +	if (param->dirty_bytes > limit)
> > +		param->dirty_bytes = limit;
> > +
> > +	if (dirty_background_ratio)
> > +		bglimit = global * dirty_background_ratio / 100;
> > +	else
> > +		bglimit = dirty_background_bytes;
> > +
> > +	if (param->dirty_background_ratio) {
> > +		param->dirty_background_bytes =
> > +			mem * param->dirty_background_ratio / 100;
> > +		param->dirty_background_ratio = 0;
> > +	}
> > +	if (param->dirty_background_bytes > bglimit)
> > +		param->dirty_background_bytes = bglimit;
> >  	rcu_read_unlock();
> >  }
> >  
> > Index: dirty_limit_new/include/linux/memcontrol.h
> > ===================================================================
> > --- dirty_limit_new.orig/include/linux/memcontrol.h
> > +++ dirty_limit_new/include/linux/memcontrol.h
> > @@ -171,7 +171,7 @@ static inline void mem_cgroup_dec_page_s
> >  }
> >  
> >  bool mem_cgroup_has_dirty_limit(void);
> > -void vm_dirty_param(struct vm_dirty_param *param);
> > +void vm_dirty_param(struct vm_dirty_param *param, u64 mem, u64 global);
> >  s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item);
> >  
> >  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> > @@ -360,7 +360,8 @@ static inline bool mem_cgroup_has_dirty_
> >  	return false;
> >  }
> >  
> > -static inline void vm_dirty_param(struct vm_dirty_param *param)
> > +static inline void vm_dirty_param(struct vm_dirty_param *param,
> > +		u64 mem, u64 global)
> >  {
> >  	global_vm_dirty_param(param);
> >  }
> > Index: dirty_limit_new/mm/page-writeback.c
> > ===================================================================
> > --- dirty_limit_new.orig/mm/page-writeback.c
> > +++ dirty_limit_new/mm/page-writeback.c
> > @@ -466,7 +466,8 @@ void global_dirty_limits(unsigned long *
> >  	struct task_struct *tsk;
> >  	struct vm_dirty_param dirty_param;
> >  
> > -	vm_dirty_param(&dirty_param);
> > +	vm_dirty_param(&dirty_param,
> > +		available_memory, global_dirtyable_memory());
> >  
> >  	if (dirty_param.dirty_bytes)
> >  		dirty = DIV_ROUND_UP(dirty_param.dirty_bytes, PAGE_SIZE);
> 
> I think there is a problem with the patch above.  In the patch
> vm_dirty_param() sets param->dirty_[background_]bytes to the smallest
> limits considering the memcg and global limits.  Assuming the current
> task is in a memcg, then the memcg dirty (not system-wide) usage is
> always compared to the selected limits (which may be per-memcg or
> system).  The problem is that if:
> a) per-memcg dirty limit is smaller than system then vm_dirty_param()
>    will select per-memcg dirty limit, and
> b) per-memcg dirty usage is well below memcg dirty limit, and
> b) system usage is at system limit
> Then the above patch will not trigger writeback.  Example with two
> memcg:
>          sys
>         B   C
>       
>       limit  usage
>   sys  10     10
>    B    7      6
>    C    5      4
> 
> If B wants to grow, the system will exceed system limit of 10 and should
> be throttled.  However, the smaller limit (7) will be selected and
> applied to memcg usage (6), which indicates no need to throttle, so the
> system could get as bad as:
> 
>       limit  usage
>   sys  10     12
>    B    7      7
>    C    5      5
> 
> In this case the system usage exceeds the system limit because each
> the per-memcg checks see no per-memcg problems.
> 

ok.

> To solve this I propose we create a new structure to aggregate both
> dirty limit and usage data:
> 	struct dirty_limits {
> 	       unsigned long dirty_thresh;
> 	       unsigned long background_thresh;
> 	       unsigned long nr_reclaimable;
> 	       unsigned long nr_writeback;
> 	};
> 
> global_dirty_limits() would then query both the global and memcg limits
> and dirty usage of one that is closest to its limit.  This change makes
> global_dirty_limits() look like:
> 
> void global_dirty_limits(struct dirty_limits *limits)
> {
> 	unsigned long background;
> 	unsigned long dirty;
> 	unsigned long nr_reclaimable;
> 	unsigned long nr_writeback;
> 	unsigned long available_memory = determine_dirtyable_memory();
> 	struct task_struct *tsk;
> 
> 	if (vm_dirty_bytes)
> 		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> 	else
> 		dirty = (vm_dirty_ratio * available_memory) / 100;
> 
> 	if (dirty_background_bytes)
> 		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> 	else
> 		background = (dirty_background_ratio * available_memory) / 100;
> 
> 	nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> 				global_page_state(NR_UNSTABLE_NFS);
> 	nr_writeback = global_page_state(NR_WRITEBACK);
> 
> 	if (mem_cgroup_dirty_limits(available_memory, limits) &&
> 	    dirty_available(limits->dirty_thresh, limits->nr_reclaimable,
> 			    limits->nr_writeback) <
> 	    dirty_available(dirty, nr_reclaimable, nr_writeback)) {
> 		dirty = min(dirty, limits->dirty_thresh);
> 		background = min(background, limits->background_thresh);
> 	} else {
> 		limits->nr_reclaimable = nr_reclaimable;
> 		limits->nr_writeback = nr_writeback;
> 	}
> 

Hmm. I think it's better to add enough comments.

Or, easier logic would be a simple double check (a rough sketch follows below):

1) calculate the global dirty limit (always done).
	If usage already hits the global dirty limit, skip the step below.
2) if under a memcg, calculate the memcg dirty limit
   and check against the memcg dirty limit.
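
In other words, something along these lines (a sketch only;
global_only_dirty_limits(), mem_cgroup_nr_dirty_pages() and
mem_cgroup_dirty_thresh() are illustrative names, not existing functions):

static bool over_dirty_limits(void)
{
	unsigned long background_thresh, dirty_thresh;
	unsigned long nr_dirty = global_page_state(NR_FILE_DIRTY) +
				 global_page_state(NR_UNSTABLE_NFS) +
				 global_page_state(NR_WRITEBACK);

	/* 1) always check global usage against the global limit */
	global_only_dirty_limits(&background_thresh, &dirty_thresh);
	if (nr_dirty > dirty_thresh)
		return true;

	/* 2) only if the global limit is not hit, check the memcg limit */
	if (mem_cgroup_has_dirty_limit())
		return mem_cgroup_nr_dirty_pages() > mem_cgroup_dirty_thresh();

	return false;
}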





> 	if (background >= dirty)
> 		background = dirty / 2;
> 	tsk = current;
> 	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
> 		background += background / 4;
> 		dirty += dirty / 4;
> 	}
> 	limits->background_thresh = background;
> 	limits->dirty_thresh = dirty;
> }
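
For reference, dirty_available() above is assumed to return the headroom
below a threshold, so the comparison can apply the limits of whichever
context (global or memcg) is closer to its limit.  A minimal sketch under
that assumption (not code from the series):

/*
 * Sketch only: how many more pages may be dirtied before "thresh" is
 * reached, given the current reclaimable and writeback counts.
 */
static unsigned long dirty_available(unsigned long thresh,
				     unsigned long nr_reclaimable,
				     unsigned long nr_writeback)
{
	unsigned long used = nr_reclaimable + nr_writeback;

	return thresh > used ? thresh - used : 0;
}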
> 
> Because this approach considers both the memcg and system limits, the
> problem described above is avoided.
> 
> I have this change integrated into the memcg dirty limit series (-v3 was
> the last post; v4 is almost ready with this change).  I will post -v4
> with this approach if there is no strong objection.
> 
Sure. Thank you.

-Kame


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2][memcg+dirtylimit] Fix  overwriting global vm dirty limit setting by memcg (Re: [PATCH v3 00/11] memcg: per cgroup dirty page accounting
  2010-10-24 18:44     ` Greg Thelen
  2010-10-25  0:24       ` KAMEZAWA Hiroyuki
@ 2010-10-25  2:00       ` Daisuke Nishimura
  2010-10-25  7:03       ` Ciju Rajan K
  2 siblings, 0 replies; 64+ messages in thread
From: Daisuke Nishimura @ 2010-10-25  2:00 UTC (permalink / raw)
  To: Greg Thelen
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, linux-kernel, linux-mm,
	containers, Andrea Righi, Balbir Singh, Minchan Kim,
	Ciju Rajan K, David Rientjes, Daisuke Nishimura

On Sun, 24 Oct 2010 11:44:38 -0700
Greg Thelen <gthelen@google.com> wrote:

> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:
> 
> > Fixed one here.
> > ==
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >
> > Now, at calculating dirty limit, vm_dirty_param() is called.
> > This function returns dirty-limit related parameters considering
> > memory cgroup settings.
> >
> > Now, assume that vm_dirty_bytes=100M (global dirty limit) and
> > memory cgroup has 1G of pages and 40 dirty_ratio, dirtyable memory is
> > 500MB.
> >
> > In this case, global_dirty_limits will consider dirty_limit as
> > 500 * 0.4 = 200MB. This is bad...memory cgroup is not a back door.
> >
> > This patch limits the return value of vm_dirty_param() considering
> > global settings.
> >
> > Changelog:
> >  - fixed an argument "mem" int to u64
> >  - fixed to use global available memory to cap memcg's value.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  include/linux/memcontrol.h |    5 +++--
> >  mm/memcontrol.c            |   30 +++++++++++++++++++++++++++++-
> >  mm/page-writeback.c        |    3 ++-
> >  3 files changed, 34 insertions(+), 4 deletions(-)
> >
> > Index: dirty_limit_new/mm/memcontrol.c
> > ===================================================================
> > --- dirty_limit_new.orig/mm/memcontrol.c
> > +++ dirty_limit_new/mm/memcontrol.c
> > @@ -1171,9 +1171,11 @@ static void __mem_cgroup_dirty_param(str
> >   * can be moved after our access and writeback tends to take long time.  At
> >   * least, "memcg" will not be freed while holding rcu_read_lock().
> >   */
> > -void vm_dirty_param(struct vm_dirty_param *param)
> > +void vm_dirty_param(struct vm_dirty_param *param,
> > +	 u64 mem, u64 global)
> >  {
> >  	struct mem_cgroup *memcg;
> > +	u64 limit, bglimit;
> >  
> >  	if (mem_cgroup_disabled()) {
> >  		global_vm_dirty_param(param);
> > @@ -1183,6 +1185,32 @@ void vm_dirty_param(struct vm_dirty_para
> >  	rcu_read_lock();
> >  	memcg = mem_cgroup_from_task(current);
> >  	__mem_cgroup_dirty_param(param, memcg);
> > +	/*
> > +	 * A limitation under memory cgroup is under global vm, too.
> > +	 */
> > +	if (vm_dirty_ratio)
> > +		limit = global * vm_dirty_ratio / 100;
> > +	else
> > +		limit = vm_dirty_bytes;
> > +	if (param->dirty_ratio) {
> > +		param->dirty_bytes = mem * param->dirty_ratio / 100;
> > +		param->dirty_ratio = 0;
> > +	}
> > +	if (param->dirty_bytes > limit)
> > +		param->dirty_bytes = limit;
> > +
> > +	if (dirty_background_ratio)
> > +		bglimit = global * dirty_background_ratio / 100;
> > +	else
> > +		bglimit = dirty_background_bytes;
> > +
> > +	if (param->dirty_background_ratio) {
> > +		param->dirty_background_bytes =
> > +			mem * param->dirty_background_ratio / 100;
> > +		param->dirty_background_ratio = 0;
> > +	}
> > +	if (param->dirty_background_bytes > bglimit)
> > +		param->dirty_background_bytes = bglimit;
> >  	rcu_read_unlock();
> >  }
> >  
> > Index: dirty_limit_new/include/linux/memcontrol.h
> > ===================================================================
> > --- dirty_limit_new.orig/include/linux/memcontrol.h
> > +++ dirty_limit_new/include/linux/memcontrol.h
> > @@ -171,7 +171,7 @@ static inline void mem_cgroup_dec_page_s
> >  }
> >  
> >  bool mem_cgroup_has_dirty_limit(void);
> > -void vm_dirty_param(struct vm_dirty_param *param);
> > +void vm_dirty_param(struct vm_dirty_param *param, u64 mem, u64 global);
> >  s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item);
> >  
> >  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> > @@ -360,7 +360,8 @@ static inline bool mem_cgroup_has_dirty_
> >  	return false;
> >  }
> >  
> > -static inline void vm_dirty_param(struct vm_dirty_param *param)
> > +static inline void vm_dirty_param(struct vm_dirty_param *param,
> > +		u64 mem, u64 global)
> >  {
> >  	global_vm_dirty_param(param);
> >  }
> > Index: dirty_limit_new/mm/page-writeback.c
> > ===================================================================
> > --- dirty_limit_new.orig/mm/page-writeback.c
> > +++ dirty_limit_new/mm/page-writeback.c
> > @@ -466,7 +466,8 @@ void global_dirty_limits(unsigned long *
> >  	struct task_struct *tsk;
> >  	struct vm_dirty_param dirty_param;
> >  
> > -	vm_dirty_param(&dirty_param);
> > +	vm_dirty_param(&dirty_param,
> > +		available_memory, global_dirtyable_memory());
> >  
> >  	if (dirty_param.dirty_bytes)
> >  		dirty = DIV_ROUND_UP(dirty_param.dirty_bytes, PAGE_SIZE);
> 
> I think there is a problem with the patch above.  In the patch
> vm_dirty_param() sets param->dirty_[background_]bytes to the smallest
> limits considering the memcg and global limits.  Assuming the current
> task is in a memcg, then the memcg dirty (not system-wide) usage is
> always compared to the selected limits (which may be per-memcg or
> system).  The problem is that if:
> a) the per-memcg dirty limit is smaller than the system limit, so
>    vm_dirty_param() will select the per-memcg dirty limit, and
> b) per-memcg dirty usage is well below the memcg dirty limit, and
> c) system usage is at the system limit.
> Then the above patch will not trigger writeback.  Example with two
> memcg:
>          sys
>         B   C
>       
>       limit  usage
>   sys  10     10
>    B    7      6
>    C    5      4
> 
> If B wants to grow, the system will exceed system limit of 10 and should
> be throttled.  However, the smaller limit (7) will be selected and
> applied to memcg usage (6), which indicates no need to throttle, so the
> system could get as bad as:
> 
>       limit  usage
>   sys  10     12
>    B    7      7
>    C    5      5
> 
> In this case the system usage exceeds the system limit because each of
> the per-memcg checks sees no per-memcg problem.
> 
Nice catch!

> To solve this I propose we create a new structure to aggregate both
> dirty limit and usage data:
> 	struct dirty_limits {
> 	       unsigned long dirty_thresh;
> 	       unsigned long background_thresh;
> 	       unsigned long nr_reclaimable;
> 	       unsigned long nr_writeback;
> 	};
> 
> global_dirty_limits() would then query both the global and memcg limits
> and dirty usage of one that is closest to its limit.  This change makes
> global_dirty_limits() look like:
> 
> void global_dirty_limits(struct dirty_limits *limits)
> {
> 	unsigned long background;
> 	unsigned long dirty;
> 	unsigned long nr_reclaimable;
> 	unsigned long nr_writeback;
> 	unsigned long available_memory = determine_dirtyable_memory();
> 	struct task_struct *tsk;
> 
> 	if (vm_dirty_bytes)
> 		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> 	else
> 		dirty = (vm_dirty_ratio * available_memory) / 100;
> 
> 	if (dirty_background_bytes)
> 		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> 	else
> 		background = (dirty_background_ratio * available_memory) / 100;
> 
> 	nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> 				global_page_state(NR_UNSTABLE_NFS);
> 	nr_writeback = global_page_state(NR_WRITEBACK);
> 
> 	if (mem_cgroup_dirty_limits(available_memory, limits) &&
> 	    dirty_available(limits->dirty_thresh, limits->nr_reclaimable,
> 			    limits->nr_writeback) <
> 	    dirty_available(dirty, nr_reclaimable, nr_writeback)) {
> 		dirty = min(dirty, limits->dirty_thresh);
> 		background = min(background, limits->background_thresh);
> 	} else {
> 		limits->nr_reclaimable = nr_reclaimable;
> 		limits->nr_writeback = nr_writeback;
> 	}
> 
> 	if (background >= dirty)
> 		background = dirty / 2;
> 	tsk = current;
> 	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
> 		background += background / 4;
> 		dirty += dirty / 4;
> 	}
> 	limits->background_thresh = background;
> 	limits->dirty_thresh = dirty;
> }
> 
> Because this approach considers both the memcg and system limits, the
> problem described above is avoided.
> 
> I have this change integrated into the memcg dirty limit series (-v3 was
> the last post; v4 is almost ready with this change).  I will post -v4
> with this approach if there is no strong objection.
> 
I have one concern now.

I've noticed that global_dirty_limits() is called from bdi_debug_stats_show(),
so the output of "cat /sys/kernel/debug/bdi/.../stats" changes depending
on the memcg in which the command is executed.

Can you take that into account too?

Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2][memcg+dirtylimit] Fix  overwriting global vm dirty limit setting by memcg (Re: [PATCH v3 00/11] memcg: per cgroup dirty page accounting
  2010-10-24 18:44     ` Greg Thelen
  2010-10-25  0:24       ` KAMEZAWA Hiroyuki
  2010-10-25  2:00       ` Daisuke Nishimura
@ 2010-10-25  7:03       ` Ciju Rajan K
  2010-10-25  7:08         ` KAMEZAWA Hiroyuki
  2 siblings, 1 reply; 64+ messages in thread
From: Ciju Rajan K @ 2010-10-25  7:03 UTC (permalink / raw)
  To: Greg Thelen
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, linux-kernel, linux-mm,
	containers, Andrea Righi, Balbir Singh, Daisuke Nishimura,
	Minchan Kim, David Rientjes, Ciju Rajan K

Greg Thelen wrote:
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:
>
>   
>> Fixed one here.
>> ==
>> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>
>> Now, at calculating dirty limit, vm_dirty_param() is called.
>> This function returns dirty-limit related parameters considering
>> memory cgroup settings.
>>
>> Now, assume that vm_dirty_bytes=100M (global dirty limit) and
>> memory cgroup has 1G of pages and 40 dirty_ratio, dirtyable memory is
>> 500MB.
>>
>> In this case, global_dirty_limits will consider dirty_limit as
>> 500 * 0.4 = 200MB. This is bad...memory cgroup is not a back door.
>>
>> This patch limits the return value of vm_dirty_param() considering
>> global settings.
>>
>> Changelog:
>>  - fixed an argument "mem" int to u64
>>  - fixed to use global available memory to cap memcg's value.
>>
>> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> ---
>>  include/linux/memcontrol.h |    5 +++--
>>  mm/memcontrol.c            |   30 +++++++++++++++++++++++++++++-
>>  mm/page-writeback.c        |    3 ++-
>>  3 files changed, 34 insertions(+), 4 deletions(-)
>>
>> Index: dirty_limit_new/mm/memcontrol.c
>> ===================================================================
>> --- dirty_limit_new.orig/mm/memcontrol.c
>> +++ dirty_limit_new/mm/memcontrol.c
>> @@ -1171,9 +1171,11 @@ static void __mem_cgroup_dirty_param(str
>>   * can be moved after our access and writeback tends to take long time.  At
>>   * least, "memcg" will not be freed while holding rcu_read_lock().
>>   */
>> -void vm_dirty_param(struct vm_dirty_param *param)
>> +void vm_dirty_param(struct vm_dirty_param *param,
>> +	 u64 mem, u64 global)
>>  {
>>  	struct mem_cgroup *memcg;
>> +	u64 limit, bglimit;
>>  
>>  	if (mem_cgroup_disabled()) {
>>  		global_vm_dirty_param(param);
>> @@ -1183,6 +1185,32 @@ void vm_dirty_param(struct vm_dirty_para
>>  	rcu_read_lock();
>>  	memcg = mem_cgroup_from_task(current);
>>  	__mem_cgroup_dirty_param(param, memcg);
>> +	/*
>> +	 * A limitation under memory cgroup is under global vm, too.
>> +	 */
>> +	if (vm_dirty_ratio)
>> +		limit = global * vm_dirty_ratio / 100;
>> +	else
>> +		limit = vm_dirty_bytes;
>> +	if (param->dirty_ratio) {
>> +		param->dirty_bytes = mem * param->dirty_ratio / 100;
>> +		param->dirty_ratio = 0;
>> +	}
>> +	if (param->dirty_bytes > limit)
>> +		param->dirty_bytes = limit;
>> +
>> +	if (dirty_background_ratio)
>> +		bglimit = global * dirty_background_ratio / 100;
>> +	else
>> +		bglimit = dirty_background_bytes;
>> +
>> +	if (param->dirty_background_ratio) {
>> +		param->dirty_background_bytes =
>> +			mem * param->dirty_background_ratio / 100;
>> +		param->dirty_background_ratio = 0;
>> +	}
>> +	if (param->dirty_background_bytes > bglimit)
>> +		param->dirty_background_bytes = bglimit;
>>  	rcu_read_unlock();
>>  }
>>  
>> Index: dirty_limit_new/include/linux/memcontrol.h
>> ===================================================================
>> --- dirty_limit_new.orig/include/linux/memcontrol.h
>> +++ dirty_limit_new/include/linux/memcontrol.h
>> @@ -171,7 +171,7 @@ static inline void mem_cgroup_dec_page_s
>>  }
>>  
>>  bool mem_cgroup_has_dirty_limit(void);
>> -void vm_dirty_param(struct vm_dirty_param *param);
>> +void vm_dirty_param(struct vm_dirty_param *param, u64 mem, u64 global);
>>  s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item);
>>  
>>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>> @@ -360,7 +360,8 @@ static inline bool mem_cgroup_has_dirty_
>>  	return false;
>>  }
>>  
>> -static inline void vm_dirty_param(struct vm_dirty_param *param)
>> +static inline void vm_dirty_param(struct vm_dirty_param *param,
>> +		u64 mem, u64 global)
>>  {
>>  	global_vm_dirty_param(param);
>>  }
>> Index: dirty_limit_new/mm/page-writeback.c
>> ===================================================================
>> --- dirty_limit_new.orig/mm/page-writeback.c
>> +++ dirty_limit_new/mm/page-writeback.c
>> @@ -466,7 +466,8 @@ void global_dirty_limits(unsigned long *
>>  	struct task_struct *tsk;
>>  	struct vm_dirty_param dirty_param;
>>  
>> -	vm_dirty_param(&dirty_param);
>> +	vm_dirty_param(&dirty_param,
>> +		available_memory, global_dirtyable_memory());
>>  
>>  	if (dirty_param.dirty_bytes)
>>  		dirty = DIV_ROUND_UP(dirty_param.dirty_bytes, PAGE_SIZE);
>>     
>
> I think there is a problem with the patch above.  In the patch
> vm_dirty_param() sets param->dirty_[background_]bytes to the smallest
> limits considering the memcg and global limits.  Assuming the current
> task is in a memcg, then the memcg dirty (not system-wide) usage is
> always compared to the selected limits (which may be per-memcg or
> system).  The problem is that if:
> a) the per-memcg dirty limit is smaller than the system limit, so
>    vm_dirty_param() will select the per-memcg dirty limit, and
> b) per-memcg dirty usage is well below the memcg dirty limit, and
> c) system usage is at the system limit.
> Then the above patch will not trigger writeback.  Example with two
> memcg:
>          sys
>         B   C
>       
>       limit  usage
>   sys  10     10
>    B    7      6
>    C    5      4
>
> If B wants to grow, the system will exceed system limit of 10 and should
> be throttled.  However, the smaller limit (7) will be selected and
> applied to memcg usage (6), which indicates no need to throttle, so the
> system could get as bad as:
>
>       limit  usage
>   sys  10     12
>    B    7      7
>    C    5      5
>
> In this case the system usage exceeds the system limit because each of
> the per-memcg checks sees no per-memcg problem.
>
>   
What about the following scenarios?

a)
      limit  usage
  sys   9      7
   B    8      6
   A    4      1

Now assume B consumes 2 more.  The total of B reaches 8 (its memcg limit)
and the system total reaches 9 (the global limit), so the picture becomes:

      limit  usage
  sys   9      9
   B    8      8
   A    4      1

In this case, group A is not getting a fair chance to utilize its limit.
Do we need to consider this case also?

b) Even though we are defining a per-cgroup dirty limit, it is not truly
independent.  Is it indirectly dependent on the total system-wide limit
in this implementation?

-Ciju
> To solve this I propose we create a new structure to aggregate both
> dirty limit and usage data:
> 	struct dirty_limits {
> 	       unsigned long dirty_thresh;
> 	       unsigned long background_thresh;
> 	       unsigned long nr_reclaimable;
> 	       unsigned long nr_writeback;
> 	};
>
> global_dirty_limits() would then query both the global and memcg limits
> and dirty usage of one that is closest to its limit.  This change makes
> global_dirty_limits() look like:
>
> void global_dirty_limits(struct dirty_limits *limits)
> {
> 	unsigned long background;
> 	unsigned long dirty;
> 	unsigned long nr_reclaimable;
> 	unsigned long nr_writeback;
> 	unsigned long available_memory = determine_dirtyable_memory();
> 	struct task_struct *tsk;
>
> 	if (vm_dirty_bytes)
> 		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> 	else
> 		dirty = (vm_dirty_ratio * available_memory) / 100;
>
> 	if (dirty_background_bytes)
> 		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> 	else
> 		background = (dirty_background_ratio * available_memory) / 100;
>
> 	nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> 				global_page_state(NR_UNSTABLE_NFS);
> 	nr_writeback = global_page_state(NR_WRITEBACK);
>
> 	if (mem_cgroup_dirty_limits(available_memory, limits) &&
> 	    dirty_available(limits->dirty_thresh, limits->nr_reclaimable,
> 			    limits->nr_writeback) <
> 	    dirty_available(dirty, nr_reclaimable, nr_writeback)) {
> 		dirty = min(dirty, limits->dirty_thresh);
> 		background = min(background, limits->background_thresh);
> 	} else {
> 		limits->nr_reclaimable = nr_reclaimable;
> 		limits->nr_writeback = nr_writeback;
> 	}
>
> 	if (background >= dirty)
> 		background = dirty / 2;
> 	tsk = current;
> 	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
> 		background += background / 4;
> 		dirty += dirty / 4;
> 	}
> 	limits->background_thresh = background;
> 	limits->dirty_thresh = dirty;
> }
>
> Because this approach considers both the memcg and system limits, the
> problem described above is avoided.
>
> I have this change integrated into the memcg dirty limit series (-v3 was
> the last post; v4 is almost ready with this change).  I will post -v4
> with this approach if there is no strong objection.
>
> --
> Greg
>   


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2][memcg+dirtylimit] Fix  overwriting global vm dirty limit setting by memcg (Re: [PATCH v3 00/11] memcg: per cgroup dirty page accounting
  2010-10-25  7:03       ` Ciju Rajan K
@ 2010-10-25  7:08         ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 64+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-25  7:08 UTC (permalink / raw)
  To: Ciju Rajan K
  Cc: Greg Thelen, Andrew Morton, linux-kernel, linux-mm, containers,
	Andrea Righi, Balbir Singh, Daisuke Nishimura, Minchan Kim,
	David Rientjes

On Mon, 25 Oct 2010 12:33:23 +0530
Ciju Rajan K <ciju@linux.vnet.ibm.com> wrote:

> Greg Thelen wrote:
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:
> >
> >   
> >> Fixed one here.
> >> ==
> >> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >>
> >> Now, at calculating dirty limit, vm_dirty_param() is called.
> >> This function returns dirty-limit related parameters considering
> >> memory cgroup settings.
> >>
> >> Now, assume that vm_dirty_bytes=100M (global dirty limit) and
> >> memory cgroup has 1G of pages and 40 dirty_ratio, dirtyable memory is
> >> 500MB.
> >>
> >> In this case, global_dirty_limits will consider dirty_limit as
> >> 500 * 0.4 = 200MB. This is bad...memory cgroup is not a back door.
> >>
> >> This patch limits the return value of vm_dirty_param() considering
> >> global settings.
> >>
> >> Changelog:
> >>  - fixed an argument "mem" int to u64
> >>  - fixed to use global available memory to cap memcg's value.
> >>
> >> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >> ---
> >>  include/linux/memcontrol.h |    5 +++--
> >>  mm/memcontrol.c            |   30 +++++++++++++++++++++++++++++-
> >>  mm/page-writeback.c        |    3 ++-
> >>  3 files changed, 34 insertions(+), 4 deletions(-)
> >>
> >> Index: dirty_limit_new/mm/memcontrol.c
> >> ===================================================================
> >> --- dirty_limit_new.orig/mm/memcontrol.c
> >> +++ dirty_limit_new/mm/memcontrol.c
> >> @@ -1171,9 +1171,11 @@ static void __mem_cgroup_dirty_param(str
> >>   * can be moved after our access and writeback tends to take long time.  At
> >>   * least, "memcg" will not be freed while holding rcu_read_lock().
> >>   */
> >> -void vm_dirty_param(struct vm_dirty_param *param)
> >> +void vm_dirty_param(struct vm_dirty_param *param,
> >> +	 u64 mem, u64 global)
> >>  {
> >>  	struct mem_cgroup *memcg;
> >> +	u64 limit, bglimit;
> >>  
> >>  	if (mem_cgroup_disabled()) {
> >>  		global_vm_dirty_param(param);
> >> @@ -1183,6 +1185,32 @@ void vm_dirty_param(struct vm_dirty_para
> >>  	rcu_read_lock();
> >>  	memcg = mem_cgroup_from_task(current);
> >>  	__mem_cgroup_dirty_param(param, memcg);
> >> +	/*
> >> +	 * A limitation under memory cgroup is under global vm, too.
> >> +	 */
> >> +	if (vm_dirty_ratio)
> >> +		limit = global * vm_dirty_ratio / 100;
> >> +	else
> >> +		limit = vm_dirty_bytes;
> >> +	if (param->dirty_ratio) {
> >> +		param->dirty_bytes = mem * param->dirty_ratio / 100;
> >> +		param->dirty_ratio = 0;
> >> +	}
> >> +	if (param->dirty_bytes > limit)
> >> +		param->dirty_bytes = limit;
> >> +
> >> +	if (dirty_background_ratio)
> >> +		bglimit = global * dirty_background_ratio / 100;
> >> +	else
> >> +		bglimit = dirty_background_bytes;
> >> +
> >> +	if (param->dirty_background_ratio) {
> >> +		param->dirty_background_bytes =
> >> +			mem * param->dirty_background_ratio / 100;
> >> +		param->dirty_background_ratio = 0;
> >> +	}
> >> +	if (param->dirty_background_bytes > bglimit)
> >> +		param->dirty_background_bytes = bglimit;
> >>  	rcu_read_unlock();
> >>  }
> >>  
> >> Index: dirty_limit_new/include/linux/memcontrol.h
> >> ===================================================================
> >> --- dirty_limit_new.orig/include/linux/memcontrol.h
> >> +++ dirty_limit_new/include/linux/memcontrol.h
> >> @@ -171,7 +171,7 @@ static inline void mem_cgroup_dec_page_s
> >>  }
> >>  
> >>  bool mem_cgroup_has_dirty_limit(void);
> >> -void vm_dirty_param(struct vm_dirty_param *param);
> >> +void vm_dirty_param(struct vm_dirty_param *param, u64 mem, u64 global);
> >>  s64 mem_cgroup_page_stat(enum mem_cgroup_nr_pages_item item);
> >>  
> >>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> >> @@ -360,7 +360,8 @@ static inline bool mem_cgroup_has_dirty_
> >>  	return false;
> >>  }
> >>  
> >> -static inline void vm_dirty_param(struct vm_dirty_param *param)
> >> +static inline void vm_dirty_param(struct vm_dirty_param *param,
> >> +		u64 mem, u64 global)
> >>  {
> >>  	global_vm_dirty_param(param);
> >>  }
> >> Index: dirty_limit_new/mm/page-writeback.c
> >> ===================================================================
> >> --- dirty_limit_new.orig/mm/page-writeback.c
> >> +++ dirty_limit_new/mm/page-writeback.c
> >> @@ -466,7 +466,8 @@ void global_dirty_limits(unsigned long *
> >>  	struct task_struct *tsk;
> >>  	struct vm_dirty_param dirty_param;
> >>  
> >> -	vm_dirty_param(&dirty_param);
> >> +	vm_dirty_param(&dirty_param,
> >> +		available_memory, global_dirtyable_memory());
> >>  
> >>  	if (dirty_param.dirty_bytes)
> >>  		dirty = DIV_ROUND_UP(dirty_param.dirty_bytes, PAGE_SIZE);
> >>     
> >
> > I think there is a problem with the patch above.  In the patch
> > vm_dirty_param() sets param->dirty_[background_]bytes to the smallest
> > limits considering the memcg and global limits.  Assuming the current
> > task is in a memcg, then the memcg dirty (not system-wide) usage is
> > always compared to the selected limits (which may be per-memcg or
> > system).  The problem is that if:
> > a) the per-memcg dirty limit is smaller than the system limit, so
> >    vm_dirty_param() will select the per-memcg dirty limit, and
> > b) per-memcg dirty usage is well below the memcg dirty limit, and
> > c) system usage is at the system limit.
> > Then the above patch will not trigger writeback.  Example with two
> > memcg:
> >          sys
> >         B   C
> >       
> >       limit  usage
> >   sys  10     10
> >    B    7      6
> >    C    5      4
> >
> > If B wants to grow, the system will exceed system limit of 10 and should
> > be throttled.  However, the smaller limit (7) will be selected and
> > applied to memcg usage (6), which indicates no need to throttle, so the
> > system could get as bad as:
> >
> >       limit  usage
> >   sys  10     12
> >    B    7      7
> >    C    5      5
> >
> > In this case the system usage exceeds the system limit because each of
> > the per-memcg checks sees no per-memcg problem.
> >
> >   
> What about the following scenarios?
> 
> a)
>       limit  usage
>   sys   9      7
>    B    8      6
>    A    4      1
> 
> Now assume B consumes 2 more.  The total of B reaches 8 (its memcg limit)
> and the system total reaches 9 (the global limit), so the picture becomes:
> 
>       limit  usage
>   sys   9      9
>    B    8      8
>    A    4      1
> 
> In this case, group A is not getting a fair chance to utilize its limit.
> Do we need to consider this case also?
> 

IMHO, it's the admin's job to make the limits fair.
In general, all cgroups should use the same dirty_ratio to get fairness.

> b) Even though we are defining a per-cgroup dirty limit, it is not truly
> independent.  Is it indirectly dependent on the total system-wide limit
> in this implementation?
> 
Yes, it should be.  The memory cgroup isn't a backdoor around the
system-wide control.
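
Put differently, a minimal way to read the net effect of the capping patch
(illustrative names, not literal code from it):

/* Sketch only: the effective per-cgroup limit after the cap is applied. */
static u64 effective_dirty_limit(u64 memcg_dirty_bytes, u64 global_dirty_bytes)
{
	return min(memcg_dirty_bytes, global_dirty_bytes);
}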

Thanks,
-Kame






^ permalink raw reply	[flat|nested] 64+ messages in thread

end of thread, other threads:[~2010-10-25  7:14 UTC | newest]

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-10-19  0:39 [PATCH v3 00/11] memcg: per cgroup dirty page accounting Greg Thelen
2010-10-19  0:39 ` [PATCH v3 01/11] memcg: add page_cgroup flags for dirty page tracking Greg Thelen
2010-10-19  4:31   ` Daisuke Nishimura
2010-10-19  0:39 ` [PATCH v3 02/11] memcg: document cgroup dirty memory interfaces Greg Thelen
2010-10-19  0:46   ` KAMEZAWA Hiroyuki
2010-10-19  8:27   ` Daisuke Nishimura
2010-10-19 21:00     ` Greg Thelen
2010-10-20  0:11       ` KAMEZAWA Hiroyuki
2010-10-20  0:45         ` Greg Thelen
2010-10-20  4:06           ` KAMEZAWA Hiroyuki
2010-10-20  4:25             ` Greg Thelen
2010-10-20  4:26               ` KAMEZAWA Hiroyuki
2010-10-20  0:48         ` Daisuke Nishimura
2010-10-20  1:14           ` KAMEZAWA Hiroyuki
2010-10-20  2:24             ` KAMEZAWA Hiroyuki
2010-10-20  3:47               ` Daisuke Nishimura
2010-10-19  0:39 ` [PATCH v3 03/11] memcg: create extensible page stat update routines Greg Thelen
2010-10-19  0:47   ` KAMEZAWA Hiroyuki
2010-10-19  4:52   ` Daisuke Nishimura
2010-10-19  0:39 ` [PATCH v3 04/11] memcg: add lock to synchronize page accounting and migration Greg Thelen
2010-10-19  0:45   ` KAMEZAWA Hiroyuki
2010-10-19  4:43     ` [RFC][PATCH 1/2] memcg: move_account optimization by reduct put,get page (Re: " KAMEZAWA Hiroyuki
2010-10-19  4:45       ` [RFC][PATCH 2/2] memcg: move_account optimization by reduce locks " KAMEZAWA Hiroyuki
2010-10-19  1:17   ` Minchan Kim
2010-10-19  5:03   ` Daisuke Nishimura
2010-10-19  0:39 ` [PATCH v3 05/11] memcg: add dirty page accounting infrastructure Greg Thelen
2010-10-19  0:49   ` KAMEZAWA Hiroyuki
2010-10-20  0:53   ` Daisuke Nishimura
2010-10-19  0:39 ` [PATCH v3 06/11] memcg: add kernel calls for memcg dirty page stats Greg Thelen
2010-10-19  0:51   ` KAMEZAWA Hiroyuki
2010-10-19  7:03   ` Daisuke Nishimura
2010-10-19  0:39 ` [PATCH v3 07/11] memcg: add dirty limits to mem_cgroup Greg Thelen
2010-10-19  0:53   ` KAMEZAWA Hiroyuki
2010-10-20  0:50   ` Daisuke Nishimura
2010-10-20  4:08     ` Greg Thelen
2010-10-19  0:39 ` [PATCH v3 08/11] memcg: CPU hotplug lockdep warning fix Greg Thelen
2010-10-19  0:54   ` KAMEZAWA Hiroyuki
2010-10-20  3:47   ` Daisuke Nishimura
2010-10-19  0:39 ` [PATCH v3 09/11] memcg: add cgroupfs interface to memcg dirty limits Greg Thelen
2010-10-19  0:56   ` KAMEZAWA Hiroyuki
2010-10-20  3:31   ` Daisuke Nishimura
2010-10-20  3:44     ` KAMEZAWA Hiroyuki
2010-10-20  3:46     ` Daisuke Nishimura
2010-10-19  0:39 ` [PATCH v3 10/11] writeback: make determine_dirtyable_memory() static Greg Thelen
2010-10-19  0:57   ` KAMEZAWA Hiroyuki
2010-10-20  3:47   ` Daisuke Nishimura
2010-10-19  0:39 ` [PATCH v3 11/11] memcg: check memcg dirty limits in page writeback Greg Thelen
2010-10-19  1:00   ` KAMEZAWA Hiroyuki
2010-10-20  4:18     ` KAMEZAWA Hiroyuki
2010-10-20  4:33       ` Greg Thelen
2010-10-20  4:33         ` KAMEZAWA Hiroyuki
2010-10-20  4:34       ` Daisuke Nishimura
2010-10-20  5:25   ` Daisuke Nishimura
2010-10-20  3:21 ` [PATCH][memcg+dirtylimit] Fix overwriting global vm dirty limit setting by memcg (Re: [PATCH v3 00/11] memcg: per cgroup dirty page accounting KAMEZAWA Hiroyuki
2010-10-20  4:14   ` KAMEZAWA Hiroyuki
2010-10-20  5:02   ` [PATCH v2][memcg+dirtylimit] " KAMEZAWA Hiroyuki
2010-10-20  6:09     ` Daisuke Nishimura
2010-10-20 14:35     ` Minchan Kim
2010-10-21  0:10       ` KAMEZAWA Hiroyuki
2010-10-24 18:44     ` Greg Thelen
2010-10-25  0:24       ` KAMEZAWA Hiroyuki
2010-10-25  2:00       ` Daisuke Nishimura
2010-10-25  7:03       ` Ciju Rajan K
2010-10-25  7:08         ` KAMEZAWA Hiroyuki
