* [PATCH 00/10] memcg: per cgroup dirty page accounting
@ 2010-10-04  6:57 Greg Thelen
  2010-10-04  6:57 ` [PATCH 01/10] memcg: add page_cgroup flags for dirty page tracking Greg Thelen
                   ` (14 more replies)
  0 siblings, 15 replies; 96+ messages in thread
From: Greg Thelen @ 2010-10-04  6:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Greg Thelen

This patch set provides the ability for each cgroup to have independent dirty
page limits.

Limiting dirty memory caps the amount of dirty (hard to reclaim) page cache
used by a cgroup.  So, when multiple cgroups are writing, no cgroup can
consume more than its designated share of dirty pages, and a cgroup that
crosses its limit is forced to perform write-out.

These patches were developed and tested on mmotm 2010-09-28-16-13.  The patches
are based on a series proposed by Andrea Righi in Mar 2010.

Overview:
- Add page_cgroup flags to record when pages are dirty, in writeback, or nfs
  unstable.
- Extend mem_cgroup to record the total number of pages in each of the 
  interesting dirty states (dirty, writeback, unstable_nfs).  
- Add dirty parameters similar to the system-wide  /proc/sys/vm/dirty_*
  limits to mem_cgroup.  The mem_cgroup dirty parameters are accessible
  via cgroupfs control files.
- Consider both system and per-memcg dirty limits in page writeback when
  deciding to queue background writeback or block for foreground writeback
  (see the sketch below).
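
The following is a rough sketch, not code from any single patch, of how the
per-memcg check is intended to slot into the writeback path.  The helpers are
introduced by patches 07 and 10; the threshold itself is derived from the
memory.dirty_* parameters described in patch 02:

    /* Sketch: should the current task be throttled by its memcg? */
    static bool sketch_over_memcg_dirty_limit(s64 memcg_thresh)
    {
            s64 nr_reclaimable;

            if (!mem_cgroup_has_dirty_limit())
                    return false;   /* root memcg or memcg disabled */

            /* dirty + unstable NFS pages charged to the current memcg */
            nr_reclaimable = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
            return nr_reclaimable > memcg_thresh;   /* if true, start write-out */
    }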

Known shortcomings:
- When a cgroup dirty limit is exceeded, bdi writeback is employed to write
  back dirty inodes.  Bdi writeback considers inodes from any cgroup, not
  just inodes contributing dirty pages to the cgroup exceeding its limit.

Performance measurements:
- kernel builds are unaffected unless run with a small dirty limit.
- all data collected with CONFIG_CGROUP_MEM_RES_CTLR=y.
- dd has three data points (in secs) for three data sizes (100M, 200M, and 1G).  
  As expected, dd slows when it exceeds its cgroup dirty limit.

               kernel_build          dd
mmotm             2:37        0.18, 0.38, 1.65
  root_memcg

mmotm             2:37        0.18, 0.35, 1.66
  non-root_memcg

mmotm+patches     2:37        0.18, 0.35, 1.68
  root_memcg

mmotm+patches     2:37        0.19, 0.35, 1.69
  non-root_memcg

mmotm+patches     2:37        0.19, 2.34, 22.82
  non-root_memcg
  150 MiB memcg dirty limit

mmotm+patches     3:58        1.71, 3.38, 17.33
  non-root_memcg
  1 MiB memcg dirty limit

Greg Thelen (10):
  memcg: add page_cgroup flags for dirty page tracking
  memcg: document cgroup dirty memory interfaces
  memcg: create extensible page stat update routines
  memcg: disable local interrupts in lock_page_cgroup()
  memcg: add dirty page accounting infrastructure
  memcg: add kernel calls for memcg dirty page stats
  memcg: add dirty limits to mem_cgroup
  memcg: add cgroupfs interface to memcg dirty limits
  writeback: make determine_dirtyable_memory() static.
  memcg: check memcg dirty limits in page writeback

 Documentation/cgroups/memory.txt |   37 ++++
 fs/nfs/write.c                   |    4 +
 include/linux/memcontrol.h       |   78 +++++++-
 include/linux/page_cgroup.h      |   31 +++-
 include/linux/writeback.h        |    2 -
 mm/filemap.c                     |    1 +
 mm/memcontrol.c                  |  426 ++++++++++++++++++++++++++++++++++----
 mm/page-writeback.c              |  211 ++++++++++++-------
 mm/rmap.c                        |    4 +-
 mm/truncate.c                    |    1 +
 10 files changed, 672 insertions(+), 123 deletions(-)



* [PATCH 01/10] memcg: add page_cgroup flags for dirty page tracking
  2010-10-04  6:57 [PATCH 00/10] memcg: per cgroup dirty page accounting Greg Thelen
@ 2010-10-04  6:57 ` Greg Thelen
  2010-10-05  6:20   ` KAMEZAWA Hiroyuki
                     ` (2 more replies)
  2010-10-04  6:57 ` [PATCH 02/10] memcg: document cgroup dirty memory interfaces Greg Thelen
                   ` (13 subsequent siblings)
  14 siblings, 3 replies; 96+ messages in thread
From: Greg Thelen @ 2010-10-04  6:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Greg Thelen

Add additional flags to page_cgroup to track dirty pages
within a mem_cgroup.
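
For illustration only (this sketch is not part of the patch), the
Test-and-{Set,Clear} helpers generated below are meant to gate one-time
accounting in the later dirty accounting code, so a page is charged or
uncharged at most once per state transition:

    /* Sketch: one-shot memcg dirty accounting using the new flag helpers. */
    static void sketch_account_dirty(struct page *page, bool dirtying)
    {
            struct page_cgroup *pc = lookup_page_cgroup(page);

            if (dirtying) {
                    if (!TestSetPageCgroupFileDirty(pc)) {
                            /* flag was clear: charge the memcg dirty counter */
                    }
            } else {
                    if (TestClearPageCgroupFileDirty(pc)) {
                            /* flag was set: uncharge the memcg dirty counter */
                    }
            }
    }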

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrea Righi <arighi@develer.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
---
 include/linux/page_cgroup.h |   23 +++++++++++++++++++++++
 1 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 5bb13b3..b59c298 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -40,6 +40,9 @@ enum {
 	PCG_USED, /* this object is in use. */
 	PCG_ACCT_LRU, /* page has been accounted for */
 	PCG_FILE_MAPPED, /* page is accounted as "mapped" */
+	PCG_FILE_DIRTY, /* page is dirty */
+	PCG_FILE_WRITEBACK, /* page is under writeback */
+	PCG_FILE_UNSTABLE_NFS, /* page is NFS unstable */
 	PCG_MIGRATION, /* under page migration */
 };
 
@@ -59,6 +62,10 @@ static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
 static inline int TestClearPageCgroup##uname(struct page_cgroup *pc)	\
 	{ return test_and_clear_bit(PCG_##lname, &pc->flags);  }
 
+#define TESTSETPCGFLAG(uname, lname)			\
+static inline int TestSetPageCgroup##uname(struct page_cgroup *pc)	\
+	{ return test_and_set_bit(PCG_##lname, &pc->flags);  }
+
 TESTPCGFLAG(Locked, LOCK)
 
 /* Cache flag is set only once (at allocation) */
@@ -80,6 +87,22 @@ SETPCGFLAG(FileMapped, FILE_MAPPED)
 CLEARPCGFLAG(FileMapped, FILE_MAPPED)
 TESTPCGFLAG(FileMapped, FILE_MAPPED)
 
+SETPCGFLAG(FileDirty, FILE_DIRTY)
+CLEARPCGFLAG(FileDirty, FILE_DIRTY)
+TESTPCGFLAG(FileDirty, FILE_DIRTY)
+TESTCLEARPCGFLAG(FileDirty, FILE_DIRTY)
+TESTSETPCGFLAG(FileDirty, FILE_DIRTY)
+
+SETPCGFLAG(FileWriteback, FILE_WRITEBACK)
+CLEARPCGFLAG(FileWriteback, FILE_WRITEBACK)
+TESTPCGFLAG(FileWriteback, FILE_WRITEBACK)
+
+SETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+CLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+TESTPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+TESTCLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+TESTSETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+
 SETPCGFLAG(Migration, MIGRATION)
 CLEARPCGFLAG(Migration, MIGRATION)
 TESTPCGFLAG(Migration, MIGRATION)
-- 
1.7.1



* [PATCH 02/10] memcg: document cgroup dirty memory interfaces
  2010-10-04  6:57 [PATCH 00/10] memcg: per cgroup dirty page accounting Greg Thelen
  2010-10-04  6:57 ` [PATCH 01/10] memcg: add page_cgroup flags for dirty page tracking Greg Thelen
@ 2010-10-04  6:57 ` Greg Thelen
  2010-10-05  6:48   ` KAMEZAWA Hiroyuki
                     ` (2 more replies)
  2010-10-04  6:57 ` [PATCH 03/10] memcg: create extensible page stat update routines Greg Thelen
                   ` (12 subsequent siblings)
  14 siblings, 3 replies; 96+ messages in thread
From: Greg Thelen @ 2010-10-04  6:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Greg Thelen

Document cgroup dirty memory interfaces and statistics.

Signed-off-by: Andrea Righi <arighi@develer.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
---
 Documentation/cgroups/memory.txt |   37 +++++++++++++++++++++++++++++++++++++
 1 files changed, 37 insertions(+), 0 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 7781857..eab65e2 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -385,6 +385,10 @@ mapped_file	- # of bytes of mapped file (includes tmpfs/shmem)
 pgpgin		- # of pages paged in (equivalent to # of charging events).
 pgpgout		- # of pages paged out (equivalent to # of uncharging events).
 swap		- # of bytes of swap usage
+dirty		- # of bytes that are waiting to get written back to the disk.
+writeback	- # of bytes that are actively being written back to the disk.
+nfs		- # of bytes sent to the NFS server, but not yet committed to
+		the actual storage.
 inactive_anon	- # of bytes of anonymous memory and swap cache memory on
 		LRU list.
 active_anon	- # of bytes of anonymous and swap cache memory on active
@@ -453,6 +457,39 @@ memory under it will be reclaimed.
 You can reset failcnt by writing 0 to failcnt file.
 # echo 0 > .../memory.failcnt
 
+5.5 dirty memory
+
+Control the maximum amount of dirty pages a cgroup can have at any given time.
+
+Limiting dirty memory caps the amount of dirty (hard to reclaim) page cache
+used by a cgroup.  So, when multiple cgroups are writing, no cgroup can
+consume more than its designated share of dirty pages, and a cgroup that
+crosses its limit is forced to perform write-out.
+
+The interface is equivalent to the procfs interface: /proc/sys/vm/dirty_*.  It
+is possible to configure a limit to trigger both a direct writeback or a
+background writeback performed by per-bdi flusher threads.  The root cgroup
+memory.dirty_* control files are read-only and match the contents of
+the /proc/sys/vm/dirty_* files.
+
+Per-cgroup dirty limits can be set using the following files in the cgroupfs:
+
+- memory.dirty_ratio: the amount of dirty memory (expressed as a percentage of
+  cgroup memory) at which a process generating dirty pages will itself start
+  writing out dirty data.
+
+- memory.dirty_bytes: the amount of dirty memory (expressed in bytes) in the
+  cgroup at which a process generating dirty pages will itself start writing
+  out dirty data.
+
+- memory.dirty_background_ratio: the amount of dirty memory of the cgroup
+  (expressed as a percentage of cgroup memory) at which background writeback
+  kernel threads will start writing out dirty data.
+
+- memory.dirty_background_bytes: the amount of dirty memory (expressed in bytes)
+  in the cgroup at which background writeback kernel threads will start writing
+  out dirty data.
+
 6. Hierarchy support
 
 The memory controller supports a deep hierarchy and hierarchical accounting.
-- 
1.7.1



* [PATCH 03/10] memcg: create extensible page stat update routines
  2010-10-04  6:57 [PATCH 00/10] memcg: per cgroup dirty page accounting Greg Thelen
  2010-10-04  6:57 ` [PATCH 01/10] memcg: add page_cgroup flags for dirty page tracking Greg Thelen
  2010-10-04  6:57 ` [PATCH 02/10] memcg: document cgroup dirty memory interfaces Greg Thelen
@ 2010-10-04  6:57 ` Greg Thelen
  2010-10-04 13:48   ` Ciju Rajan K
                     ` (3 more replies)
  2010-10-04  6:57 ` [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup() Greg Thelen
                   ` (11 subsequent siblings)
  14 siblings, 4 replies; 96+ messages in thread
From: Greg Thelen @ 2010-10-04  6:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Greg Thelen

Replace usage of the mem_cgroup_update_file_mapped() memcg
statistic update routine with two new routines:
* mem_cgroup_inc_page_stat()
* mem_cgroup_dec_page_stat()

As before, only the file_mapped statistic is managed.  However,
these more general interfaces allow new statistics to be added more
easily.  New statistics are added later in this series by the memcg
dirty page accounting patches.
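
For example, a later call site for the dirty statistic is expected to pair
the memcg update with the matching zone counter, in the same way the
file_mapped callers below do (hedged sketch; MEMCG_NR_FILE_DIRTY is added by
a later patch in this series):

    /* Sketch: account a page cache page becoming dirty / clean. */
    static void sketch_account_dirtied(struct page *page)
    {
            mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_DIRTY);
            inc_zone_page_state(page, NR_FILE_DIRTY);
    }

    static void sketch_account_cleaned(struct page *page)
    {
            mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
            dec_zone_page_state(page, NR_FILE_DIRTY);
    }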

Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrea Righi <arighi@develer.com>
---
 include/linux/memcontrol.h |   31 ++++++++++++++++++++++++++++---
 mm/memcontrol.c            |   17 ++++++++---------
 mm/rmap.c                  |    4 ++--
 3 files changed, 38 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 159a076..7c7bec4 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -25,6 +25,11 @@ struct page_cgroup;
 struct page;
 struct mm_struct;
 
+/* Stats that can be updated by kernel. */
+enum mem_cgroup_write_page_stat_item {
+	MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
+};
+
 extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
 					struct list_head *dst,
 					unsigned long *scanned, int order,
@@ -121,7 +126,22 @@ static inline bool mem_cgroup_disabled(void)
 	return false;
 }
 
-void mem_cgroup_update_file_mapped(struct page *page, int val);
+void mem_cgroup_update_page_stat(struct page *page,
+				 enum mem_cgroup_write_page_stat_item idx,
+				 int val);
+
+static inline void mem_cgroup_inc_page_stat(struct page *page,
+				enum mem_cgroup_write_page_stat_item idx)
+{
+	mem_cgroup_update_page_stat(page, idx, 1);
+}
+
+static inline void mem_cgroup_dec_page_stat(struct page *page,
+				enum mem_cgroup_write_page_stat_item idx)
+{
+	mem_cgroup_update_page_stat(page, idx, -1);
+}
+
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask);
 u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
@@ -293,8 +313,13 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
 {
 }
 
-static inline void mem_cgroup_update_file_mapped(struct page *page,
-							int val)
+static inline void mem_cgroup_inc_page_stat(struct page *page,
+				enum mem_cgroup_write_page_stat_item idx)
+{
+}
+
+static inline void mem_cgroup_dec_page_stat(struct page *page,
+				enum mem_cgroup_write_page_stat_item idx)
 {
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 512cb12..f4259f4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1592,7 +1592,9 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
  * possibility of race condition. If there is, we take a lock.
  */
 
-static void mem_cgroup_update_file_stat(struct page *page, int idx, int val)
+void mem_cgroup_update_page_stat(struct page *page,
+				 enum mem_cgroup_write_page_stat_item idx,
+				 int val)
 {
 	struct mem_cgroup *mem;
 	struct page_cgroup *pc = lookup_page_cgroup(page);
@@ -1615,30 +1617,27 @@ static void mem_cgroup_update_file_stat(struct page *page, int idx, int val)
 			goto out;
 	}
 
-	this_cpu_add(mem->stat->count[idx], val);
-
 	switch (idx) {
-	case MEM_CGROUP_STAT_FILE_MAPPED:
+	case MEMCG_NR_FILE_MAPPED:
 		if (val > 0)
 			SetPageCgroupFileMapped(pc);
 		else if (!page_mapped(page))
 			ClearPageCgroupFileMapped(pc);
+		idx = MEM_CGROUP_STAT_FILE_MAPPED;
 		break;
 	default:
 		BUG();
 	}
 
+	this_cpu_add(mem->stat->count[idx], val);
+
 out:
 	if (unlikely(need_unlock))
 		unlock_page_cgroup(pc);
 	rcu_read_unlock();
 	return;
 }
-
-void mem_cgroup_update_file_mapped(struct page *page, int val)
-{
-	mem_cgroup_update_file_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, val);
-}
+EXPORT_SYMBOL(mem_cgroup_update_page_stat);
 
 /*
  * size of first charge trial. "32" comes from vmscan.c's magic value.
diff --git a/mm/rmap.c b/mm/rmap.c
index 8734312..779c0db 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -912,7 +912,7 @@ void page_add_file_rmap(struct page *page)
 {
 	if (atomic_inc_and_test(&page->_mapcount)) {
 		__inc_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_update_file_mapped(page, 1);
+		mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
 	}
 }
 
@@ -950,7 +950,7 @@ void page_remove_rmap(struct page *page)
 		__dec_zone_page_state(page, NR_ANON_PAGES);
 	} else {
 		__dec_zone_page_state(page, NR_FILE_MAPPED);
-		mem_cgroup_update_file_mapped(page, -1);
+		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
 	}
 	/*
 	 * It would be tidy to reset the PageAnon mapping here,
-- 
1.7.1



* [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()
  2010-10-04  6:57 [PATCH 00/10] memcg: per cgroup dirty page accounting Greg Thelen
                   ` (2 preceding siblings ...)
  2010-10-04  6:57 ` [PATCH 03/10] memcg: create extensible page stat update routines Greg Thelen
@ 2010-10-04  6:57 ` Greg Thelen
  2010-10-05  6:54   ` KAMEZAWA Hiroyuki
                     ` (2 more replies)
  2010-10-04  6:58 ` [PATCH 05/10] memcg: add dirty page accounting infrastructure Greg Thelen
                   ` (10 subsequent siblings)
  14 siblings, 3 replies; 96+ messages in thread
From: Greg Thelen @ 2010-10-04  6:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Greg Thelen

If pages are being migrated from a memcg, then updates to that
memcg's page statistics are protected by grabbing a bit spin lock
using lock_page_cgroup().  In an upcoming commit, memcg dirty page
accounting will update memcg page statistics (specifically: the
number of writeback pages) from softirq context.  Avoid a
deadlocking nested spin lock attempt by disabling interrupts on the
local processor when grabbing the page_cgroup bit_spin_lock in
lock_page_cgroup().  This avoids the following deadlock:
      CPU 0             CPU 1
                    inc_file_mapped
                    rcu_read_lock
  start move
  synchronize_rcu
                    lock_page_cgroup
                      softirq
                      test_clear_page_writeback
                      mem_cgroup_dec_page_stat(NR_WRITEBACK)
                      rcu_read_lock
                      lock_page_cgroup   /* deadlock */
                      unlock_page_cgroup
                      rcu_read_unlock
                    unlock_page_cgroup
                    rcu_read_unlock

By disabling interrupts in lock_page_cgroup(), such nested calls are
avoided.  The softirq is delayed until inc_file_mapped re-enables
interrupts by calling unlock_page_cgroup().

The normal fast path of memcg page stat updates typically does not
need to call lock_page_cgroup(), so this change does not affect the
performance of common-case page accounting.
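
As a small, hedged sketch (the real callers are in the diff below), the new
interface keeps interrupts disabled for the whole critical section, so a
softirq that also needs the page_cgroup lock cannot run on the same CPU
while the lock is held:

    /* Sketch: irq-safe lock/unlock pairing used by statistic updaters. */
    static void sketch_locked_update(struct page_cgroup *pc)
    {
            unsigned long flags;

            lock_page_cgroup(pc, &flags);   /* local_irq_save + bit_spin_lock */
            /* ... read or update pc->mem_cgroup and its counters safely ... */
            unlock_page_cgroup(pc, flags);  /* bit_spin_unlock + local_irq_restore */
    }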

Signed-off-by: Andrea Righi <arighi@develer.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
---
 include/linux/page_cgroup.h |    8 +++++-
 mm/memcontrol.c             |   51 +++++++++++++++++++++++++-----------------
 2 files changed, 36 insertions(+), 23 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index b59c298..872f6b1 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -117,14 +117,18 @@ static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
 	return page_zonenum(pc->page);
 }
 
-static inline void lock_page_cgroup(struct page_cgroup *pc)
+static inline void lock_page_cgroup(struct page_cgroup *pc,
+				    unsigned long *flags)
 {
+	local_irq_save(*flags);
 	bit_spin_lock(PCG_LOCK, &pc->flags);
 }
 
-static inline void unlock_page_cgroup(struct page_cgroup *pc)
+static inline void unlock_page_cgroup(struct page_cgroup *pc,
+				      unsigned long flags)
 {
 	bit_spin_unlock(PCG_LOCK, &pc->flags);
+	local_irq_restore(flags);
 }
 
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f4259f4..267d774 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1599,6 +1599,7 @@ void mem_cgroup_update_page_stat(struct page *page,
 	struct mem_cgroup *mem;
 	struct page_cgroup *pc = lookup_page_cgroup(page);
 	bool need_unlock = false;
+	unsigned long flags;
 
 	if (unlikely(!pc))
 		return;
@@ -1610,7 +1611,7 @@ void mem_cgroup_update_page_stat(struct page *page,
 	/* pc->mem_cgroup is unstable ? */
 	if (unlikely(mem_cgroup_stealed(mem))) {
 		/* take a lock against to access pc->mem_cgroup */
-		lock_page_cgroup(pc);
+		lock_page_cgroup(pc, &flags);
 		need_unlock = true;
 		mem = pc->mem_cgroup;
 		if (!mem || !PageCgroupUsed(pc))
@@ -1633,7 +1634,7 @@ void mem_cgroup_update_page_stat(struct page *page,
 
 out:
 	if (unlikely(need_unlock))
-		unlock_page_cgroup(pc);
+		unlock_page_cgroup(pc, flags);
 	rcu_read_unlock();
 	return;
 }
@@ -2053,11 +2054,12 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 	struct page_cgroup *pc;
 	unsigned short id;
 	swp_entry_t ent;
+	unsigned long flags;
 
 	VM_BUG_ON(!PageLocked(page));
 
 	pc = lookup_page_cgroup(page);
-	lock_page_cgroup(pc);
+	lock_page_cgroup(pc, &flags);
 	if (PageCgroupUsed(pc)) {
 		mem = pc->mem_cgroup;
 		if (mem && !css_tryget(&mem->css))
@@ -2071,7 +2073,7 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 			mem = NULL;
 		rcu_read_unlock();
 	}
-	unlock_page_cgroup(pc);
+	unlock_page_cgroup(pc, flags);
 	return mem;
 }
 
@@ -2084,13 +2086,15 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
 				     struct page_cgroup *pc,
 				     enum charge_type ctype)
 {
+	unsigned long flags;
+
 	/* try_charge() can return NULL to *memcg, taking care of it. */
 	if (!mem)
 		return;
 
-	lock_page_cgroup(pc);
+	lock_page_cgroup(pc, &flags);
 	if (unlikely(PageCgroupUsed(pc))) {
-		unlock_page_cgroup(pc);
+		unlock_page_cgroup(pc, flags);
 		mem_cgroup_cancel_charge(mem);
 		return;
 	}
@@ -2120,7 +2124,7 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
 
 	mem_cgroup_charge_statistics(mem, pc, true);
 
-	unlock_page_cgroup(pc);
+	unlock_page_cgroup(pc, flags);
 	/*
 	 * "charge_statistics" updated event counter. Then, check it.
 	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
@@ -2187,12 +2191,13 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
 		struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
 {
 	int ret = -EINVAL;
-	lock_page_cgroup(pc);
+	unsigned long flags;
+	lock_page_cgroup(pc, &flags);
 	if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
 		__mem_cgroup_move_account(pc, from, to, uncharge);
 		ret = 0;
 	}
-	unlock_page_cgroup(pc);
+	unlock_page_cgroup(pc, flags);
 	/*
 	 * check events
 	 */
@@ -2298,6 +2303,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask)
 {
 	int ret;
+	unsigned long flags;
 
 	if (mem_cgroup_disabled())
 		return 0;
@@ -2320,12 +2326,12 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 		pc = lookup_page_cgroup(page);
 		if (!pc)
 			return 0;
-		lock_page_cgroup(pc);
+		lock_page_cgroup(pc, &flags);
 		if (PageCgroupUsed(pc)) {
-			unlock_page_cgroup(pc);
+			unlock_page_cgroup(pc, flags);
 			return 0;
 		}
-		unlock_page_cgroup(pc);
+		unlock_page_cgroup(pc, flags);
 	}
 
 	if (unlikely(!mm))
@@ -2511,6 +2517,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 {
 	struct page_cgroup *pc;
 	struct mem_cgroup *mem = NULL;
+	unsigned long flags;
 
 	if (mem_cgroup_disabled())
 		return NULL;
@@ -2525,7 +2532,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 	if (unlikely(!pc || !PageCgroupUsed(pc)))
 		return NULL;
 
-	lock_page_cgroup(pc);
+	lock_page_cgroup(pc, &flags);
 
 	mem = pc->mem_cgroup;
 
@@ -2560,7 +2567,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 	 * special functions.
 	 */
 
-	unlock_page_cgroup(pc);
+	unlock_page_cgroup(pc, flags);
 	/*
 	 * even after unlock, we have mem->res.usage here and this memcg
 	 * will never be freed.
@@ -2576,7 +2583,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 	return mem;
 
 unlock_out:
-	unlock_page_cgroup(pc);
+	unlock_page_cgroup(pc, flags);
 	return NULL;
 }
 
@@ -2765,12 +2772,13 @@ int mem_cgroup_prepare_migration(struct page *page,
 	struct mem_cgroup *mem = NULL;
 	enum charge_type ctype;
 	int ret = 0;
+	unsigned long flags;
 
 	if (mem_cgroup_disabled())
 		return 0;
 
 	pc = lookup_page_cgroup(page);
-	lock_page_cgroup(pc);
+	lock_page_cgroup(pc, &flags);
 	if (PageCgroupUsed(pc)) {
 		mem = pc->mem_cgroup;
 		css_get(&mem->css);
@@ -2806,7 +2814,7 @@ int mem_cgroup_prepare_migration(struct page *page,
 		if (PageAnon(page))
 			SetPageCgroupMigration(pc);
 	}
-	unlock_page_cgroup(pc);
+	unlock_page_cgroup(pc, flags);
 	/*
 	 * If the page is not charged at this point,
 	 * we return here.
@@ -2819,9 +2827,9 @@ int mem_cgroup_prepare_migration(struct page *page,
 	css_put(&mem->css);/* drop extra refcnt */
 	if (ret || *ptr == NULL) {
 		if (PageAnon(page)) {
-			lock_page_cgroup(pc);
+			lock_page_cgroup(pc, &flags);
 			ClearPageCgroupMigration(pc);
-			unlock_page_cgroup(pc);
+			unlock_page_cgroup(pc, flags);
 			/*
 			 * The old page may be fully unmapped while we kept it.
 			 */
@@ -2852,6 +2860,7 @@ void mem_cgroup_end_migration(struct mem_cgroup *mem,
 {
 	struct page *used, *unused;
 	struct page_cgroup *pc;
+	unsigned long flags;
 
 	if (!mem)
 		return;
@@ -2871,9 +2880,9 @@ void mem_cgroup_end_migration(struct mem_cgroup *mem,
 	 * Clear the flag and check the page should be charged.
 	 */
 	pc = lookup_page_cgroup(oldpage);
-	lock_page_cgroup(pc);
+	lock_page_cgroup(pc, &flags);
 	ClearPageCgroupMigration(pc);
-	unlock_page_cgroup(pc);
+	unlock_page_cgroup(pc, flags);
 
 	__mem_cgroup_uncharge_common(unused, MEM_CGROUP_CHARGE_TYPE_FORCE);
 
-- 
1.7.1



* [PATCH 05/10] memcg: add dirty page accounting infrastructure
  2010-10-04  6:57 [PATCH 00/10] memcg: per cgroup dirty page accounting Greg Thelen
                   ` (3 preceding siblings ...)
  2010-10-04  6:57 ` [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup() Greg Thelen
@ 2010-10-04  6:58 ` Greg Thelen
  2010-10-05  7:22   ` KAMEZAWA Hiroyuki
  2010-10-05 16:09   ` Minchan Kim
  2010-10-04  6:58 ` [PATCH 06/10] memcg: add kernel calls for memcg dirty page stats Greg Thelen
                   ` (9 subsequent siblings)
  14 siblings, 2 replies; 96+ messages in thread
From: Greg Thelen @ 2010-10-04  6:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Greg Thelen

Add memcg routines to track dirty, writeback, and unstable_NFS pages.
These routines are not yet used by the kernel to count such pages.
A later change adds kernel calls to these new routines.

Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrea Righi <arighi@develer.com>
---
 include/linux/memcontrol.h |    3 +
 mm/memcontrol.c            |   89 ++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 84 insertions(+), 8 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7c7bec4..6303da1 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -28,6 +28,9 @@ struct mm_struct;
 /* Stats that can be updated by kernel. */
 enum mem_cgroup_write_page_stat_item {
 	MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
+	MEMCG_NR_FILE_DIRTY, /* # of dirty pages in page cache */
+	MEMCG_NR_FILE_WRITEBACK, /* # of pages under writeback */
+	MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
 };
 
 extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 267d774..f40839f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -85,10 +85,13 @@ enum mem_cgroup_stat_index {
 	 */
 	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
 	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
-	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
 	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
 	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
 	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
+	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
+	MEM_CGROUP_STAT_FILE_DIRTY,	/* # of dirty pages in page cache */
+	MEM_CGROUP_STAT_FILE_WRITEBACK,		/* # of pages under writeback */
+	MEM_CGROUP_STAT_FILE_UNSTABLE_NFS,	/* # of NFS unstable pages */
 	MEM_CGROUP_STAT_DATA, /* end of data requires synchronization */
 	/* incremented at every  pagein/pageout */
 	MEM_CGROUP_EVENTS = MEM_CGROUP_STAT_DATA,
@@ -1626,6 +1629,48 @@ void mem_cgroup_update_page_stat(struct page *page,
 			ClearPageCgroupFileMapped(pc);
 		idx = MEM_CGROUP_STAT_FILE_MAPPED;
 		break;
+
+	case MEMCG_NR_FILE_DIRTY:
+		/* Use Test{Set,Clear} to only un/charge the memcg once. */
+		if (val > 0) {
+			if (TestSetPageCgroupFileDirty(pc))
+				/* already set */
+				val = 0;
+		} else {
+			if (!TestClearPageCgroupFileDirty(pc))
+				/* already cleared */
+				val = 0;
+		}
+		idx = MEM_CGROUP_STAT_FILE_DIRTY;
+		break;
+
+	case MEMCG_NR_FILE_WRITEBACK:
+		/*
+		 * This counter is adjusted while holding the mapping's
+		 * tree_lock.  Therefore there is no race between settings and
+		 * clearing of this flag.
+		 */
+		if (val > 0)
+			SetPageCgroupFileWriteback(pc);
+		else
+			ClearPageCgroupFileWriteback(pc);
+		idx = MEM_CGROUP_STAT_FILE_WRITEBACK;
+		break;
+
+	case MEMCG_NR_FILE_UNSTABLE_NFS:
+		/* Use Test{Set,Clear} to only un/charge the memcg once. */
+		if (val > 0) {
+			if (TestSetPageCgroupFileUnstableNFS(pc))
+				/* already set */
+				val = 0;
+		} else {
+			if (!TestClearPageCgroupFileUnstableNFS(pc))
+				/* already cleared */
+				val = 0;
+		}
+		idx = MEM_CGROUP_STAT_FILE_UNSTABLE_NFS;
+		break;
+
 	default:
 		BUG();
 	}
@@ -2133,6 +2178,16 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
 	memcg_check_events(mem, pc->page);
 }
 
+static void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
+					      struct mem_cgroup *to,
+					      enum mem_cgroup_stat_index idx)
+{
+	preempt_disable();
+	__this_cpu_dec(from->stat->count[idx]);
+	__this_cpu_inc(to->stat->count[idx]);
+	preempt_enable();
+}
+
 /**
  * __mem_cgroup_move_account - move account of the page
  * @pc:	page_cgroup of the page.
@@ -2159,13 +2214,18 @@ static void __mem_cgroup_move_account(struct page_cgroup *pc,
 	VM_BUG_ON(!PageCgroupUsed(pc));
 	VM_BUG_ON(pc->mem_cgroup != from);
 
-	if (PageCgroupFileMapped(pc)) {
-		/* Update mapped_file data for mem_cgroup */
-		preempt_disable();
-		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		preempt_enable();
-	}
+	if (PageCgroupFileMapped(pc))
+		mem_cgroup_move_account_page_stat(from, to,
+					MEM_CGROUP_STAT_FILE_MAPPED);
+	if (PageCgroupFileDirty(pc))
+		mem_cgroup_move_account_page_stat(from, to,
+					MEM_CGROUP_STAT_FILE_DIRTY);
+	if (PageCgroupFileWriteback(pc))
+		mem_cgroup_move_account_page_stat(from, to,
+					MEM_CGROUP_STAT_FILE_WRITEBACK);
+	if (PageCgroupFileUnstableNFS(pc))
+		mem_cgroup_move_account_page_stat(from, to,
+					MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
 	mem_cgroup_charge_statistics(from, pc, false);
 	if (uncharge)
 		/* This is not "cancel", but cancel_charge does all we need. */
@@ -3545,6 +3605,9 @@ enum {
 	MCS_PGPGIN,
 	MCS_PGPGOUT,
 	MCS_SWAP,
+	MCS_FILE_DIRTY,
+	MCS_WRITEBACK,
+	MCS_UNSTABLE_NFS,
 	MCS_INACTIVE_ANON,
 	MCS_ACTIVE_ANON,
 	MCS_INACTIVE_FILE,
@@ -3567,6 +3630,9 @@ struct {
 	{"pgpgin", "total_pgpgin"},
 	{"pgpgout", "total_pgpgout"},
 	{"swap", "total_swap"},
+	{"dirty", "total_dirty"},
+	{"writeback", "total_writeback"},
+	{"nfs", "total_nfs"},
 	{"inactive_anon", "total_inactive_anon"},
 	{"active_anon", "total_active_anon"},
 	{"inactive_file", "total_inactive_file"},
@@ -3596,6 +3662,13 @@ mem_cgroup_get_local_stat(struct mem_cgroup *mem, struct mcs_total_stat *s)
 		s->stat[MCS_SWAP] += val * PAGE_SIZE;
 	}
 
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
+	s->stat[MCS_FILE_DIRTY] += val * PAGE_SIZE;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_WRITEBACK);
+	s->stat[MCS_WRITEBACK] += val * PAGE_SIZE;
+	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
+	s->stat[MCS_UNSTABLE_NFS] += val * PAGE_SIZE;
+
 	/* per zone stat */
 	val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
 	s->stat[MCS_INACTIVE_ANON] += val * PAGE_SIZE;
-- 
1.7.1



* [PATCH 06/10] memcg: add kernel calls for memcg dirty page stats
  2010-10-04  6:57 [PATCH 00/10] memcg: per cgroup dirty page accounting Greg Thelen
                   ` (4 preceding siblings ...)
  2010-10-04  6:58 ` [PATCH 05/10] memcg: add dirty page accounting infrastructure Greg Thelen
@ 2010-10-04  6:58 ` Greg Thelen
  2010-10-05  6:55   ` KAMEZAWA Hiroyuki
  2010-10-04  6:58 ` [PATCH 07/10] memcg: add dirty limits to mem_cgroup Greg Thelen
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 96+ messages in thread
From: Greg Thelen @ 2010-10-04  6:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Greg Thelen

Add calls into memcg dirty page accounting.  Notify memcg when pages
transition between clean, file dirty, writeback, and unstable nfs.
This allows the memory controller to maintain an accurate view of
the amount of its memory that is dirty.

Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrea Righi <arighi@develer.com>
---
 fs/nfs/write.c      |    4 ++++
 mm/filemap.c        |    1 +
 mm/page-writeback.c |    4 ++++
 mm/truncate.c       |    1 +
 4 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 48199fb..9e206bd 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -450,6 +450,7 @@ nfs_mark_request_commit(struct nfs_page *req)
 			NFS_PAGE_TAG_COMMIT);
 	nfsi->ncommit++;
 	spin_unlock(&inode->i_lock);
+	mem_cgroup_inc_page_stat(req->wb_page, MEMCG_NR_FILE_UNSTABLE_NFS);
 	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
 	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
 	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
@@ -461,6 +462,7 @@ nfs_clear_request_commit(struct nfs_page *req)
 	struct page *page = req->wb_page;
 
 	if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
+		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_UNSTABLE_NFS);
 		dec_zone_page_state(page, NR_UNSTABLE_NFS);
 		dec_bdi_stat(page->mapping->backing_dev_info, BDI_RECLAIMABLE);
 		return 1;
@@ -1316,6 +1318,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
 		req = nfs_list_entry(head->next);
 		nfs_list_remove_request(req);
 		nfs_mark_request_commit(req);
+		mem_cgroup_dec_page_stat(req->wb_page,
+					 MEMCG_NR_FILE_UNSTABLE_NFS);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
 		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
 				BDI_RECLAIMABLE);
diff --git a/mm/filemap.c b/mm/filemap.c
index 3d4df44..82e0870 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
 	 * having removed the page entirely.
 	 */
 	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
 		dec_zone_page_state(page, NR_FILE_DIRTY);
 		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
 	}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b840afa..820eb66 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1114,6 +1114,7 @@ int __set_page_dirty_no_writeback(struct page *page)
 void account_page_dirtied(struct page *page, struct address_space *mapping)
 {
 	if (mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_DIRTIED);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
@@ -1303,6 +1304,7 @@ int clear_page_dirty_for_io(struct page *page)
 		 * for more comments.
 		 */
 		if (TestClearPageDirty(page)) {
+			mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
@@ -1333,6 +1335,7 @@ int test_clear_page_writeback(struct page *page)
 				__dec_bdi_stat(bdi, BDI_WRITEBACK);
 				__bdi_writeout_inc(bdi);
 			}
+			mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_WRITEBACK);
 		}
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
@@ -1360,6 +1363,7 @@ int test_set_page_writeback(struct page *page)
 						PAGECACHE_TAG_WRITEBACK);
 			if (bdi_cap_account_writeback(bdi))
 				__inc_bdi_stat(bdi, BDI_WRITEBACK);
+			mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_WRITEBACK);
 		}
 		if (!PageDirty(page))
 			radix_tree_tag_clear(&mapping->page_tree,
diff --git a/mm/truncate.c b/mm/truncate.c
index ba887bf..551dc23 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -74,6 +74,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
 	if (TestClearPageDirty(page)) {
 		struct address_space *mapping = page->mapping;
 		if (mapping && mapping_cap_account_dirty(mapping)) {
+			mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
-- 
1.7.1



* [PATCH 07/10] memcg: add dirty limits to mem_cgroup
  2010-10-04  6:57 [PATCH 00/10] memcg: per cgroup dirty page accounting Greg Thelen
                   ` (5 preceding siblings ...)
  2010-10-04  6:58 ` [PATCH 06/10] memcg: add kernel calls for memcg dirty page stats Greg Thelen
@ 2010-10-04  6:58 ` Greg Thelen
  2010-10-05  7:07   ` KAMEZAWA Hiroyuki
  2010-10-05  9:43   ` Andrea Righi
  2010-10-04  6:58 ` [PATCH 08/10] memcg: add cgroupfs interface to memcg dirty limits Greg Thelen
                   ` (7 subsequent siblings)
  14 siblings, 2 replies; 96+ messages in thread
From: Greg Thelen @ 2010-10-04  6:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Greg Thelen

Extend mem_cgroup to contain dirty page limits.  Also add routines
allowing the kernel to query the dirty usage of a memcg.

These interfaces are not yet used by the kernel.  A subsequent
commit adds kernel calls that utilize these new routines.
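
A hedged usage sketch (the real consumer is patch 10 of this series): callers
check mem_cgroup_has_dirty_limit(), snapshot the parameters, and derive a
threshold in pages.  The snapshot is not synchronized with writers, so both a
ratio and a byte value may appear set; treating the byte value as
authoritative mirrors the global vm_dirty_bytes handling and is an assumption
of this sketch, not something the patch enforces:

    /* Sketch: derive a per-memcg dirty threshold, in pages. */
    static s64 sketch_memcg_dirty_thresh(void)
    {
            struct vm_dirty_param param;
            s64 dirtyable;

            if (!mem_cgroup_has_dirty_limit())
                    return -1;      /* caller should use the global limits */

            get_vm_dirty_param(&param);
            dirtyable = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);

            if (param.dirty_bytes)  /* bytes take precedence (assumption) */
                    return param.dirty_bytes / PAGE_SIZE;
            return (param.dirty_ratio * dirtyable) / 100;
    }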

Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrea Righi <arighi@develer.com>
---
 include/linux/memcontrol.h |   44 +++++++++++
 mm/memcontrol.c            |  180 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 223 insertions(+), 1 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6303da1..dc8952d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -19,6 +19,7 @@
 
 #ifndef _LINUX_MEMCONTROL_H
 #define _LINUX_MEMCONTROL_H
+#include <linux/writeback.h>
 #include <linux/cgroup.h>
 struct mem_cgroup;
 struct page_cgroup;
@@ -33,6 +34,30 @@ enum mem_cgroup_write_page_stat_item {
 	MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
 };
 
+/* Cgroup memory statistics items exported to the kernel */
+enum mem_cgroup_read_page_stat_item {
+	MEMCG_NR_DIRTYABLE_PAGES,
+	MEMCG_NR_RECLAIM_PAGES,
+	MEMCG_NR_WRITEBACK,
+	MEMCG_NR_DIRTY_WRITEBACK_PAGES,
+};
+
+/* Dirty memory parameters */
+struct vm_dirty_param {
+	int dirty_ratio;
+	int dirty_background_ratio;
+	unsigned long dirty_bytes;
+	unsigned long dirty_background_bytes;
+};
+
+static inline void get_global_vm_dirty_param(struct vm_dirty_param *param)
+{
+	param->dirty_ratio = vm_dirty_ratio;
+	param->dirty_bytes = vm_dirty_bytes;
+	param->dirty_background_ratio = dirty_background_ratio;
+	param->dirty_background_bytes = dirty_background_bytes;
+}
+
 extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
 					struct list_head *dst,
 					unsigned long *scanned, int order,
@@ -145,6 +170,10 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 	mem_cgroup_update_page_stat(page, idx, -1);
 }
 
+bool mem_cgroup_has_dirty_limit(void);
+void get_vm_dirty_param(struct vm_dirty_param *param);
+s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item);
+
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask);
 u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
@@ -326,6 +355,21 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
 {
 }
 
+static inline bool mem_cgroup_has_dirty_limit(void)
+{
+	return false;
+}
+
+static inline void get_vm_dirty_param(struct vm_dirty_param *param)
+{
+	get_global_vm_dirty_param(param);
+}
+
+static inline s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item)
+{
+	return -ENOSYS;
+}
+
 static inline
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 					    gfp_t gfp_mask)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f40839f..6ec2625 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -233,6 +233,10 @@ struct mem_cgroup {
 	atomic_t	refcnt;
 
 	unsigned int	swappiness;
+
+	/* control memory cgroup dirty pages */
+	struct vm_dirty_param dirty_param;
+
 	/* OOM-Killer disable */
 	int		oom_kill_disable;
 
@@ -1132,6 +1136,172 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
 	return swappiness;
 }
 
+/*
+ * Returns a snapshot of the current dirty limits which is not synchronized with
+ * the routines that change the dirty limits.  If this routine races with an
+ * update to the dirty bytes/ratio value, then the caller must handle the case
+ * where both dirty_[background_]_ratio and _bytes are set.
+ */
+static void __mem_cgroup_get_dirty_param(struct vm_dirty_param *param,
+					 struct mem_cgroup *mem)
+{
+	if (mem && !mem_cgroup_is_root(mem)) {
+		param->dirty_ratio = mem->dirty_param.dirty_ratio;
+		param->dirty_bytes = mem->dirty_param.dirty_bytes;
+		param->dirty_background_ratio =
+			mem->dirty_param.dirty_background_ratio;
+		param->dirty_background_bytes =
+			mem->dirty_param.dirty_background_bytes;
+	} else {
+		get_global_vm_dirty_param(param);
+	}
+}
+
+/*
+ * Get dirty memory parameters of the current memcg or global values (if memory
+ * cgroups are disabled or querying the root cgroup).
+ */
+void get_vm_dirty_param(struct vm_dirty_param *param)
+{
+	struct mem_cgroup *memcg;
+
+	if (mem_cgroup_disabled()) {
+		get_global_vm_dirty_param(param);
+		return;
+	}
+
+	/*
+	 * It's possible that "current" may be moved to other cgroup while we
+	 * access cgroup. But precise check is meaningless because the task can
+	 * be moved after our access and writeback tends to take long time.  At
+	 * least, "memcg" will not be freed under rcu_read_lock().
+	 */
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(current);
+	__mem_cgroup_get_dirty_param(param, memcg);
+	rcu_read_unlock();
+}
+
+/*
+ * Check if current memcg has local dirty limits.  Return true if the current
+ * memory cgroup has local dirty memory settings.
+ */
+bool mem_cgroup_has_dirty_limit(void)
+{
+	struct mem_cgroup *mem;
+
+	if (mem_cgroup_disabled())
+		return false;
+
+	mem = mem_cgroup_from_task(current);
+	return mem && !mem_cgroup_is_root(mem);
+}
+
+static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
+{
+	if (!do_swap_account)
+		return nr_swap_pages > 0;
+	return !memcg->memsw_is_minimum &&
+		(res_counter_read_u64(&memcg->memsw, RES_LIMIT) > 0);
+}
+
+static s64 mem_cgroup_get_local_page_stat(struct mem_cgroup *mem,
+				enum mem_cgroup_read_page_stat_item item)
+{
+	s64 ret;
+
+	switch (item) {
+	case MEMCG_NR_DIRTYABLE_PAGES:
+		ret = mem_cgroup_read_stat(mem, LRU_ACTIVE_FILE) +
+			mem_cgroup_read_stat(mem, LRU_INACTIVE_FILE);
+		if (mem_cgroup_can_swap(mem))
+			ret += mem_cgroup_read_stat(mem, LRU_ACTIVE_ANON) +
+				mem_cgroup_read_stat(mem, LRU_INACTIVE_ANON);
+		break;
+	case MEMCG_NR_RECLAIM_PAGES:
+		ret = mem_cgroup_read_stat(mem,	MEM_CGROUP_STAT_FILE_DIRTY) +
+			mem_cgroup_read_stat(mem,
+					     MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
+		break;
+	case MEMCG_NR_WRITEBACK:
+		ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_WRITEBACK);
+		break;
+	case MEMCG_NR_DIRTY_WRITEBACK_PAGES:
+		ret = mem_cgroup_read_stat(mem,
+					   MEM_CGROUP_STAT_FILE_WRITEBACK) +
+			mem_cgroup_read_stat(mem,
+					     MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
+		break;
+	default:
+		BUG();
+		break;
+	}
+	return ret;
+}
+
+static unsigned long long
+memcg_get_hierarchical_free_pages(struct mem_cgroup *mem)
+{
+	struct cgroup *cgroup;
+	unsigned long long min_free, free;
+
+	min_free = res_counter_read_u64(&mem->res, RES_LIMIT) -
+		res_counter_read_u64(&mem->res, RES_USAGE);
+	cgroup = mem->css.cgroup;
+	if (!mem->use_hierarchy)
+		goto out;
+
+	while (cgroup->parent) {
+		cgroup = cgroup->parent;
+		mem = mem_cgroup_from_cont(cgroup);
+		if (!mem->use_hierarchy)
+			break;
+		free = res_counter_read_u64(&mem->res, RES_LIMIT) -
+			res_counter_read_u64(&mem->res, RES_USAGE);
+		min_free = min(min_free, free);
+	}
+out:
+	/* Translate free memory in pages */
+	return min_free >> PAGE_SHIFT;
+}
+
+/*
+ * mem_cgroup_page_stat() - get memory cgroup file cache statistics
+ * @item:      memory statistic item exported to the kernel
+ *
+ * Return the accounted statistic value.
+ */
+s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item)
+{
+	struct mem_cgroup *mem;
+	struct mem_cgroup *iter;
+	s64 value;
+
+	rcu_read_lock();
+	mem = mem_cgroup_from_task(current);
+	if (mem && !mem_cgroup_is_root(mem)) {
+		/*
+		 * If we're looking for dirtyable pages we need to evaluate
+		 * free pages depending on the limit and usage of the parents
+		 * first of all.
+		 */
+		if (item == MEMCG_NR_DIRTYABLE_PAGES)
+			value = memcg_get_hierarchical_free_pages(mem);
+		else
+			value = 0;
+		/*
+		 * Recursively evaluate page statistics against all cgroup
+		 * under hierarchy tree
+		 */
+		for_each_mem_cgroup_tree(iter, mem)
+			value += mem_cgroup_get_local_page_stat(iter, item);
+	} else
+		value = -EINVAL;
+	rcu_read_unlock();
+
+	return value;
+}
+
 static void mem_cgroup_start_move(struct mem_cgroup *mem)
 {
 	int cpu;
@@ -4444,8 +4614,16 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	spin_lock_init(&mem->reclaim_param_lock);
 	INIT_LIST_HEAD(&mem->oom_notify);
 
-	if (parent)
+	if (parent) {
 		mem->swappiness = get_swappiness(parent);
+		__mem_cgroup_get_dirty_param(&mem->dirty_param, parent);
+	} else {
+		/*
+		 * The root cgroup dirty_param field is not used, instead,
+		 * system-wide dirty limits are used.
+		 */
+	}
+
 	atomic_set(&mem->refcnt, 1);
 	mem->move_charge_at_immigrate = 0;
 	mutex_init(&mem->thresholds_lock);
-- 
1.7.1



* [PATCH 08/10] memcg: add cgroupfs interface to memcg dirty limits
  2010-10-04  6:57 [PATCH 00/10] memcg: per cgroup dirty page accounting Greg Thelen
                   ` (6 preceding siblings ...)
  2010-10-04  6:58 ` [PATCH 07/10] memcg: add dirty limits to mem_cgroup Greg Thelen
@ 2010-10-04  6:58 ` Greg Thelen
  2010-10-05  7:13   ` KAMEZAWA Hiroyuki
                     ` (2 more replies)
  2010-10-04  6:58 ` [PATCH 09/10] writeback: make determine_dirtyable_memory() static Greg Thelen
                   ` (6 subsequent siblings)
  14 siblings, 3 replies; 96+ messages in thread
From: Greg Thelen @ 2010-10-04  6:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Greg Thelen

Add cgroupfs interface to memcg dirty page limits:
  Direct write-out is controlled with:
  - memory.dirty_ratio
  - memory.dirty_bytes

  Background write-out is controlled with:
  - memory.dirty_background_ratio
  - memory.dirty_background_bytes
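
A minimal userspace sketch of driving these files; the mount point
(/cgroups/memory), the group name ("foo"), and the 10% value are illustrative
assumptions, not part of this patch:

    /* Sketch: set a 10% dirty ratio for one memcg from userspace. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            const char *path = "/cgroups/memory/foo/memory.dirty_ratio";
            const char *val = "10\n";
            int fd = open(path, O_WRONLY);

            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            if (write(fd, val, strlen(val)) < 0)
                    perror("write");
            close(fd);
            return 0;
    }

Setting memory.dirty_ratio clears memory.dirty_bytes (and vice versa), as the
write handler below shows, so only one of the pair is in effect at a time.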

Signed-off-by: Andrea Righi <arighi@develer.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
---
 mm/memcontrol.c |   89 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 89 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6ec2625..2d45a0a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -100,6 +100,13 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_NSTATS,
 };
 
+enum {
+	MEM_CGROUP_DIRTY_RATIO,
+	MEM_CGROUP_DIRTY_BYTES,
+	MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
+	MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
+};
+
 struct mem_cgroup_stat_cpu {
 	s64 count[MEM_CGROUP_STAT_NSTATS];
 };
@@ -4292,6 +4299,64 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
 	return 0;
 }
 
+static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+	bool root;
+
+	root = mem_cgroup_is_root(mem);
+
+	switch (cft->private) {
+	case MEM_CGROUP_DIRTY_RATIO:
+		return root ? vm_dirty_ratio : mem->dirty_param.dirty_ratio;
+	case MEM_CGROUP_DIRTY_BYTES:
+		return root ? vm_dirty_bytes : mem->dirty_param.dirty_bytes;
+	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
+		return root ? dirty_background_ratio :
+			mem->dirty_param.dirty_background_ratio;
+	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
+		return root ? dirty_background_bytes :
+			mem->dirty_param.dirty_background_bytes;
+	default:
+		BUG();
+	}
+}
+
+static int
+mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	int type = cft->private;
+
+	if (cgrp->parent == NULL)
+		return -EINVAL;
+	if ((type == MEM_CGROUP_DIRTY_RATIO ||
+	     type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)
+		return -EINVAL;
+	switch (type) {
+	case MEM_CGROUP_DIRTY_RATIO:
+		memcg->dirty_param.dirty_ratio = val;
+		memcg->dirty_param.dirty_bytes = 0;
+		break;
+	case MEM_CGROUP_DIRTY_BYTES:
+		memcg->dirty_param.dirty_bytes = val;
+		memcg->dirty_param.dirty_ratio  = 0;
+		break;
+	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
+		memcg->dirty_param.dirty_background_ratio = val;
+		memcg->dirty_param.dirty_background_bytes = 0;
+		break;
+	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
+		memcg->dirty_param.dirty_background_bytes = val;
+		memcg->dirty_param.dirty_background_ratio = 0;
+		break;
+	default:
+		BUG();
+		break;
+	}
+	return 0;
+}
+
 static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
@@ -4355,6 +4420,30 @@ static struct cftype mem_cgroup_files[] = {
 		.unregister_event = mem_cgroup_oom_unregister_event,
 		.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
 	},
+	{
+		.name = "dirty_ratio",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_u64 = mem_cgroup_dirty_write,
+		.private = MEM_CGROUP_DIRTY_RATIO,
+	},
+	{
+		.name = "dirty_bytes",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_u64 = mem_cgroup_dirty_write,
+		.private = MEM_CGROUP_DIRTY_BYTES,
+	},
+	{
+		.name = "dirty_background_ratio",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_u64 = mem_cgroup_dirty_write,
+		.private = MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
+	},
+	{
+		.name = "dirty_background_bytes",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_u64 = mem_cgroup_dirty_write,
+		.private = MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
+	},
 };
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
-- 
1.7.1



* [PATCH 09/10] writeback: make determine_dirtyable_memory() static.
  2010-10-04  6:57 [PATCH 00/10] memcg: per cgroup dirty page accounting Greg Thelen
                   ` (7 preceding siblings ...)
  2010-10-04  6:58 ` [PATCH 08/10] memcg: add cgroupfs interface to memcg dirty limits Greg Thelen
@ 2010-10-04  6:58 ` Greg Thelen
  2010-10-05  7:15   ` KAMEZAWA Hiroyuki
  2010-10-04  6:58 ` [PATCH 10/10] memcg: check memcg dirty limits in page writeback Greg Thelen
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 96+ messages in thread
From: Greg Thelen @ 2010-10-04  6:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Greg Thelen

The determine_dirtyable_memory() function is not used outside of
page writeback, so make the routine static.  No functional change;
this is just a cleanup in preparation for a change that makes
global_dirty_limits() consider memcg dirty limits.

Signed-off-by: Andrea Righi <arighi@develer.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
---
 include/linux/writeback.h |    2 -
 mm/page-writeback.c       |  122 ++++++++++++++++++++++----------------------
 2 files changed, 61 insertions(+), 63 deletions(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 72a5d64..9eacdca 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -105,8 +105,6 @@ extern int vm_highmem_is_dirtyable;
 extern int block_dump;
 extern int laptop_mode;
 
-extern unsigned long determine_dirtyable_memory(void);
-
 extern int dirty_background_ratio_handler(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp,
 		loff_t *ppos);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 820eb66..a0bb3e2 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -132,6 +132,67 @@ static struct prop_descriptor vm_completions;
 static struct prop_descriptor vm_dirties;
 
 /*
+ * Work out the current dirty-memory clamping and background writeout
+ * thresholds.
+ *
+ * The main aim here is to lower them aggressively if there is a lot of mapped
+ * memory around.  To avoid stressing page reclaim with lots of unreclaimable
+ * pages.  It is better to clamp down on writers than to start swapping, and
+ * performing lots of scanning.
+ *
+ * We only allow 1/2 of the currently-unmapped memory to be dirtied.
+ *
+ * We don't permit the clamping level to fall below 5% - that is getting rather
+ * excessive.
+ *
+ * We make sure that the background writeout level is below the adjusted
+ * clamping level.
+ */
+
+static unsigned long highmem_dirtyable_memory(unsigned long total)
+{
+#ifdef CONFIG_HIGHMEM
+	int node;
+	unsigned long x = 0;
+
+	for_each_node_state(node, N_HIGH_MEMORY) {
+		struct zone *z =
+			&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
+
+		x += zone_page_state(z, NR_FREE_PAGES) +
+		     zone_reclaimable_pages(z);
+	}
+	/*
+	 * Make sure that the number of highmem pages is never larger
+	 * than the number of the total dirtyable memory. This can only
+	 * occur in very strange VM situations but we want to make sure
+	 * that this does not occur.
+	 */
+	return min(x, total);
+#else
+	return 0;
+#endif
+}
+
+/**
+ * determine_dirtyable_memory - amount of memory that may be used
+ *
+ * Returns the numebr of pages that can currently be freed and used
+ * by the kernel for direct mappings.
+ */
+static unsigned long determine_dirtyable_memory(void)
+{
+	unsigned long x;
+
+	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
+
+	if (!vm_highmem_is_dirtyable)
+		x -= highmem_dirtyable_memory(x);
+
+	return x + 1;	/* Ensure that we never return 0 */
+}
+
+/*
  * couple the period to the dirty_ratio:
  *
  *   period/2 ~ roundup_pow_of_two(dirty limit)
@@ -337,67 +398,6 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned max_ratio)
 EXPORT_SYMBOL(bdi_set_max_ratio);
 
 /*
- * Work out the current dirty-memory clamping and background writeout
- * thresholds.
- *
- * The main aim here is to lower them aggressively if there is a lot of mapped
- * memory around.  To avoid stressing page reclaim with lots of unreclaimable
- * pages.  It is better to clamp down on writers than to start swapping, and
- * performing lots of scanning.
- *
- * We only allow 1/2 of the currently-unmapped memory to be dirtied.
- *
- * We don't permit the clamping level to fall below 5% - that is getting rather
- * excessive.
- *
- * We make sure that the background writeout level is below the adjusted
- * clamping level.
- */
-
-static unsigned long highmem_dirtyable_memory(unsigned long total)
-{
-#ifdef CONFIG_HIGHMEM
-	int node;
-	unsigned long x = 0;
-
-	for_each_node_state(node, N_HIGH_MEMORY) {
-		struct zone *z =
-			&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
-
-		x += zone_page_state(z, NR_FREE_PAGES) +
-		     zone_reclaimable_pages(z);
-	}
-	/*
-	 * Make sure that the number of highmem pages is never larger
-	 * than the number of the total dirtyable memory. This can only
-	 * occur in very strange VM situations but we want to make sure
-	 * that this does not occur.
-	 */
-	return min(x, total);
-#else
-	return 0;
-#endif
-}
-
-/**
- * determine_dirtyable_memory - amount of memory that may be used
- *
- * Returns the numebr of pages that can currently be freed and used
- * by the kernel for direct mappings.
- */
-unsigned long determine_dirtyable_memory(void)
-{
-	unsigned long x;
-
-	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();
-
-	if (!vm_highmem_is_dirtyable)
-		x -= highmem_dirtyable_memory(x);
-
-	return x + 1;	/* Ensure that we never return 0 */
-}
-
-/*
  * global_dirty_limits - background-writeback and dirty-throttling thresholds
  *
  * Calculate the dirty thresholds based on sysctl parameters
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH 10/10] memcg: check memcg dirty limits in page writeback
  2010-10-04  6:57 [PATCH 00/10] memcg: per cgroup dirty page accounting Greg Thelen
                   ` (8 preceding siblings ...)
  2010-10-04  6:58 ` [PATCH 09/10] writeback: make determine_dirtyable_memory() static Greg Thelen
@ 2010-10-04  6:58 ` Greg Thelen
  2010-10-05  7:29   ` KAMEZAWA Hiroyuki
  2010-10-06  0:32   ` Minchan Kim
  2010-10-05  4:20 ` [PATCH 00/10] memcg: per cgroup dirty page accounting Balbir Singh
                   ` (4 subsequent siblings)
  14 siblings, 2 replies; 96+ messages in thread
From: Greg Thelen @ 2010-10-04  6:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura, Greg Thelen

If the current process is in a non-root memcg, then
global_dirty_limits() will consider the memcg dirty limit.
This allows different cgroups to have distinct dirty limits
which trigger direct and background writeback at different
levels.
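
In rough outline, the selection works like the sketch below.  This is a
simplified illustration, not part of the patch: get_vm_dirty_param() and
struct vm_dirty_param come from earlier patches in this series, while the
wrapper function itself is hypothetical.

static unsigned long effective_dirty_limit(unsigned long available_memory)
{
	struct vm_dirty_param p;

	/* root memcg: filled from vm_dirty_*; otherwise from the memcg */
	get_vm_dirty_param(&p);

	if (p.dirty_bytes)
		return DIV_ROUND_UP(p.dirty_bytes, PAGE_SIZE);
	return (p.dirty_ratio * available_memory) / 100;
}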

Signed-off-by: Andrea Righi <arighi@develer.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
---
 mm/page-writeback.c |   87 ++++++++++++++++++++++++++++++++++++++++++---------
 1 files changed, 72 insertions(+), 15 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index a0bb3e2..c1db336 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -180,7 +180,7 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
  * Returns the numebr of pages that can currently be freed and used
  * by the kernel for direct mappings.
  */
-static unsigned long determine_dirtyable_memory(void)
+static unsigned long get_global_dirtyable_memory(void)
 {
 	unsigned long x;
 
@@ -192,6 +192,58 @@ static unsigned long determine_dirtyable_memory(void)
 	return x + 1;	/* Ensure that we never return 0 */
 }
 
+static unsigned long get_dirtyable_memory(void)
+{
+	unsigned long memory;
+	s64 memcg_memory;
+
+	memory = get_global_dirtyable_memory();
+	if (!mem_cgroup_has_dirty_limit())
+		return memory;
+	memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
+	BUG_ON(memcg_memory < 0);
+
+	return min((unsigned long)memcg_memory, memory);
+}
+
+static long get_reclaimable_pages(void)
+{
+	s64 ret;
+
+	if (!mem_cgroup_has_dirty_limit())
+		return global_page_state(NR_FILE_DIRTY) +
+			global_page_state(NR_UNSTABLE_NFS);
+	ret = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
+	BUG_ON(ret < 0);
+
+	return ret;
+}
+
+static long get_writeback_pages(void)
+{
+	s64 ret;
+
+	if (!mem_cgroup_has_dirty_limit())
+		return global_page_state(NR_WRITEBACK);
+	ret = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
+	BUG_ON(ret < 0);
+
+	return ret;
+}
+
+static unsigned long get_dirty_writeback_pages(void)
+{
+	s64 ret;
+
+	if (!mem_cgroup_has_dirty_limit())
+		return global_page_state(NR_UNSTABLE_NFS) +
+			global_page_state(NR_WRITEBACK);
+	ret = mem_cgroup_page_stat(MEMCG_NR_DIRTY_WRITEBACK_PAGES);
+	BUG_ON(ret < 0);
+
+	return ret;
+}
+
 /*
  * couple the period to the dirty_ratio:
  *
@@ -204,7 +256,7 @@ static int calc_period_shift(void)
 	if (vm_dirty_bytes)
 		dirty_total = vm_dirty_bytes / PAGE_SIZE;
 	else
-		dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
+		dirty_total = (vm_dirty_ratio * get_global_dirtyable_memory()) /
 				100;
 	return 2 + ilog2(dirty_total - 1);
 }
@@ -410,18 +462,23 @@ void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
 {
 	unsigned long background;
 	unsigned long dirty;
-	unsigned long available_memory = determine_dirtyable_memory();
+	unsigned long available_memory = get_dirtyable_memory();
 	struct task_struct *tsk;
+	struct vm_dirty_param dirty_param;
 
-	if (vm_dirty_bytes)
-		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
+	get_vm_dirty_param(&dirty_param);
+
+	if (dirty_param.dirty_bytes)
+		dirty = DIV_ROUND_UP(dirty_param.dirty_bytes, PAGE_SIZE);
 	else
-		dirty = (vm_dirty_ratio * available_memory) / 100;
+		dirty = (dirty_param.dirty_ratio * available_memory) / 100;
 
-	if (dirty_background_bytes)
-		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
+	if (dirty_param.dirty_background_bytes)
+		background = DIV_ROUND_UP(dirty_param.dirty_background_bytes,
+					  PAGE_SIZE);
 	else
-		background = (dirty_background_ratio * available_memory) / 100;
+		background = (dirty_param.dirty_background_ratio *
+			      available_memory) / 100;
 
 	if (background >= dirty)
 		background = dirty / 2;
@@ -493,9 +550,8 @@ static void balance_dirty_pages(struct address_space *mapping,
 			.range_cyclic	= 1,
 		};
 
-		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
-					global_page_state(NR_UNSTABLE_NFS);
-		nr_writeback = global_page_state(NR_WRITEBACK);
+		nr_reclaimable = get_reclaimable_pages();
+		nr_writeback = get_writeback_pages();
 
 		global_dirty_limits(&background_thresh, &dirty_thresh);
 
@@ -652,6 +708,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
 {
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
+	unsigned long dirty;
 
         for ( ; ; ) {
 		global_dirty_limits(&background_thresh, &dirty_thresh);
@@ -662,9 +719,9 @@ void throttle_vm_writeout(gfp_t gfp_mask)
                  */
                 dirty_thresh += dirty_thresh / 10;      /* wheeee... */
 
-                if (global_page_state(NR_UNSTABLE_NFS) +
-			global_page_state(NR_WRITEBACK) <= dirty_thresh)
-                        	break;
+		dirty = get_dirty_writeback_pages();
+		if (dirty <= dirty_thresh)
+			break;
                 congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/*
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* Re: [PATCH 03/10] memcg: create extensible page stat update routines
  2010-10-04  6:57 ` [PATCH 03/10] memcg: create extensible page stat update routines Greg Thelen
@ 2010-10-04 13:48   ` Ciju Rajan K
  2010-10-04 15:43     ` Greg Thelen
  2010-10-05  6:51   ` KAMEZAWA Hiroyuki
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 96+ messages in thread
From: Ciju Rajan K @ 2010-10-04 13:48 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura, Ciju Rajan K

Greg Thelen wrote:
> Replace usage of the mem_cgroup_update_file_mapped() memcg
> statistic update routine with two new routines:
> * mem_cgroup_inc_page_stat()
> * mem_cgroup_dec_page_stat()
>
> As before, only the file_mapped statistic is managed.  However,
> these more general interfaces allow for new statistics to be
> more easily added.  New statistics are added with memcg dirty
> page accounting.
>
>
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 512cb12..f4259f4 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1592,7 +1592,9 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
>   * possibility of race condition. If there is, we take a lock.
>   */
>
>   
> -static void mem_cgroup_update_file_stat(struct page *page, int idx, int val)
>   
Not seeing this function in mmotm 28/09. So not able to apply this patch.
Am I missing anything?
> +void mem_cgroup_update_page_stat(struct page *page,
> +				 enum mem_cgroup_write_page_stat_item idx,
> +				 int val)
>  {
>  	struct mem_cgroup *mem;
>
>   


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 03/10] memcg: create extensible page stat update routines
  2010-10-04 13:48   ` Ciju Rajan K
@ 2010-10-04 15:43     ` Greg Thelen
  2010-10-04 17:35       ` Ciju Rajan K
  0 siblings, 1 reply; 96+ messages in thread
From: Greg Thelen @ 2010-10-04 15:43 UTC (permalink / raw)
  To: Ciju Rajan K
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura

Ciju Rajan K <ciju@linux.vnet.ibm.com> writes:

> Greg Thelen wrote:
>> Replace usage of the mem_cgroup_update_file_mapped() memcg
>> statistic update routine with two new routines:
>> * mem_cgroup_inc_page_stat()
>> * mem_cgroup_dec_page_stat()
>>
>> As before, only the file_mapped statistic is managed.  However,
>> these more general interfaces allow for new statistics to be
>> more easily added.  New statistics are added with memcg dirty
>> page accounting.
>>
>>
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 512cb12..f4259f4 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -1592,7 +1592,9 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
>>   * possibility of race condition. If there is, we take a lock.
>>   */
>>
>>   -static void mem_cgroup_update_file_stat(struct page *page, int idx, int
>> val)
>>   
> Not seeing this function in mmotm 28/09. So not able to apply this patch.
> Am I missing anything?

How are you getting mmotm?

I see the mem_cgroup_update_file_stat() routine added in mmotm
(stamp-2010-09-28-16-13) using patch file:
  http://userweb.kernel.org/~akpm/mmotm/broken-out/memcg-generic-filestat-update-interface.patch

  Author: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
  Date:   Tue Sep 28 21:48:19 2010 -0700
  
      This patch extracts the core logic from mem_cgroup_update_file_mapped() as
      mem_cgroup_update_file_stat() and adds a wrapper.
  
      As a planned future update, memory cgroup has to count dirty pages to
      implement dirty_ratio/limit.  And more, the number of dirty pages is
      required to kick flusher thread to start writeback.  (Now, no kick.)
  
      This patch is preparation for it and makes other statistics implementation
      clearer.  Just a clean up.
  
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Reviewed-by: Greg Thelen <gthelen@google.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

If you are using the zen mmotm repository,
git://zen-kernel.org/kernel/mmotm.git, the commit id of
memcg-generic-filestat-update-interface.patch is
616960dc0cb0172a5e5adc9e2b83e668e1255b50.

>> +void mem_cgroup_update_page_stat(struct page *page,
>> +				 enum mem_cgroup_write_page_stat_item idx,
>> +				 int val)
>>  {
>>  	struct mem_cgroup *mem;
>>
>>   

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 03/10] memcg: create extensible page stat update routines
  2010-10-04 15:43     ` Greg Thelen
@ 2010-10-04 17:35       ` Ciju Rajan K
  0 siblings, 0 replies; 96+ messages in thread
From: Ciju Rajan K @ 2010-10-04 17:35 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura

Greg Thelen wrote:
> Ciju Rajan K <ciju@linux.vnet.ibm.com> writes:
>
>   
>> Greg Thelen wrote:
>>     
>>> Replace usage of the mem_cgroup_update_file_mapped() memcg
>>> statistic update routine with two new routines:
>>> * mem_cgroup_inc_page_stat()
>>> * mem_cgroup_dec_page_stat()
>>>
>>> As before, only the file_mapped statistic is managed.  However,
>>> these more general interfaces allow for new statistics to be
>>> more easily added.  New statistics are added with memcg dirty
>>> page accounting.
>>>
>>>
>>>
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index 512cb12..f4259f4 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -1592,7 +1592,9 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
>>>   * possibility of race condition. If there is, we take a lock.
>>>   */
>>>
>>>   -static void mem_cgroup_update_file_stat(struct page *page, int idx, int
>>> val)
>>>   
>>>       
>> Not seeing this function in mmotm 28/09. So not able to apply this patch.
>> Am I missing anything?
>>     
>
> How are you getting mmotm?
>
> I see the mem_cgroup_update_file_stat() routine added in mmotm
> (stamp-2010-09-28-16-13) using patch file:
>   http://userweb.kernel.org/~akpm/mmotm/broken-out/memcg-generic-filestat-update-interface.patch
>   
Sorry for the noise, Greg. It was a mistake at my end. Corrected now.
Thanks!
>   Author: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>   Date:   Tue Sep 28 21:48:19 2010 -0700
>   
>       This patch extracts the core logic from mem_cgroup_update_file_mapped() as
>       mem_cgroup_update_file_stat() and adds a wrapper.
>   
>       As a planned future update, memory cgroup has to count dirty pages to
>       implement dirty_ratio/limit.  And more, the number of dirty pages is
>       required to kick flusher thread to start writeback.  (Now, no kick.)
>   
>       This patch is preparation for it and makes other statistics implementation
>       clearer.  Just a clean up.
>   
>       Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>       Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
>       Reviewed-by: Greg Thelen <gthelen@google.com>
>       Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
>       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>
> If you are using the zen mmotm repository,
> git://zen-kernel.org/kernel/mmotm.git, the commit id of
> memcg-generic-filestat-update-interface.patch is
> 616960dc0cb0172a5e5adc9e2b83e668e1255b50.
>
>   
>>> +void mem_cgroup_update_page_stat(struct page *page,
>>> +				 enum mem_cgroup_write_page_stat_item idx,
>>> +				 int val)
>>>  {
>>>  	struct mem_cgroup *mem;
>>>
>>>   
>>>       


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 00/10] memcg: per cgroup dirty page accounting
  2010-10-04  6:57 [PATCH 00/10] memcg: per cgroup dirty page accounting Greg Thelen
                   ` (9 preceding siblings ...)
  2010-10-04  6:58 ` [PATCH 10/10] memcg: check memcg dirty limits in page writeback Greg Thelen
@ 2010-10-05  4:20 ` Balbir Singh
  2010-10-05  4:50 ` Balbir Singh
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 96+ messages in thread
From: Balbir Singh @ 2010-10-05  4:20 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	KAMEZAWA Hiroyuki, Daisuke Nishimura

* Greg Thelen <gthelen@google.com> [2010-10-03 23:57:55]:

> This patch set provides the ability for each cgroup to have independent dirty
> page limits.
> 
> Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim)
> page cache used by a cgroup.  So, in case of multiple cgroup writers, they will
> not be able to consume more than their designated share of dirty pages and will
> be forced to perform write-out if they cross that limit.
> 
> These patches were developed and tested on mmotm 2010-09-28-16-13.  The patches
> are based on a series proposed by Andrea Righi in Mar 2010.
> 
> Overview:
> - Add page_cgroup flags to record when pages are dirty, in writeback, or nfs
>   unstable.
> - Extend mem_cgroup to record the total number of pages in each of the 
>   interesting dirty states (dirty, writeback, unstable_nfs).  
> - Add dirty parameters similar to the system-wide  /proc/sys/vm/dirty_*
>   limits to mem_cgroup.  The mem_cgroup dirty parameters are accessible
>   via cgroupfs control files.
> - Consider both system and per-memcg dirty limits in page writeback when
>   deciding to queue background writeback or block for foreground writeback.
> 
> Known shortcomings:
> - When a cgroup dirty limit is exceeded, then bdi writeback is employed to
>   writeback dirty inodes.  Bdi writeback considers inodes from any cgroup, not
>   just inodes contributing dirty pages to the cgroup exceeding its limit.  

I suspect this means that we'll need a bdi controller in the I/O
controller spectrum or make writeback cgroup aware.

> 
> Performance measurements:
> - kernel builds are unaffected unless run with a small dirty limit.
> - all data collected with CONFIG_CGROUP_MEM_RES_CTLR=y.
> - dd has three data points (in secs) for three data sizes (100M, 200M, and 1G).  
>   As expected, dd slows when it exceed its cgroup dirty limit.
> 
>                kernel_build          dd
> mmotm             2:37        0.18, 0.38, 1.65
>   root_memcg
> 
> mmotm             2:37        0.18, 0.35, 1.66
>   non-root_memcg
> 
> mmotm+patches     2:37        0.18, 0.35, 1.68
>   root_memcg
> 
> mmotm+patches     2:37        0.19, 0.35, 1.69
>   non-root_memcg
> 
> mmotm+patches     2:37        0.19, 2.34, 22.82
>   non-root_memcg
>   150 MiB memcg dirty limit
> 
> mmotm+patches     3:58        1.71, 3.38, 17.33
>   non-root_memcg
>   1 MiB memcg dirty limit
> 
> Greg Thelen (10):
>   memcg: add page_cgroup flags for dirty page tracking
>   memcg: document cgroup dirty memory interfaces
>   memcg: create extensible page stat update routines
>   memcg: disable local interrupts in lock_page_cgroup()
>   memcg: add dirty page accounting infrastructure
>   memcg: add kernel calls for memcg dirty page stats
>   memcg: add dirty limits to mem_cgroup
>   memcg: add cgroupfs interface to memcg dirty limits
>   writeback: make determine_dirtyable_memory() static.
>   memcg: check memcg dirty limits in page writeback
> 
>  Documentation/cgroups/memory.txt |   37 ++++
>  fs/nfs/write.c                   |    4 +
>  include/linux/memcontrol.h       |   78 +++++++-
>  include/linux/page_cgroup.h      |   31 +++-
>  include/linux/writeback.h        |    2 -
>  mm/filemap.c                     |    1 +
>  mm/memcontrol.c                  |  426 ++++++++++++++++++++++++++++++++++----
>  mm/page-writeback.c              |  211 ++++++++++++-------
>  mm/rmap.c                        |    4 +-
>  mm/truncate.c                    |    1 +
>  10 files changed, 672 insertions(+), 123 deletions(-)
> 
> 

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 00/10] memcg: per cgroup dirty page accounting
  2010-10-04  6:57 [PATCH 00/10] memcg: per cgroup dirty page accounting Greg Thelen
                   ` (10 preceding siblings ...)
  2010-10-05  4:20 ` [PATCH 00/10] memcg: per cgroup dirty page accounting Balbir Singh
@ 2010-10-05  4:50 ` Balbir Singh
  2010-10-05  5:50   ` Greg Thelen
  2010-10-05 22:15 ` Andrea Righi
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 96+ messages in thread
From: Balbir Singh @ 2010-10-05  4:50 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	KAMEZAWA Hiroyuki, Daisuke Nishimura

* Greg Thelen <gthelen@google.com> [2010-10-03 23:57:55]:

> This patch set provides the ability for each cgroup to have independent dirty
> page limits.
> 
> Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim)
> page cache used by a cgroup.  So, in case of multiple cgroup writers, they will
> not be able to consume more than their designated share of dirty pages and will
> be forced to perform write-out if they cross that limit.
> 
> These patches were developed and tested on mmotm 2010-09-28-16-13.  The patches
> are based on a series proposed by Andrea Righi in Mar 2010.

Hi, Greg,

I see a problem with "    memcg: add dirty page accounting infrastructure".

The reject is

 enum mem_cgroup_write_page_stat_item {
        MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
+       MEMCG_NR_FILE_DIRTY, /* # of dirty pages in page cache */
+       MEMCG_NR_FILE_WRITEBACK, /* # of pages under writeback */
+       MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
 };

I don't see mem_cgroup_write_page_stat_item in memcontrol.h. Is this
based on top of Kame's cleanup?

I am working off of mmotm 28 sept 2010 16:13.


-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 00/10] memcg: per cgroup dirty page accounting
  2010-10-05  4:50 ` Balbir Singh
@ 2010-10-05  5:50   ` Greg Thelen
  2010-10-05  8:37     ` Ciju Rajan K
  0 siblings, 1 reply; 96+ messages in thread
From: Greg Thelen @ 2010-10-05  5:50 UTC (permalink / raw)
  To: balbir
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	KAMEZAWA Hiroyuki, Daisuke Nishimura

Balbir Singh <balbir@linux.vnet.ibm.com> writes:
>
> * Greg Thelen <gthelen@google.com> [2010-10-03 23:57:55]:
>
>> This patch set provides the ability for each cgroup to have independent dirty
>> page limits.
>> 
>> Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim)
>> page cache used by a cgroup.  So, in case of multiple cgroup writers, they will
>> not be able to consume more than their designated share of dirty pages and will
>> be forced to perform write-out if they cross that limit.
>> 
>> These patches were developed and tested on mmotm 2010-09-28-16-13.  The patches
>> are based on a series proposed by Andrea Righi in Mar 2010.
>
> Hi, Greg,
>
> I see a problem with "    memcg: add dirty page accounting infrastructure".
>
> The reject is
>
>  enum mem_cgroup_write_page_stat_item {
>         MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
> +       MEMCG_NR_FILE_DIRTY, /* # of dirty pages in page cache */
> +       MEMCG_NR_FILE_WRITEBACK, /* # of pages under writeback */
> +       MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
>  };
>
> I don't see mem_cgroup_write_page_stat_item in memcontrol.h. Is this
> based on top of Kame's cleanup.
>
> I am working off of mmotm 28 sept 2010 16:13.

Balbir,

All of the 10 memcg dirty limits patches should apply directly to mmotm
28 sept 2010 16:13 without any other patches.  Any of Kame's cleanup
patches that are not in mmotm are not needed by this memcg dirty limit
series.

The patch you refer to, "[PATCH 05/10] memcg: add dirty page accounting
infrastructure" depends on a change from an earlier patch in the series.
Specifically, "[PATCH 03/10] memcg: create extensible page stat update
routines" contains the addition of mem_cgroup_write_page_stat_item:

--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -25,6 +25,11 @@ struct page_cgroup;
 struct page;
 struct mm_struct;
 
+/* Stats that can be updated by kernel. */
+enum mem_cgroup_write_page_stat_item {
+     MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
+};
+

Do you have trouble applying patch 5 after applying patches 1-4?

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 01/10] memcg: add page_cgroup flags for dirty page tracking
  2010-10-04  6:57 ` [PATCH 01/10] memcg: add page_cgroup flags for dirty page tracking Greg Thelen
@ 2010-10-05  6:20   ` KAMEZAWA Hiroyuki
  2010-10-06  0:37   ` Daisuke Nishimura
  2010-10-06 11:07   ` Balbir Singh
  2 siblings, 0 replies; 96+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-05  6:20 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura

On Sun,  3 Oct 2010 23:57:56 -0700
Greg Thelen <gthelen@google.com> wrote:

> Add additional flags to page_cgroup to track dirty pages
> within a mem_cgroup.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: Andrea Righi <arighi@develer.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>

Ack...oh, but it seems I've signed. Thanks.
-Kame

> ---
>  include/linux/page_cgroup.h |   23 +++++++++++++++++++++++
>  1 files changed, 23 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index 5bb13b3..b59c298 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -40,6 +40,9 @@ enum {
>  	PCG_USED, /* this object is in use. */
>  	PCG_ACCT_LRU, /* page has been accounted for */
>  	PCG_FILE_MAPPED, /* page is accounted as "mapped" */
> +	PCG_FILE_DIRTY, /* page is dirty */
> +	PCG_FILE_WRITEBACK, /* page is under writeback */
> +	PCG_FILE_UNSTABLE_NFS, /* page is NFS unstable */
>  	PCG_MIGRATION, /* under page migration */
>  };
>  
> @@ -59,6 +62,10 @@ static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
>  static inline int TestClearPageCgroup##uname(struct page_cgroup *pc)	\
>  	{ return test_and_clear_bit(PCG_##lname, &pc->flags);  }
>  
> +#define TESTSETPCGFLAG(uname, lname)			\
> +static inline int TestSetPageCgroup##uname(struct page_cgroup *pc)	\
> +	{ return test_and_set_bit(PCG_##lname, &pc->flags);  }
> +
>  TESTPCGFLAG(Locked, LOCK)
>  
>  /* Cache flag is set only once (at allocation) */
> @@ -80,6 +87,22 @@ SETPCGFLAG(FileMapped, FILE_MAPPED)
>  CLEARPCGFLAG(FileMapped, FILE_MAPPED)
>  TESTPCGFLAG(FileMapped, FILE_MAPPED)
>  
> +SETPCGFLAG(FileDirty, FILE_DIRTY)
> +CLEARPCGFLAG(FileDirty, FILE_DIRTY)
> +TESTPCGFLAG(FileDirty, FILE_DIRTY)
> +TESTCLEARPCGFLAG(FileDirty, FILE_DIRTY)
> +TESTSETPCGFLAG(FileDirty, FILE_DIRTY)
> +
> +SETPCGFLAG(FileWriteback, FILE_WRITEBACK)
> +CLEARPCGFLAG(FileWriteback, FILE_WRITEBACK)
> +TESTPCGFLAG(FileWriteback, FILE_WRITEBACK)
> +
> +SETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
> +CLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
> +TESTPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
> +TESTCLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
> +TESTSETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
> +
>  SETPCGFLAG(Migration, MIGRATION)
>  CLEARPCGFLAG(Migration, MIGRATION)
>  TESTPCGFLAG(Migration, MIGRATION)
> -- 
> 1.7.1
> 
> 


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 02/10] memcg: document cgroup dirty memory interfaces
  2010-10-04  6:57 ` [PATCH 02/10] memcg: document cgroup dirty memory interfaces Greg Thelen
@ 2010-10-05  6:48   ` KAMEZAWA Hiroyuki
  2010-10-06  0:49   ` Daisuke Nishimura
  2010-10-06 11:12   ` Balbir Singh
  2 siblings, 0 replies; 96+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-05  6:48 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura

On Sun,  3 Oct 2010 23:57:57 -0700
Greg Thelen <gthelen@google.com> wrote:

> Document cgroup dirty memory interfaces and statistics.
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>

Nice.

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>





> ---
>  Documentation/cgroups/memory.txt |   37 +++++++++++++++++++++++++++++++++++++
>  1 files changed, 37 insertions(+), 0 deletions(-)
> 
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index 7781857..eab65e2 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -385,6 +385,10 @@ mapped_file	- # of bytes of mapped file (includes tmpfs/shmem)
>  pgpgin		- # of pages paged in (equivalent to # of charging events).
>  pgpgout		- # of pages paged out (equivalent to # of uncharging events).
>  swap		- # of bytes of swap usage
> +dirty		- # of bytes that are waiting to get written back to the disk.
> +writeback	- # of bytes that are actively being written back to the disk.
> +nfs		- # of bytes sent to the NFS server, but not yet committed to
> +		the actual storage.
>  inactive_anon	- # of bytes of anonymous memory and swap cache memory on
>  		LRU list.
>  active_anon	- # of bytes of anonymous and swap cache memory on active
> @@ -453,6 +457,39 @@ memory under it will be reclaimed.
>  You can reset failcnt by writing 0 to failcnt file.
>  # echo 0 > .../memory.failcnt
>  
> +5.5 dirty memory
> +
> +Control the maximum amount of dirty pages a cgroup can have at any given time.
> +
> +Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim)
> +page cache used by a cgroup.  So, in case of multiple cgroup writers, they will
> +not be able to consume more than their designated share of dirty pages and will
> +be forced to perform write-out if they cross that limit.
> +
> +The interface is equivalent to the procfs interface: /proc/sys/vm/dirty_*.  It
> +is possible to configure a limit to trigger both a direct writeback or a
> +background writeback performed by per-bdi flusher threads.  The root cgroup
> +memory.dirty_* control files are read-only and match the contents of
> +the /proc/sys/vm/dirty_* files.
> +
> +Per-cgroup dirty limits can be set using the following files in the cgroupfs:
> +
> +- memory.dirty_ratio: the amount of dirty memory (expressed as a percentage of
> +  cgroup memory) at which a process generating dirty pages will itself start
> +  writing out dirty data.
> +
> +- memory.dirty_bytes: the amount of dirty memory (expressed in bytes) in the
> +  cgroup at which a process generating dirty pages will start itself writing out
> +  dirty data.
> +
> +- memory.dirty_background_ratio: the amount of dirty memory of the cgroup
> +  (expressed as a percentage of cgroup memory) at which background writeback
> +  kernel threads will start writing out dirty data.
> +
> +- memory.dirty_background_bytes: the amount of dirty memory (expressed in bytes)
> +  in the cgroup at which background writeback kernel threads will start writing
> +  out dirty data.
> +
>  6. Hierarchy support
>  
>  The memory controller supports a deep hierarchy and hierarchical accounting.
> -- 
> 1.7.1
> 
> 


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 03/10] memcg: create extensible page stat update routines
  2010-10-04  6:57 ` [PATCH 03/10] memcg: create extensible page stat update routines Greg Thelen
  2010-10-04 13:48   ` Ciju Rajan K
@ 2010-10-05  6:51   ` KAMEZAWA Hiroyuki
  2010-10-05  7:10     ` Greg Thelen
  2010-10-05 15:42   ` Minchan Kim
  2010-10-06 16:19   ` Balbir Singh
  3 siblings, 1 reply; 96+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-05  6:51 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura

On Sun,  3 Oct 2010 23:57:58 -0700
Greg Thelen <gthelen@google.com> wrote:

> Replace usage of the mem_cgroup_update_file_mapped() memcg
> statistic update routine with two new routines:
> * mem_cgroup_inc_page_stat()
> * mem_cgroup_dec_page_stat()
> 
> As before, only the file_mapped statistic is managed.  However,
> these more general interfaces allow for new statistics to be
> more easily added.  New statistics are added with memcg dirty
> page accounting.
> 
> Signed-off-by: Greg Thelen <gthelen@google.com>
> Signed-off-by: Andrea Righi <arighi@develer.com>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

a nitpick. see below.

> ---
>  include/linux/memcontrol.h |   31 ++++++++++++++++++++++++++++---
>  mm/memcontrol.c            |   17 ++++++++---------
>  mm/rmap.c                  |    4 ++--
>  3 files changed, 38 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 159a076..7c7bec4 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -25,6 +25,11 @@ struct page_cgroup;
>  struct page;
>  struct mm_struct;
>  
> +/* Stats that can be updated by kernel. */
> +enum mem_cgroup_write_page_stat_item {
> +	MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
> +};
> +
>  extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
>  					struct list_head *dst,
>  					unsigned long *scanned, int order,
> @@ -121,7 +126,22 @@ static inline bool mem_cgroup_disabled(void)
>  	return false;
>  }
>  
> -void mem_cgroup_update_file_mapped(struct page *page, int val);
> +void mem_cgroup_update_page_stat(struct page *page,
> +				 enum mem_cgroup_write_page_stat_item idx,
> +				 int val);
> +
> +static inline void mem_cgroup_inc_page_stat(struct page *page,
> +				enum mem_cgroup_write_page_stat_item idx)
> +{
> +	mem_cgroup_update_page_stat(page, idx, 1);
> +}
> +
> +static inline void mem_cgroup_dec_page_stat(struct page *page,
> +				enum mem_cgroup_write_page_stat_item idx)
> +{
> +	mem_cgroup_update_page_stat(page, idx, -1);
> +}
> +
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  						gfp_t gfp_mask);
>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
> @@ -293,8 +313,13 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
>  {
>  }
>  
> -static inline void mem_cgroup_update_file_mapped(struct page *page,
> -							int val)
> +static inline void mem_cgroup_inc_page_stat(struct page *page,
> +				enum mem_cgroup_write_page_stat_item idx)
> +{
> +}
> +
> +static inline void mem_cgroup_dec_page_stat(struct page *page,
> +				enum mem_cgroup_write_page_stat_item idx)
>  {
>  }
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 512cb12..f4259f4 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1592,7 +1592,9 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
>   * possibility of race condition. If there is, we take a lock.
>   */
>  
> -static void mem_cgroup_update_file_stat(struct page *page, int idx, int val)
> +void mem_cgroup_update_page_stat(struct page *page,
> +				 enum mem_cgroup_write_page_stat_item idx,
> +				 int val)
>  {
>  	struct mem_cgroup *mem;
>  	struct page_cgroup *pc = lookup_page_cgroup(page);
> @@ -1615,30 +1617,27 @@ static void mem_cgroup_update_file_stat(struct page *page, int idx, int val)
>  			goto out;
>  	}
>  
> -	this_cpu_add(mem->stat->count[idx], val);
> -
>  	switch (idx) {
> -	case MEM_CGROUP_STAT_FILE_MAPPED:
> +	case MEMCG_NR_FILE_MAPPED:
>  		if (val > 0)
>  			SetPageCgroupFileMapped(pc);
>  		else if (!page_mapped(page))
>  			ClearPageCgroupFileMapped(pc);
> +		idx = MEM_CGROUP_STAT_FILE_MAPPED;
>  		break;
>  	default:
>  		BUG();
>  	}
>  
> +	this_cpu_add(mem->stat->count[idx], val);
> +

Why you move this_cpu_add() placement ?
(This placement is ok but I just wonder..)

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()
  2010-10-04  6:57 ` [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup() Greg Thelen
@ 2010-10-05  6:54   ` KAMEZAWA Hiroyuki
  2010-10-05  7:18     ` Greg Thelen
  2010-10-05 16:03   ` Minchan Kim
  2010-10-12  5:39   ` [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup() Balbir Singh
  2 siblings, 1 reply; 96+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-05  6:54 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura

On Sun,  3 Oct 2010 23:57:59 -0700
Greg Thelen <gthelen@google.com> wrote:

> If pages are being migrated from a memcg, then updates to that
> memcg's page statistics are protected by grabbing a bit spin lock
> using lock_page_cgroup().  In an upcoming commit memcg dirty page
> accounting will be updating memcg page accounting (specifically:
> num writeback pages) from softirq.  Avoid a deadlocking nested
> spin lock attempt by disabling interrupts on the local processor
> when grabbing the page_cgroup bit_spin_lock in lock_page_cgroup().
> This avoids the following deadlock:
> statistic
>       CPU 0             CPU 1
>                     inc_file_mapped
>                     rcu_read_lock
>   start move
>   synchronize_rcu
>                     lock_page_cgroup
>                       softirq
>                       test_clear_page_writeback
>                       mem_cgroup_dec_page_stat(NR_WRITEBACK)
>                       rcu_read_lock
>                       lock_page_cgroup   /* deadlock */
>                       unlock_page_cgroup
>                       rcu_read_unlock
>                     unlock_page_cgroup
>                     rcu_read_unlock
> 
> By disabling interrupts in lock_page_cgroup, nested calls
> are avoided.  The softirq would be delayed until after inc_file_mapped
> enables interrupts when calling unlock_page_cgroup().
> 
> The normal, fast path, of memcg page stat updates typically
> does not need to call lock_page_cgroup(), so this change does
> not affect the performance of the common case page accounting.
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>

Nice Catch!

But..hmm this wasn't necessary for FILE_MAPPED but is necessary for the new
statistics, right? (This affects the order of patches.)

Anyway

Acked-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


> ---
>  include/linux/page_cgroup.h |    8 +++++-
>  mm/memcontrol.c             |   51 +++++++++++++++++++++++++-----------------
>  2 files changed, 36 insertions(+), 23 deletions(-)
> 
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index b59c298..872f6b1 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -117,14 +117,18 @@ static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
>  	return page_zonenum(pc->page);
>  }
>  
> -static inline void lock_page_cgroup(struct page_cgroup *pc)
> +static inline void lock_page_cgroup(struct page_cgroup *pc,
> +				    unsigned long *flags)
>  {
> +	local_irq_save(*flags);
>  	bit_spin_lock(PCG_LOCK, &pc->flags);
>  }
>  
> -static inline void unlock_page_cgroup(struct page_cgroup *pc)
> +static inline void unlock_page_cgroup(struct page_cgroup *pc,
> +				      unsigned long flags)
>  {
>  	bit_spin_unlock(PCG_LOCK, &pc->flags);
> +	local_irq_restore(flags);
>  }
>  
>  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f4259f4..267d774 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1599,6 +1599,7 @@ void mem_cgroup_update_page_stat(struct page *page,
>  	struct mem_cgroup *mem;
>  	struct page_cgroup *pc = lookup_page_cgroup(page);
>  	bool need_unlock = false;
> +	unsigned long flags;
>  
>  	if (unlikely(!pc))
>  		return;
> @@ -1610,7 +1611,7 @@ void mem_cgroup_update_page_stat(struct page *page,
>  	/* pc->mem_cgroup is unstable ? */
>  	if (unlikely(mem_cgroup_stealed(mem))) {
>  		/* take a lock against to access pc->mem_cgroup */
> -		lock_page_cgroup(pc);
> +		lock_page_cgroup(pc, &flags);
>  		need_unlock = true;
>  		mem = pc->mem_cgroup;
>  		if (!mem || !PageCgroupUsed(pc))
> @@ -1633,7 +1634,7 @@ void mem_cgroup_update_page_stat(struct page *page,
>  
>  out:
>  	if (unlikely(need_unlock))
> -		unlock_page_cgroup(pc);
> +		unlock_page_cgroup(pc, flags);
>  	rcu_read_unlock();
>  	return;
>  }
> @@ -2053,11 +2054,12 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
>  	struct page_cgroup *pc;
>  	unsigned short id;
>  	swp_entry_t ent;
> +	unsigned long flags;
>  
>  	VM_BUG_ON(!PageLocked(page));
>  
>  	pc = lookup_page_cgroup(page);
> -	lock_page_cgroup(pc);
> +	lock_page_cgroup(pc, &flags);
>  	if (PageCgroupUsed(pc)) {
>  		mem = pc->mem_cgroup;
>  		if (mem && !css_tryget(&mem->css))
> @@ -2071,7 +2073,7 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
>  			mem = NULL;
>  		rcu_read_unlock();
>  	}
> -	unlock_page_cgroup(pc);
> +	unlock_page_cgroup(pc, flags);
>  	return mem;
>  }
>  
> @@ -2084,13 +2086,15 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
>  				     struct page_cgroup *pc,
>  				     enum charge_type ctype)
>  {
> +	unsigned long flags;
> +
>  	/* try_charge() can return NULL to *memcg, taking care of it. */
>  	if (!mem)
>  		return;
>  
> -	lock_page_cgroup(pc);
> +	lock_page_cgroup(pc, &flags);
>  	if (unlikely(PageCgroupUsed(pc))) {
> -		unlock_page_cgroup(pc);
> +		unlock_page_cgroup(pc, flags);
>  		mem_cgroup_cancel_charge(mem);
>  		return;
>  	}
> @@ -2120,7 +2124,7 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
>  
>  	mem_cgroup_charge_statistics(mem, pc, true);
>  
> -	unlock_page_cgroup(pc);
> +	unlock_page_cgroup(pc, flags);
>  	/*
>  	 * "charge_statistics" updated event counter. Then, check it.
>  	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
> @@ -2187,12 +2191,13 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
>  		struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
>  {
>  	int ret = -EINVAL;
> -	lock_page_cgroup(pc);
> +	unsigned long flags;
> +	lock_page_cgroup(pc, &flags);
>  	if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
>  		__mem_cgroup_move_account(pc, from, to, uncharge);
>  		ret = 0;
>  	}
> -	unlock_page_cgroup(pc);
> +	unlock_page_cgroup(pc, flags);
>  	/*
>  	 * check events
>  	 */
> @@ -2298,6 +2303,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
>  				gfp_t gfp_mask)
>  {
>  	int ret;
> +	unsigned long flags;
>  
>  	if (mem_cgroup_disabled())
>  		return 0;
> @@ -2320,12 +2326,12 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
>  		pc = lookup_page_cgroup(page);
>  		if (!pc)
>  			return 0;
> -		lock_page_cgroup(pc);
> +		lock_page_cgroup(pc, &flags);
>  		if (PageCgroupUsed(pc)) {
> -			unlock_page_cgroup(pc);
> +			unlock_page_cgroup(pc, flags);
>  			return 0;
>  		}
> -		unlock_page_cgroup(pc);
> +		unlock_page_cgroup(pc, flags);
>  	}
>  
>  	if (unlikely(!mm))
> @@ -2511,6 +2517,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
>  {
>  	struct page_cgroup *pc;
>  	struct mem_cgroup *mem = NULL;
> +	unsigned long flags;
>  
>  	if (mem_cgroup_disabled())
>  		return NULL;
> @@ -2525,7 +2532,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
>  	if (unlikely(!pc || !PageCgroupUsed(pc)))
>  		return NULL;
>  
> -	lock_page_cgroup(pc);
> +	lock_page_cgroup(pc, &flags);
>  
>  	mem = pc->mem_cgroup;
>  
> @@ -2560,7 +2567,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
>  	 * special functions.
>  	 */
>  
> -	unlock_page_cgroup(pc);
> +	unlock_page_cgroup(pc, flags);
>  	/*
>  	 * even after unlock, we have mem->res.usage here and this memcg
>  	 * will never be freed.
> @@ -2576,7 +2583,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
>  	return mem;
>  
>  unlock_out:
> -	unlock_page_cgroup(pc);
> +	unlock_page_cgroup(pc, flags);
>  	return NULL;
>  }
>  
> @@ -2765,12 +2772,13 @@ int mem_cgroup_prepare_migration(struct page *page,
>  	struct mem_cgroup *mem = NULL;
>  	enum charge_type ctype;
>  	int ret = 0;
> +	unsigned long flags;
>  
>  	if (mem_cgroup_disabled())
>  		return 0;
>  
>  	pc = lookup_page_cgroup(page);
> -	lock_page_cgroup(pc);
> +	lock_page_cgroup(pc, &flags);
>  	if (PageCgroupUsed(pc)) {
>  		mem = pc->mem_cgroup;
>  		css_get(&mem->css);
> @@ -2806,7 +2814,7 @@ int mem_cgroup_prepare_migration(struct page *page,
>  		if (PageAnon(page))
>  			SetPageCgroupMigration(pc);
>  	}
> -	unlock_page_cgroup(pc);
> +	unlock_page_cgroup(pc, flags);
>  	/*
>  	 * If the page is not charged at this point,
>  	 * we return here.
> @@ -2819,9 +2827,9 @@ int mem_cgroup_prepare_migration(struct page *page,
>  	css_put(&mem->css);/* drop extra refcnt */
>  	if (ret || *ptr == NULL) {
>  		if (PageAnon(page)) {
> -			lock_page_cgroup(pc);
> +			lock_page_cgroup(pc, &flags);
>  			ClearPageCgroupMigration(pc);
> -			unlock_page_cgroup(pc);
> +			unlock_page_cgroup(pc, flags);
>  			/*
>  			 * The old page may be fully unmapped while we kept it.
>  			 */
> @@ -2852,6 +2860,7 @@ void mem_cgroup_end_migration(struct mem_cgroup *mem,
>  {
>  	struct page *used, *unused;
>  	struct page_cgroup *pc;
> +	unsigned long flags;
>  
>  	if (!mem)
>  		return;
> @@ -2871,9 +2880,9 @@ void mem_cgroup_end_migration(struct mem_cgroup *mem,
>  	 * Clear the flag and check the page should be charged.
>  	 */
>  	pc = lookup_page_cgroup(oldpage);
> -	lock_page_cgroup(pc);
> +	lock_page_cgroup(pc, &flags);
>  	ClearPageCgroupMigration(pc);
> -	unlock_page_cgroup(pc);
> +	unlock_page_cgroup(pc, flags);
>  
>  	__mem_cgroup_uncharge_common(unused, MEM_CGROUP_CHARGE_TYPE_FORCE);
>  
> -- 
> 1.7.1
> 
> 
> 


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 06/10] memcg: add kernel calls for memcg dirty page stats
  2010-10-04  6:58 ` [PATCH 06/10] memcg: add kernel calls for memcg dirty page stats Greg Thelen
@ 2010-10-05  6:55   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 96+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-05  6:55 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura

On Sun,  3 Oct 2010 23:58:01 -0700
Greg Thelen <gthelen@google.com> wrote:

> Add calls into memcg dirty page accounting.  Notify memcg when pages
> transition between clean, file dirty, writeback, and unstable nfs.
> This allows the memory controller to maintain an accurate view of
> the amount of its memory that is dirty.
> 
> Signed-off-by: Greg Thelen <gthelen@google.com>
> Signed-off-by: Andrea Righi <arighi@develer.com>
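
For orientation only (the hunks themselves are not quoted in this reply),
a representative call site of the kind this patch adds could look like the
sketch below.  The memcg call is the point; the surrounding accounting is
abbreviated, and the exact functions touched are those in the patch, not
necessarily this one.

void account_page_dirtied(struct page *page, struct address_space *mapping)
{
	if (mapping_cap_account_dirty(mapping)) {
		/* new: keep the owning memcg's dirty counter in sync */
		mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_DIRTY);
		__inc_zone_page_state(page, NR_FILE_DIRTY);
		/* ... existing bdi and task accounting unchanged ... */
	}
}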

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 07/10] memcg: add dirty limits to mem_cgroup
  2010-10-04  6:58 ` [PATCH 07/10] memcg: add dirty limits to mem_cgroup Greg Thelen
@ 2010-10-05  7:07   ` KAMEZAWA Hiroyuki
  2010-10-05  9:43   ` Andrea Righi
  1 sibling, 0 replies; 96+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-05  7:07 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura

On Sun,  3 Oct 2010 23:58:02 -0700
Greg Thelen <gthelen@google.com> wrote:

> Extend mem_cgroup to contain dirty page limits.  Also add routines
> allowing the kernel to query the dirty usage of a memcg.
> 
> These interfaces are not used by the kernel yet.  A subsequent commit
> will add kernel calls to utilize these new routines.
> 
> Signed-off-by: Greg Thelen <gthelen@google.com>
> Signed-off-by: Andrea Righi <arighi@develer.com>
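
For orientation, the per-memcg parameters this refers to amount to
something like the sketch below.  Field and helper names follow the rest
of the series; the exact types and placement are in the patch itself, and
the layout here is illustrative only.

/* Per-memcg analogue of the global vm_dirty_* / dirty_background_* knobs. */
struct vm_dirty_param {
	int		dirty_ratio;
	unsigned long	dirty_bytes;
	int		dirty_background_ratio;
	unsigned long	dirty_background_bytes;
};

/*
 * struct mem_cgroup gains a "struct vm_dirty_param dirty_param" member.
 * get_vm_dirty_param() copies either those values or the global
 * vm_dirty_* settings (for the root cgroup) into a caller-supplied
 * struct, and mem_cgroup_page_stat() answers queries such as
 * MEMCG_NR_DIRTYABLE_PAGES against the current memcg.
 */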

Seems nice.
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 03/10] memcg: create extensible page stat update routines
  2010-10-05  6:51   ` KAMEZAWA Hiroyuki
@ 2010-10-05  7:10     ` Greg Thelen
  0 siblings, 0 replies; 96+ messages in thread
From: Greg Thelen @ 2010-10-05  7:10 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura

KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:

> On Sun,  3 Oct 2010 23:57:58 -0700
> Greg Thelen <gthelen@google.com> wrote:
>
>> Replace usage of the mem_cgroup_update_file_mapped() memcg
>> statistic update routine with two new routines:
>> * mem_cgroup_inc_page_stat()
>> * mem_cgroup_dec_page_stat()
>> 
>> As before, only the file_mapped statistic is managed.  However,
>> these more general interfaces allow for new statistics to be
>> more easily added.  New statistics are added with memcg dirty
>> page accounting.
>> 
>> Signed-off-by: Greg Thelen <gthelen@google.com>
>> Signed-off-by: Andrea Righi <arighi@develer.com>
>
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> a nitpick. see below.
>
>> ---
>>  include/linux/memcontrol.h |   31 ++++++++++++++++++++++++++++---
>>  mm/memcontrol.c            |   17 ++++++++---------
>>  mm/rmap.c                  |    4 ++--
>>  3 files changed, 38 insertions(+), 14 deletions(-)
>> 
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 159a076..7c7bec4 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -25,6 +25,11 @@ struct page_cgroup;
>>  struct page;
>>  struct mm_struct;
>>  
>> +/* Stats that can be updated by kernel. */
>> +enum mem_cgroup_write_page_stat_item {
>> +	MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
>> +};
>> +
>>  extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
>>  					struct list_head *dst,
>>  					unsigned long *scanned, int order,
>> @@ -121,7 +126,22 @@ static inline bool mem_cgroup_disabled(void)
>>  	return false;
>>  }
>>  
>> -void mem_cgroup_update_file_mapped(struct page *page, int val);
>> +void mem_cgroup_update_page_stat(struct page *page,
>> +				 enum mem_cgroup_write_page_stat_item idx,
>> +				 int val);
>> +
>> +static inline void mem_cgroup_inc_page_stat(struct page *page,
>> +				enum mem_cgroup_write_page_stat_item idx)
>> +{
>> +	mem_cgroup_update_page_stat(page, idx, 1);
>> +}
>> +
>> +static inline void mem_cgroup_dec_page_stat(struct page *page,
>> +				enum mem_cgroup_write_page_stat_item idx)
>> +{
>> +	mem_cgroup_update_page_stat(page, idx, -1);
>> +}
>> +
>>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>>  						gfp_t gfp_mask);
>>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
>> @@ -293,8 +313,13 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
>>  {
>>  }
>>  
>> -static inline void mem_cgroup_update_file_mapped(struct page *page,
>> -							int val)
>> +static inline void mem_cgroup_inc_page_stat(struct page *page,
>> +				enum mem_cgroup_write_page_stat_item idx)
>> +{
>> +}
>> +
>> +static inline void mem_cgroup_dec_page_stat(struct page *page,
>> +				enum mem_cgroup_write_page_stat_item idx)
>>  {
>>  }
>>  
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 512cb12..f4259f4 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -1592,7 +1592,9 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
>>   * possibility of race condition. If there is, we take a lock.
>>   */
>>  
>> -static void mem_cgroup_update_file_stat(struct page *page, int idx, int val)
>> +void mem_cgroup_update_page_stat(struct page *page,
>> +				 enum mem_cgroup_write_page_stat_item idx,
>> +				 int val)
>>  {
>>  	struct mem_cgroup *mem;
>>  	struct page_cgroup *pc = lookup_page_cgroup(page);
>> @@ -1615,30 +1617,27 @@ static void mem_cgroup_update_file_stat(struct page *page, int idx, int val)
>>  			goto out;
>>  	}
>>  
>> -	this_cpu_add(mem->stat->count[idx], val);
>> -
>>  	switch (idx) {
>> -	case MEM_CGROUP_STAT_FILE_MAPPED:
>> +	case MEMCG_NR_FILE_MAPPED:
>>  		if (val > 0)
>>  			SetPageCgroupFileMapped(pc);
>>  		else if (!page_mapped(page))
>>  			ClearPageCgroupFileMapped(pc);
>> +		idx = MEM_CGROUP_STAT_FILE_MAPPED;
>>  		break;
>>  	default:
>>  		BUG();
>>  	}
>>  
>> +	this_cpu_add(mem->stat->count[idx], val);
>> +
>
> Why you move this_cpu_add() placement ?
> (This placement is ok but I just wonder..)
>
> Thanks,
> -Kame

this_cpu_add() is moved after the switch because the switch is needed
to convert the input parameter from an enum
mem_cgroup_write_page_stat_item (example: MEMCG_NR_FILE_MAPPED) to an
enum mem_cgroup_stat_index (example: MEM_CGROUP_STAT_FILE_MAPPED)
before indexing into the count array.

Also, in subsequent patches in this series, "val" is adjusted based on
page_cgroup flags before it is passed to this_cpu_add().
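
Condensed, the hunk quoted above boils down to this shape (same code as
in the patch, trimmed for clarity):

	switch (idx) {
	case MEMCG_NR_FILE_MAPPED:		/* caller-visible index */
		if (val > 0)
			SetPageCgroupFileMapped(pc);
		else if (!page_mapped(page))
			ClearPageCgroupFileMapped(pc);
		idx = MEM_CGROUP_STAT_FILE_MAPPED;	/* internal index */
		break;
	default:
		BUG();
	}
	/* only now is idx a valid offset into mem->stat->count[] */
	this_cpu_add(mem->stat->count[idx], val);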

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 08/10] memcg: add cgroupfs interface to memcg dirty limits
  2010-10-04  6:58 ` [PATCH 08/10] memcg: add cgroupfs interface to memcg dirty limits Greg Thelen
@ 2010-10-05  7:13   ` KAMEZAWA Hiroyuki
  2010-10-05  7:33     ` Greg Thelen
  2010-10-06 13:30   ` Balbir Singh
  2010-10-07  6:23   ` Ciju Rajan K
  2 siblings, 1 reply; 96+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-05  7:13 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura

On Sun,  3 Oct 2010 23:58:03 -0700
Greg Thelen <gthelen@google.com> wrote:

> Add cgroupfs interface to memcg dirty page limits:
>   Direct write-out is controlled with:
>   - memory.dirty_ratio
>   - memory.dirty_bytes
> 
>   Background write-out is controlled with:
>   - memory.dirty_background_ratio
>   - memory.dirty_background_bytes
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

a question below.


> ---
>  mm/memcontrol.c |   89 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 89 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6ec2625..2d45a0a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -100,6 +100,13 @@ enum mem_cgroup_stat_index {
>  	MEM_CGROUP_STAT_NSTATS,
>  };
>  
> +enum {
> +	MEM_CGROUP_DIRTY_RATIO,
> +	MEM_CGROUP_DIRTY_BYTES,
> +	MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> +	MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> +};
> +
>  struct mem_cgroup_stat_cpu {
>  	s64 count[MEM_CGROUP_STAT_NSTATS];
>  };
> @@ -4292,6 +4299,64 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
>  	return 0;
>  }
>  
> +static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
> +{
> +	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> +	bool root;
> +
> +	root = mem_cgroup_is_root(mem);
> +
> +	switch (cft->private) {
> +	case MEM_CGROUP_DIRTY_RATIO:
> +		return root ? vm_dirty_ratio : mem->dirty_param.dirty_ratio;
> +	case MEM_CGROUP_DIRTY_BYTES:
> +		return root ? vm_dirty_bytes : mem->dirty_param.dirty_bytes;
> +	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
> +		return root ? dirty_background_ratio :
> +			mem->dirty_param.dirty_background_ratio;
> +	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
> +		return root ? dirty_background_bytes :
> +			mem->dirty_param.dirty_background_bytes;
> +	default:
> +		BUG();
> +	}
> +}
> +
> +static int
> +mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +	int type = cft->private;
> +
> +	if (cgrp->parent == NULL)
> +		return -EINVAL;
> +	if ((type == MEM_CGROUP_DIRTY_RATIO ||
> +	     type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)
> +		return -EINVAL;
> +	switch (type) {
> +	case MEM_CGROUP_DIRTY_RATIO:
> +		memcg->dirty_param.dirty_ratio = val;
> +		memcg->dirty_param.dirty_bytes = 0;
> +		break;
> +	case MEM_CGROUP_DIRTY_BYTES:
> +		memcg->dirty_param.dirty_bytes = val;
> +		memcg->dirty_param.dirty_ratio  = 0;
> +		break;
> +	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
> +		memcg->dirty_param.dirty_background_ratio = val;
> +		memcg->dirty_param.dirty_background_bytes = 0;
> +		break;
> +	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
> +		memcg->dirty_param.dirty_background_bytes = val;
> +		memcg->dirty_param.dirty_background_ratio = 0;
> +		break;


Curious... is this the same behavior as vm_dirty_ratio?


Thanks,
-Kame


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 09/10] writeback: make determine_dirtyable_memory() static.
  2010-10-04  6:58 ` [PATCH 09/10] writeback: make determine_dirtyable_memory() static Greg Thelen
@ 2010-10-05  7:15   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 96+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-05  7:15 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura

On Sun,  3 Oct 2010 23:58:04 -0700
Greg Thelen <gthelen@google.com> wrote:

> The determine_dirtyable_memory() function is not used outside of
> page writeback.  Make the routine static.  No functional change.
> Just a cleanup in preparation for a change that adds memcg dirty
> limits consideration into global_dirty_limits().
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>

Hmm.
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()
  2010-10-05  6:54   ` KAMEZAWA Hiroyuki
@ 2010-10-05  7:18     ` Greg Thelen
  0 siblings, 0 replies; 96+ messages in thread
From: Greg Thelen @ 2010-10-05  7:18 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura

KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:

> On Sun,  3 Oct 2010 23:57:59 -0700
> Greg Thelen <gthelen@google.com> wrote:
>
>> If pages are being migrated from a memcg, then updates to that
>> memcg's page statistics are protected by grabbing a bit spin lock
>> using lock_page_cgroup().  In an upcoming commit memcg dirty page
>> accounting will be updating memcg page accounting (specifically:
>> num writeback pages) from softirq.  Avoid a deadlocking nested
>> spin lock attempt by disabling interrupts on the local processor
>> when grabbing the page_cgroup bit_spin_lock in lock_page_cgroup().
>> This avoids the following deadlock:
>>       CPU 0             CPU 1
>>                     inc_file_mapped
>>                     rcu_read_lock
>>   start move
>>   synchronize_rcu
>>                     lock_page_cgroup
>>                       softirq
>>                       test_clear_page_writeback
>>                       mem_cgroup_dec_page_stat(NR_WRITEBACK)
>>                       rcu_read_lock
>>                       lock_page_cgroup   /* deadlock */
>>                       unlock_page_cgroup
>>                       rcu_read_unlock
>>                     unlock_page_cgroup
>>                     rcu_read_unlock
>> 
>> By disabling interrupts in lock_page_cgroup, nested calls
>> are avoided.  The softirq would be delayed until after inc_file_mapped
>> enables interrupts when calling unlock_page_cgroup().
>> 
>> The normal, fast path, of memcg page stat updates typically
>> does not need to call lock_page_cgroup(), so this change does
>> not affect the performance of the common case page accounting.
>> 
>> Signed-off-by: Andrea Righi <arighi@develer.com>
>> Signed-off-by: Greg Thelen <gthelen@google.com>
>
> Nice Catch!
>
> But.. hmm, this wasn't necessary for FILE_MAPPED but is necessary for the new
> statistics, right? (This affects the order of patches.)

This patch (disabling interrupts) is not needed until later patches (in
this series) update memcg statistics from softirq.  If we only had
FILE_MAPPED, then this patch would not be needed.  I placed this patch
before the following dependent patches that need it.  The opposite order
seemed wrong because it would introduce the possibility of the deadlock
until this patch was applied.  By having this patch come first there
should be no way to apply the series in order and see the mentioned
deadlock.

> Anyway
>
> Acked-By: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
>
>> ---
>>  include/linux/page_cgroup.h |    8 +++++-
>>  mm/memcontrol.c             |   51 +++++++++++++++++++++++++-----------------
>>  2 files changed, 36 insertions(+), 23 deletions(-)
>> 
>> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
>> index b59c298..872f6b1 100644
>> --- a/include/linux/page_cgroup.h
>> +++ b/include/linux/page_cgroup.h
>> @@ -117,14 +117,18 @@ static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
>>  	return page_zonenum(pc->page);
>>  }
>>  
>> -static inline void lock_page_cgroup(struct page_cgroup *pc)
>> +static inline void lock_page_cgroup(struct page_cgroup *pc,
>> +				    unsigned long *flags)
>>  {
>> +	local_irq_save(*flags);
>>  	bit_spin_lock(PCG_LOCK, &pc->flags);
>>  }
>>  
>> -static inline void unlock_page_cgroup(struct page_cgroup *pc)
>> +static inline void unlock_page_cgroup(struct page_cgroup *pc,
>> +				      unsigned long flags)
>>  {
>>  	bit_spin_unlock(PCG_LOCK, &pc->flags);
>> +	local_irq_restore(flags);
>>  }
>>  
>>  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index f4259f4..267d774 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -1599,6 +1599,7 @@ void mem_cgroup_update_page_stat(struct page *page,
>>  	struct mem_cgroup *mem;
>>  	struct page_cgroup *pc = lookup_page_cgroup(page);
>>  	bool need_unlock = false;
>> +	unsigned long flags;
>>  
>>  	if (unlikely(!pc))
>>  		return;
>> @@ -1610,7 +1611,7 @@ void mem_cgroup_update_page_stat(struct page *page,
>>  	/* pc->mem_cgroup is unstable ? */
>>  	if (unlikely(mem_cgroup_stealed(mem))) {
>>  		/* take a lock against to access pc->mem_cgroup */
>> -		lock_page_cgroup(pc);
>> +		lock_page_cgroup(pc, &flags);
>>  		need_unlock = true;
>>  		mem = pc->mem_cgroup;
>>  		if (!mem || !PageCgroupUsed(pc))
>> @@ -1633,7 +1634,7 @@ void mem_cgroup_update_page_stat(struct page *page,
>>  
>>  out:
>>  	if (unlikely(need_unlock))
>> -		unlock_page_cgroup(pc);
>> +		unlock_page_cgroup(pc, flags);
>>  	rcu_read_unlock();
>>  	return;
>>  }
>> @@ -2053,11 +2054,12 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
>>  	struct page_cgroup *pc;
>>  	unsigned short id;
>>  	swp_entry_t ent;
>> +	unsigned long flags;
>>  
>>  	VM_BUG_ON(!PageLocked(page));
>>  
>>  	pc = lookup_page_cgroup(page);
>> -	lock_page_cgroup(pc);
>> +	lock_page_cgroup(pc, &flags);
>>  	if (PageCgroupUsed(pc)) {
>>  		mem = pc->mem_cgroup;
>>  		if (mem && !css_tryget(&mem->css))
>> @@ -2071,7 +2073,7 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
>>  			mem = NULL;
>>  		rcu_read_unlock();
>>  	}
>> -	unlock_page_cgroup(pc);
>> +	unlock_page_cgroup(pc, flags);
>>  	return mem;
>>  }
>>  
>> @@ -2084,13 +2086,15 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
>>  				     struct page_cgroup *pc,
>>  				     enum charge_type ctype)
>>  {
>> +	unsigned long flags;
>> +
>>  	/* try_charge() can return NULL to *memcg, taking care of it. */
>>  	if (!mem)
>>  		return;
>>  
>> -	lock_page_cgroup(pc);
>> +	lock_page_cgroup(pc, &flags);
>>  	if (unlikely(PageCgroupUsed(pc))) {
>> -		unlock_page_cgroup(pc);
>> +		unlock_page_cgroup(pc, flags);
>>  		mem_cgroup_cancel_charge(mem);
>>  		return;
>>  	}
>> @@ -2120,7 +2124,7 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
>>  
>>  	mem_cgroup_charge_statistics(mem, pc, true);
>>  
>> -	unlock_page_cgroup(pc);
>> +	unlock_page_cgroup(pc, flags);
>>  	/*
>>  	 * "charge_statistics" updated event counter. Then, check it.
>>  	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
>> @@ -2187,12 +2191,13 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
>>  		struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
>>  {
>>  	int ret = -EINVAL;
>> -	lock_page_cgroup(pc);
>> +	unsigned long flags;
>> +	lock_page_cgroup(pc, &flags);
>>  	if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
>>  		__mem_cgroup_move_account(pc, from, to, uncharge);
>>  		ret = 0;
>>  	}
>> -	unlock_page_cgroup(pc);
>> +	unlock_page_cgroup(pc, flags);
>>  	/*
>>  	 * check events
>>  	 */
>> @@ -2298,6 +2303,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
>>  				gfp_t gfp_mask)
>>  {
>>  	int ret;
>> +	unsigned long flags;
>>  
>>  	if (mem_cgroup_disabled())
>>  		return 0;
>> @@ -2320,12 +2326,12 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
>>  		pc = lookup_page_cgroup(page);
>>  		if (!pc)
>>  			return 0;
>> -		lock_page_cgroup(pc);
>> +		lock_page_cgroup(pc, &flags);
>>  		if (PageCgroupUsed(pc)) {
>> -			unlock_page_cgroup(pc);
>> +			unlock_page_cgroup(pc, flags);
>>  			return 0;
>>  		}
>> -		unlock_page_cgroup(pc);
>> +		unlock_page_cgroup(pc, flags);
>>  	}
>>  
>>  	if (unlikely(!mm))
>> @@ -2511,6 +2517,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
>>  {
>>  	struct page_cgroup *pc;
>>  	struct mem_cgroup *mem = NULL;
>> +	unsigned long flags;
>>  
>>  	if (mem_cgroup_disabled())
>>  		return NULL;
>> @@ -2525,7 +2532,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
>>  	if (unlikely(!pc || !PageCgroupUsed(pc)))
>>  		return NULL;
>>  
>> -	lock_page_cgroup(pc);
>> +	lock_page_cgroup(pc, &flags);
>>  
>>  	mem = pc->mem_cgroup;
>>  
>> @@ -2560,7 +2567,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
>>  	 * special functions.
>>  	 */
>>  
>> -	unlock_page_cgroup(pc);
>> +	unlock_page_cgroup(pc, flags);
>>  	/*
>>  	 * even after unlock, we have mem->res.usage here and this memcg
>>  	 * will never be freed.
>> @@ -2576,7 +2583,7 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
>>  	return mem;
>>  
>>  unlock_out:
>> -	unlock_page_cgroup(pc);
>> +	unlock_page_cgroup(pc, flags);
>>  	return NULL;
>>  }
>>  
>> @@ -2765,12 +2772,13 @@ int mem_cgroup_prepare_migration(struct page *page,
>>  	struct mem_cgroup *mem = NULL;
>>  	enum charge_type ctype;
>>  	int ret = 0;
>> +	unsigned long flags;
>>  
>>  	if (mem_cgroup_disabled())
>>  		return 0;
>>  
>>  	pc = lookup_page_cgroup(page);
>> -	lock_page_cgroup(pc);
>> +	lock_page_cgroup(pc, &flags);
>>  	if (PageCgroupUsed(pc)) {
>>  		mem = pc->mem_cgroup;
>>  		css_get(&mem->css);
>> @@ -2806,7 +2814,7 @@ int mem_cgroup_prepare_migration(struct page *page,
>>  		if (PageAnon(page))
>>  			SetPageCgroupMigration(pc);
>>  	}
>> -	unlock_page_cgroup(pc);
>> +	unlock_page_cgroup(pc, flags);
>>  	/*
>>  	 * If the page is not charged at this point,
>>  	 * we return here.
>> @@ -2819,9 +2827,9 @@ int mem_cgroup_prepare_migration(struct page *page,
>>  	css_put(&mem->css);/* drop extra refcnt */
>>  	if (ret || *ptr == NULL) {
>>  		if (PageAnon(page)) {
>> -			lock_page_cgroup(pc);
>> +			lock_page_cgroup(pc, &flags);
>>  			ClearPageCgroupMigration(pc);
>> -			unlock_page_cgroup(pc);
>> +			unlock_page_cgroup(pc, flags);
>>  			/*
>>  			 * The old page may be fully unmapped while we kept it.
>>  			 */
>> @@ -2852,6 +2860,7 @@ void mem_cgroup_end_migration(struct mem_cgroup *mem,
>>  {
>>  	struct page *used, *unused;
>>  	struct page_cgroup *pc;
>> +	unsigned long flags;
>>  
>>  	if (!mem)
>>  		return;
>> @@ -2871,9 +2880,9 @@ void mem_cgroup_end_migration(struct mem_cgroup *mem,
>>  	 * Clear the flag and check the page should be charged.
>>  	 */
>>  	pc = lookup_page_cgroup(oldpage);
>> -	lock_page_cgroup(pc);
>> +	lock_page_cgroup(pc, &flags);
>>  	ClearPageCgroupMigration(pc);
>> -	unlock_page_cgroup(pc);
>> +	unlock_page_cgroup(pc, flags);
>>  
>>  	__mem_cgroup_uncharge_common(unused, MEM_CGROUP_CHARGE_TYPE_FORCE);
>>  
>> -- 
>> 1.7.1
>> 
>> 
>> 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 05/10] memcg: add dirty page accounting infrastructure
  2010-10-04  6:58 ` [PATCH 05/10] memcg: add dirty page accounting infrastructure Greg Thelen
@ 2010-10-05  7:22   ` KAMEZAWA Hiroyuki
  2010-10-05  7:35     ` Greg Thelen
  2010-10-05 16:09   ` Minchan Kim
  1 sibling, 1 reply; 96+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-05  7:22 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura

On Sun,  3 Oct 2010 23:58:00 -0700
Greg Thelen <gthelen@google.com> wrote:

> Add memcg routines to track dirty, writeback, and unstable_NFS pages.
> These routines are not yet used by the kernel to count such pages.
> A later change adds kernel calls to these new routines.
> 
> Signed-off-by: Greg Thelen <gthelen@google.com>
> Signed-off-by: Andrea Righi <arighi@develer.com>

a small request. see below.

> ---
>  include/linux/memcontrol.h |    3 +
>  mm/memcontrol.c            |   89 ++++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 84 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 7c7bec4..6303da1 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -28,6 +28,9 @@ struct mm_struct;
>  /* Stats that can be updated by kernel. */
>  enum mem_cgroup_write_page_stat_item {
>  	MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
> +	MEMCG_NR_FILE_DIRTY, /* # of dirty pages in page cache */
> +	MEMCG_NR_FILE_WRITEBACK, /* # of pages under writeback */
> +	MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
>  };
>  
>  extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 267d774..f40839f 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -85,10 +85,13 @@ enum mem_cgroup_stat_index {
>  	 */
>  	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
>  	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
> -	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
>  	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
>  	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
>  	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> +	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> +	MEM_CGROUP_STAT_FILE_DIRTY,	/* # of dirty pages in page cache */
> +	MEM_CGROUP_STAT_FILE_WRITEBACK,		/* # of pages under writeback */
> +	MEM_CGROUP_STAT_FILE_UNSTABLE_NFS,	/* # of NFS unstable pages */
>  	MEM_CGROUP_STAT_DATA, /* end of data requires synchronization */
>  	/* incremented at every  pagein/pageout */
>  	MEM_CGROUP_EVENTS = MEM_CGROUP_STAT_DATA,
> @@ -1626,6 +1629,48 @@ void mem_cgroup_update_page_stat(struct page *page,
>  			ClearPageCgroupFileMapped(pc);
>  		idx = MEM_CGROUP_STAT_FILE_MAPPED;
>  		break;
> +
> +	case MEMCG_NR_FILE_DIRTY:
> +		/* Use Test{Set,Clear} to only un/charge the memcg once. */
> +		if (val > 0) {
> +			if (TestSetPageCgroupFileDirty(pc))
> +				/* already set */
> +				val = 0;
> +		} else {
> +			if (!TestClearPageCgroupFileDirty(pc))
> +				/* already cleared */
> +				val = 0;
> +		}
> +		idx = MEM_CGROUP_STAT_FILE_DIRTY;
> +		break;
> +
> +	case MEMCG_NR_FILE_WRITEBACK:
> +		/*
> +		 * This counter is adjusted while holding the mapping's
> +		 * tree_lock.  Therefore there is no race between settings and
> +		 * clearing of this flag.
> +		 */

nice description.

> +		if (val > 0)
> +			SetPageCgroupFileWriteback(pc);
> +		else
> +			ClearPageCgroupFileWriteback(pc);
> +		idx = MEM_CGROUP_STAT_FILE_WRITEBACK;
> +		break;
> +
> +	case MEMCG_NR_FILE_UNSTABLE_NFS:
> +		/* Use Test{Set,Clear} to only un/charge the memcg once. */
> +		if (val > 0) {
> +			if (TestSetPageCgroupFileUnstableNFS(pc))
> +				/* already set */
> +				val = 0;
> +		} else {
> +			if (!TestClearPageCgroupFileUnstableNFS(pc))
> +				/* already cleared */
> +				val = 0;
> +		}
> +		idx = MEM_CGROUP_STAT_FILE_UNSTABLE_NFS;
> +		break;
> +
>  	default:
>  		BUG();
>  	}
> @@ -2133,6 +2178,16 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
>  	memcg_check_events(mem, pc->page);
>  }
>  
> +static void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
> +					      struct mem_cgroup *to,
> +					      enum mem_cgroup_stat_index idx)
> +{
> +	preempt_disable();
> +	__this_cpu_dec(from->stat->count[idx]);
> +	__this_cpu_inc(to->stat->count[idx]);
> +	preempt_enable();
> +}
> +
>  /**
>   * __mem_cgroup_move_account - move account of the page
>   * @pc:	page_cgroup of the page.
> @@ -2159,13 +2214,18 @@ static void __mem_cgroup_move_account(struct page_cgroup *pc,
>  	VM_BUG_ON(!PageCgroupUsed(pc));
>  	VM_BUG_ON(pc->mem_cgroup != from);
>  
> -	if (PageCgroupFileMapped(pc)) {
> -		/* Update mapped_file data for mem_cgroup */
> -		preempt_disable();
> -		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -		preempt_enable();
> -	}
> +	if (PageCgroupFileMapped(pc))
> +		mem_cgroup_move_account_page_stat(from, to,
> +					MEM_CGROUP_STAT_FILE_MAPPED);
> +	if (PageCgroupFileDirty(pc))
> +		mem_cgroup_move_account_page_stat(from, to,
> +					MEM_CGROUP_STAT_FILE_DIRTY);
> +	if (PageCgroupFileWriteback(pc))
> +		mem_cgroup_move_account_page_stat(from, to,
> +					MEM_CGROUP_STAT_FILE_WRITEBACK);
> +	if (PageCgroupFileUnstableNFS(pc))
> +		mem_cgroup_move_account_page_stat(from, to,
> +					MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
>  	mem_cgroup_charge_statistics(from, pc, false);
>  	if (uncharge)
>  		/* This is not "cancel", but cancel_charge does all we need. */
> @@ -3545,6 +3605,9 @@ enum {
>  	MCS_PGPGIN,
>  	MCS_PGPGOUT,
>  	MCS_SWAP,
> +	MCS_FILE_DIRTY,
> +	MCS_WRITEBACK,
> +	MCS_UNSTABLE_NFS,
>  	MCS_INACTIVE_ANON,
>  	MCS_ACTIVE_ANON,
>  	MCS_INACTIVE_FILE,
> @@ -3567,6 +3630,9 @@ struct {
>  	{"pgpgin", "total_pgpgin"},
>  	{"pgpgout", "total_pgpgout"},
>  	{"swap", "total_swap"},
> +	{"dirty", "total_dirty"},
> +	{"writeback", "total_writeback"},
> +	{"nfs", "total_nfs"},

Could you make this nfs_unstable, as meminfo shows?
If I were a user, I would think this is the number of NFS pages, not NFS_UNSTABLE pages.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 10/10] memcg: check memcg dirty limits in page writeback
  2010-10-04  6:58 ` [PATCH 10/10] memcg: check memcg dirty limits in page writeback Greg Thelen
@ 2010-10-05  7:29   ` KAMEZAWA Hiroyuki
  2010-10-06  0:32   ` Minchan Kim
  1 sibling, 0 replies; 96+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-05  7:29 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura

On Sun,  3 Oct 2010 23:58:05 -0700
Greg Thelen <gthelen@google.com> wrote:

> If the current process is in a non-root memcg, then
> global_dirty_limits() will consider the memcg dirty limit.
> This allows different cgroups to have distinct dirty limits
> which trigger direct and background writeback at different
> levels.
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>

This patch seems good because of its straightforward implementation.
I think it is worth testing in the -mm tree.

Thank you very much.

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 08/10] memcg: add cgroupfs interface to memcg dirty limits
  2010-10-05  7:33     ` Greg Thelen
@ 2010-10-05  7:31       ` KAMEZAWA Hiroyuki
  2010-10-05  9:18       ` Andrea Righi
  1 sibling, 0 replies; 96+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-05  7:31 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura

On Tue, 05 Oct 2010 00:33:15 -0700
Greg Thelen <gthelen@google.com> wrote:

> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:
> 
> > On Sun,  3 Oct 2010 23:58:03 -0700
> > Greg Thelen <gthelen@google.com> wrote:
> >
> >> Add cgroupfs interface to memcg dirty page limits:
> >>   Direct write-out is controlled with:
> >>   - memory.dirty_ratio
> >>   - memory.dirty_bytes
> >> 
> >>   Background write-out is controlled with:
> >>   - memory.dirty_background_ratio
> >>   - memory.dirty_background_bytes
> >> 
> >> Signed-off-by: Andrea Righi <arighi@develer.com>
> >> Signed-off-by: Greg Thelen <gthelen@google.com>
> >
> > Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >
> > a question below.
> >
> >
> >> ---
> >>  mm/memcontrol.c |   89 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>  1 files changed, 89 insertions(+), 0 deletions(-)
> >> 
> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> index 6ec2625..2d45a0a 100644
> >> --- a/mm/memcontrol.c
> >> +++ b/mm/memcontrol.c
> >> @@ -100,6 +100,13 @@ enum mem_cgroup_stat_index {
> >>  	MEM_CGROUP_STAT_NSTATS,
> >>  };
> >>  
> >> +enum {
> >> +	MEM_CGROUP_DIRTY_RATIO,
> >> +	MEM_CGROUP_DIRTY_BYTES,
> >> +	MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> >> +	MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> >> +};
> >> +
> >>  struct mem_cgroup_stat_cpu {
> >>  	s64 count[MEM_CGROUP_STAT_NSTATS];
> >>  };
> >> @@ -4292,6 +4299,64 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
> >>  	return 0;
> >>  }
> >>  
> >> +static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
> >> +{
> >> +	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> >> +	bool root;
> >> +
> >> +	root = mem_cgroup_is_root(mem);
> >> +
> >> +	switch (cft->private) {
> >> +	case MEM_CGROUP_DIRTY_RATIO:
> >> +		return root ? vm_dirty_ratio : mem->dirty_param.dirty_ratio;
> >> +	case MEM_CGROUP_DIRTY_BYTES:
> >> +		return root ? vm_dirty_bytes : mem->dirty_param.dirty_bytes;
> >> +	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
> >> +		return root ? dirty_background_ratio :
> >> +			mem->dirty_param.dirty_background_ratio;
> >> +	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
> >> +		return root ? dirty_background_bytes :
> >> +			mem->dirty_param.dirty_background_bytes;
> >> +	default:
> >> +		BUG();
> >> +	}
> >> +}
> >> +
> >> +static int
> >> +mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
> >> +{
> >> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> >> +	int type = cft->private;
> >> +
> >> +	if (cgrp->parent == NULL)
> >> +		return -EINVAL;
> >> +	if ((type == MEM_CGROUP_DIRTY_RATIO ||
> >> +	     type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)
> >> +		return -EINVAL;
> >> +	switch (type) {
> >> +	case MEM_CGROUP_DIRTY_RATIO:
> >> +		memcg->dirty_param.dirty_ratio = val;
> >> +		memcg->dirty_param.dirty_bytes = 0;
> >> +		break;
> >> +	case MEM_CGROUP_DIRTY_BYTES:
> >> +		memcg->dirty_param.dirty_bytes = val;
> >> +		memcg->dirty_param.dirty_ratio  = 0;
> >> +		break;
> >> +	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
> >> +		memcg->dirty_param.dirty_background_ratio = val;
> >> +		memcg->dirty_param.dirty_background_bytes = 0;
> >> +		break;
> >> +	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
> >> +		memcg->dirty_param.dirty_background_bytes = val;
> >> +		memcg->dirty_param.dirty_background_ratio = 0;
> >> +		break;
> >
> >
> > Curious....is this same behavior as vm_dirty_ratio ?
> 
> I think this is same behavior as vm_dirty_ratio.  When vm_dirty_ratio is
> changed then dirty_ratio_handler() will set vm_dirty_bytes=0.  When
> vm_dirty_bytes is written dirty_bytes_handler() will set
> vm_dirty_ratio=0.  So I think that the per-memcg dirty memory parameters
> mimic the behavior of vm_dirty_ratio, vm_dirty_bytes and the other
> global dirty parameters.
> 
Okay.

> Am I missing your question?
> 
No. Thank you for clarification.

-Kame


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 08/10] memcg: add cgroupfs interface to memcg dirty limits
  2010-10-05  7:13   ` KAMEZAWA Hiroyuki
@ 2010-10-05  7:33     ` Greg Thelen
  2010-10-05  7:31       ` KAMEZAWA Hiroyuki
  2010-10-05  9:18       ` Andrea Righi
  0 siblings, 2 replies; 96+ messages in thread
From: Greg Thelen @ 2010-10-05  7:33 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura

KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:

> On Sun,  3 Oct 2010 23:58:03 -0700
> Greg Thelen <gthelen@google.com> wrote:
>
>> Add cgroupfs interface to memcg dirty page limits:
>>   Direct write-out is controlled with:
>>   - memory.dirty_ratio
>>   - memory.dirty_bytes
>> 
>>   Background write-out is controlled with:
>>   - memory.dirty_background_ratio
>>   - memory.dirty_background_bytes
>> 
>> Signed-off-by: Andrea Righi <arighi@develer.com>
>> Signed-off-by: Greg Thelen <gthelen@google.com>
>
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> a question below.
>
>
>> ---
>>  mm/memcontrol.c |   89 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 files changed, 89 insertions(+), 0 deletions(-)
>> 
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 6ec2625..2d45a0a 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -100,6 +100,13 @@ enum mem_cgroup_stat_index {
>>  	MEM_CGROUP_STAT_NSTATS,
>>  };
>>  
>> +enum {
>> +	MEM_CGROUP_DIRTY_RATIO,
>> +	MEM_CGROUP_DIRTY_BYTES,
>> +	MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
>> +	MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
>> +};
>> +
>>  struct mem_cgroup_stat_cpu {
>>  	s64 count[MEM_CGROUP_STAT_NSTATS];
>>  };
>> @@ -4292,6 +4299,64 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
>>  	return 0;
>>  }
>>  
>> +static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
>> +{
>> +	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
>> +	bool root;
>> +
>> +	root = mem_cgroup_is_root(mem);
>> +
>> +	switch (cft->private) {
>> +	case MEM_CGROUP_DIRTY_RATIO:
>> +		return root ? vm_dirty_ratio : mem->dirty_param.dirty_ratio;
>> +	case MEM_CGROUP_DIRTY_BYTES:
>> +		return root ? vm_dirty_bytes : mem->dirty_param.dirty_bytes;
>> +	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
>> +		return root ? dirty_background_ratio :
>> +			mem->dirty_param.dirty_background_ratio;
>> +	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
>> +		return root ? dirty_background_bytes :
>> +			mem->dirty_param.dirty_background_bytes;
>> +	default:
>> +		BUG();
>> +	}
>> +}
>> +
>> +static int
>> +mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
>> +{
>> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
>> +	int type = cft->private;
>> +
>> +	if (cgrp->parent == NULL)
>> +		return -EINVAL;
>> +	if ((type == MEM_CGROUP_DIRTY_RATIO ||
>> +	     type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)
>> +		return -EINVAL;
>> +	switch (type) {
>> +	case MEM_CGROUP_DIRTY_RATIO:
>> +		memcg->dirty_param.dirty_ratio = val;
>> +		memcg->dirty_param.dirty_bytes = 0;
>> +		break;
>> +	case MEM_CGROUP_DIRTY_BYTES:
>> +		memcg->dirty_param.dirty_bytes = val;
>> +		memcg->dirty_param.dirty_ratio  = 0;
>> +		break;
>> +	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
>> +		memcg->dirty_param.dirty_background_ratio = val;
>> +		memcg->dirty_param.dirty_background_bytes = 0;
>> +		break;
>> +	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
>> +		memcg->dirty_param.dirty_background_bytes = val;
>> +		memcg->dirty_param.dirty_background_ratio = 0;
>> +		break;
>
>
> Curious....is this same behavior as vm_dirty_ratio ?

I think this is same behavior as vm_dirty_ratio.  When vm_dirty_ratio is
changed then dirty_ratio_handler() will set vm_dirty_bytes=0.  When
vm_dirty_bytes is written dirty_bytes_handler() will set
vm_dirty_ratio=0.  So I think that the per-memcg dirty memory parameters
mimic the behavior of vm_dirty_ratio, vm_dirty_bytes and the other
global dirty parameters.

Am I missing your question?
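
For illustration, the mutual exclusion both interfaces implement boils
down to something like this (a simplified sketch of the behavior, not
the actual dirty_ratio_handler()/dirty_bytes_handler() code, which goes
through the usual sysctl proc helpers):

extern int vm_dirty_ratio;		/* from mm/page-writeback.c */
extern unsigned long vm_dirty_bytes;

static int sketch_set_dirty_ratio(int val)
{
	if (val > 100)
		return -EINVAL;
	vm_dirty_ratio = val;
	vm_dirty_bytes = 0;	/* the counterpart reads back as 0 */
	return 0;
}

static int sketch_set_dirty_bytes(unsigned long val)
{
	vm_dirty_bytes = val;
	vm_dirty_ratio = 0;	/* the counterpart reads back as 0 */
	return 0;
}

The memcg write handler in patch 08/10 mirrors the same pattern per
cgroup, just against memcg->dirty_param instead of the global knobs.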

> Thanks,
> -Kame

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 05/10] memcg: add dirty page accounting infrastructure
  2010-10-05  7:22   ` KAMEZAWA Hiroyuki
@ 2010-10-05  7:35     ` Greg Thelen
  0 siblings, 0 replies; 96+ messages in thread
From: Greg Thelen @ 2010-10-05  7:35 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura

KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:

> On Sun,  3 Oct 2010 23:58:00 -0700
> Greg Thelen <gthelen@google.com> wrote:
>
>> Add memcg routines to track dirty, writeback, and unstable_NFS pages.
>> These routines are not yet used by the kernel to count such pages.
>> A later change adds kernel calls to these new routines.
>> 
>> Signed-off-by: Greg Thelen <gthelen@google.com>
>> Signed-off-by: Andrea Righi <arighi@develer.com>
>
> a small request. see below.
>
>> ---
>>  include/linux/memcontrol.h |    3 +
>>  mm/memcontrol.c            |   89 ++++++++++++++++++++++++++++++++++++++++----
>>  2 files changed, 84 insertions(+), 8 deletions(-)
>> 
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 7c7bec4..6303da1 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -28,6 +28,9 @@ struct mm_struct;
>>  /* Stats that can be updated by kernel. */
>>  enum mem_cgroup_write_page_stat_item {
>>  	MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
>> +	MEMCG_NR_FILE_DIRTY, /* # of dirty pages in page cache */
>> +	MEMCG_NR_FILE_WRITEBACK, /* # of pages under writeback */
>> +	MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
>>  };
>>  
>>  extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 267d774..f40839f 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -85,10 +85,13 @@ enum mem_cgroup_stat_index {
>>  	 */
>>  	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
>>  	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
>> -	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
>>  	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
>>  	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
>>  	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
>> +	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
>> +	MEM_CGROUP_STAT_FILE_DIRTY,	/* # of dirty pages in page cache */
>> +	MEM_CGROUP_STAT_FILE_WRITEBACK,		/* # of pages under writeback */
>> +	MEM_CGROUP_STAT_FILE_UNSTABLE_NFS,	/* # of NFS unstable pages */
>>  	MEM_CGROUP_STAT_DATA, /* end of data requires synchronization */
>>  	/* incremented at every  pagein/pageout */
>>  	MEM_CGROUP_EVENTS = MEM_CGROUP_STAT_DATA,
>> @@ -1626,6 +1629,48 @@ void mem_cgroup_update_page_stat(struct page *page,
>>  			ClearPageCgroupFileMapped(pc);
>>  		idx = MEM_CGROUP_STAT_FILE_MAPPED;
>>  		break;
>> +
>> +	case MEMCG_NR_FILE_DIRTY:
>> +		/* Use Test{Set,Clear} to only un/charge the memcg once. */
>> +		if (val > 0) {
>> +			if (TestSetPageCgroupFileDirty(pc))
>> +				/* already set */
>> +				val = 0;
>> +		} else {
>> +			if (!TestClearPageCgroupFileDirty(pc))
>> +				/* already cleared */
>> +				val = 0;
>> +		}
>> +		idx = MEM_CGROUP_STAT_FILE_DIRTY;
>> +		break;
>> +
>> +	case MEMCG_NR_FILE_WRITEBACK:
>> +		/*
>> +		 * This counter is adjusted while holding the mapping's
>> +		 * tree_lock.  Therefore there is no race between settings and
>> +		 * clearing of this flag.
>> +		 */
>
> nice description.
>
>> +		if (val > 0)
>> +			SetPageCgroupFileWriteback(pc);
>> +		else
>> +			ClearPageCgroupFileWriteback(pc);
>> +		idx = MEM_CGROUP_STAT_FILE_WRITEBACK;
>> +		break;
>> +
>> +	case MEMCG_NR_FILE_UNSTABLE_NFS:
>> +		/* Use Test{Set,Clear} to only un/charge the memcg once. */
>> +		if (val > 0) {
>> +			if (TestSetPageCgroupFileUnstableNFS(pc))
>> +				/* already set */
>> +				val = 0;
>> +		} else {
>> +			if (!TestClearPageCgroupFileUnstableNFS(pc))
>> +				/* already cleared */
>> +				val = 0;
>> +		}
>> +		idx = MEM_CGROUP_STAT_FILE_UNSTABLE_NFS;
>> +		break;
>> +
>>  	default:
>>  		BUG();
>>  	}
>> @@ -2133,6 +2178,16 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
>>  	memcg_check_events(mem, pc->page);
>>  }
>>  
>> +static void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
>> +					      struct mem_cgroup *to,
>> +					      enum mem_cgroup_stat_index idx)
>> +{
>> +	preempt_disable();
>> +	__this_cpu_dec(from->stat->count[idx]);
>> +	__this_cpu_inc(to->stat->count[idx]);
>> +	preempt_enable();
>> +}
>> +
>>  /**
>>   * __mem_cgroup_move_account - move account of the page
>>   * @pc:	page_cgroup of the page.
>> @@ -2159,13 +2214,18 @@ static void __mem_cgroup_move_account(struct page_cgroup *pc,
>>  	VM_BUG_ON(!PageCgroupUsed(pc));
>>  	VM_BUG_ON(pc->mem_cgroup != from);
>>  
>> -	if (PageCgroupFileMapped(pc)) {
>> -		/* Update mapped_file data for mem_cgroup */
>> -		preempt_disable();
>> -		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
>> -		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
>> -		preempt_enable();
>> -	}
>> +	if (PageCgroupFileMapped(pc))
>> +		mem_cgroup_move_account_page_stat(from, to,
>> +					MEM_CGROUP_STAT_FILE_MAPPED);
>> +	if (PageCgroupFileDirty(pc))
>> +		mem_cgroup_move_account_page_stat(from, to,
>> +					MEM_CGROUP_STAT_FILE_DIRTY);
>> +	if (PageCgroupFileWriteback(pc))
>> +		mem_cgroup_move_account_page_stat(from, to,
>> +					MEM_CGROUP_STAT_FILE_WRITEBACK);
>> +	if (PageCgroupFileUnstableNFS(pc))
>> +		mem_cgroup_move_account_page_stat(from, to,
>> +					MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
>>  	mem_cgroup_charge_statistics(from, pc, false);
>>  	if (uncharge)
>>  		/* This is not "cancel", but cancel_charge does all we need. */
>> @@ -3545,6 +3605,9 @@ enum {
>>  	MCS_PGPGIN,
>>  	MCS_PGPGOUT,
>>  	MCS_SWAP,
>> +	MCS_FILE_DIRTY,
>> +	MCS_WRITEBACK,
>> +	MCS_UNSTABLE_NFS,
>>  	MCS_INACTIVE_ANON,
>>  	MCS_ACTIVE_ANON,
>>  	MCS_INACTIVE_FILE,
>> @@ -3567,6 +3630,9 @@ struct {
>>  	{"pgpgin", "total_pgpgin"},
>>  	{"pgpgout", "total_pgpgout"},
>>  	{"swap", "total_swap"},
>> +	{"dirty", "total_dirty"},
>> +	{"writeback", "total_writeback"},
>> +	{"nfs", "total_nfs"},
>
> Could you make this nfs_unstable, as meminfo shows?
> If I were a user, I would think this is the number of NFS pages, not NFS_UNSTABLE pages.

Good catch!  In the next revision I will change this from
"nfs"/"total_nfs" to "NFS_Unstable"/"total_NFS_Unstable" to match
/proc/meminfo.
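
Presumably the stat string table entries would then read something like
this (a sketch of the stated rename, not the posted code):

	{"dirty", "total_dirty"},
	{"writeback", "total_writeback"},
	{"NFS_Unstable", "total_NFS_Unstable"},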

> Thanks,
> -Kame

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 00/10] memcg: per cgroup dirty page accounting
  2010-10-05  5:50   ` Greg Thelen
@ 2010-10-05  8:37     ` Ciju Rajan K
  0 siblings, 0 replies; 96+ messages in thread
From: Ciju Rajan K @ 2010-10-05  8:37 UTC (permalink / raw)
  To: Greg Thelen
  Cc: balbir, Andrew Morton, linux-kernel, linux-mm, containers,
	Andrea Righi, KAMEZAWA Hiroyuki, Daisuke Nishimura

Greg Thelen wrote:
> Balbir Singh <balbir@linux.vnet.ibm.com> writes:
>   
>> * Greg Thelen <gthelen@google.com> [2010-10-03 23:57:55]:
>>
>>     
>>> This patch set provides the ability for each cgroup to have independent dirty
>>> page limits.
>>>
>>> Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim)
>>> page cache used by a cgroup.  So, in case of multiple cgroup writers, they will
>>> not be able to consume more than their designated share of dirty pages and will
>>> be forced to perform write-out if they cross that limit.
>>>
>>> These patches were developed and tested on mmotm 2010-09-28-16-13.  The patches
>>> are based on a series proposed by Andrea Righi in Mar 2010.
>>>       
>> Hi, Greg,
>>
>> I see a problem with "    memcg: add dirty page accounting infrastructure".
>>
>> The reject is
>>
>>  enum mem_cgroup_write_page_stat_item {
>>         MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
>> +       MEMCG_NR_FILE_DIRTY, /* # of dirty pages in page cache */
>> +       MEMCG_NR_FILE_WRITEBACK, /* # of pages under writeback */
>> +       MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
>>  };
>>
>> I don't see mem_cgroup_write_page_stat_item in memcontrol.h. Is this
>> based on top of Kame's cleanup?
>>
>> I am working off of mmotm 28 sept 2010 16:13.
>>     
>
> Balbir,
>
> All of the 10 memcg dirty limits patches should apply directly to mmotm
> 28 sept 2010 16:13 without any other patches.  Any of Kame's cleanup
> patches that are not in mmotm are not needed by this memcg dirty limit
> series.
>
> The patch you refer to, "[PATCH 05/10] memcg: add dirty page accounting
> infrastructure" depends on a change from an earlier patch in the series.
> Specifically, "[PATCH 03/10] memcg: create extensible page stat update
> routines" contains the addition of mem_cgroup_write_page_stat_item:
>
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -25,6 +25,11 @@ struct page_cgroup;
>  struct page;
>  struct mm_struct;
>
> +/* Stats that can be updated by kernel. */
> +enum mem_cgroup_write_page_stat_item {
> +     MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
> +};
> +
>
> Do you have trouble applying patch 5 after applying patches 1-4?
>   
I could apply all the patches cleanly on mmotm 28/09/2010. The kernel build 
also went through.


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 08/10] memcg: add cgroupfs interface to memcg dirty limits
  2010-10-05  7:33     ` Greg Thelen
  2010-10-05  7:31       ` KAMEZAWA Hiroyuki
@ 2010-10-05  9:18       ` Andrea Righi
  2010-10-05 18:31         ` David Rientjes
  2010-10-06 18:34         ` Greg Thelen
  1 sibling, 2 replies; 96+ messages in thread
From: Andrea Righi @ 2010-10-05  9:18 UTC (permalink / raw)
  To: Greg Thelen
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, linux-kernel, linux-mm,
	containers, Balbir Singh, Daisuke Nishimura, David Rientjes

On Tue, Oct 05, 2010 at 12:33:15AM -0700, Greg Thelen wrote:
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:
> 
> > On Sun,  3 Oct 2010 23:58:03 -0700
> > Greg Thelen <gthelen@google.com> wrote:
> >
> >> Add cgroupfs interface to memcg dirty page limits:
> >>   Direct write-out is controlled with:
> >>   - memory.dirty_ratio
> >>   - memory.dirty_bytes
> >> 
> >>   Background write-out is controlled with:
> >>   - memory.dirty_background_ratio
> >>   - memory.dirty_background_bytes
> >> 
> >> Signed-off-by: Andrea Righi <arighi@develer.com>
> >> Signed-off-by: Greg Thelen <gthelen@google.com>
> >
> > Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >
> > a question below.
> >
> >
> >> ---
> >>  mm/memcontrol.c |   89 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>  1 files changed, 89 insertions(+), 0 deletions(-)
> >> 
> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> index 6ec2625..2d45a0a 100644
> >> --- a/mm/memcontrol.c
> >> +++ b/mm/memcontrol.c
> >> @@ -100,6 +100,13 @@ enum mem_cgroup_stat_index {
> >>  	MEM_CGROUP_STAT_NSTATS,
> >>  };
> >>  
> >> +enum {
> >> +	MEM_CGROUP_DIRTY_RATIO,
> >> +	MEM_CGROUP_DIRTY_BYTES,
> >> +	MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> >> +	MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> >> +};
> >> +
> >>  struct mem_cgroup_stat_cpu {
> >>  	s64 count[MEM_CGROUP_STAT_NSTATS];
> >>  };
> >> @@ -4292,6 +4299,64 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
> >>  	return 0;
> >>  }
> >>  
> >> +static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
> >> +{
> >> +	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> >> +	bool root;
> >> +
> >> +	root = mem_cgroup_is_root(mem);
> >> +
> >> +	switch (cft->private) {
> >> +	case MEM_CGROUP_DIRTY_RATIO:
> >> +		return root ? vm_dirty_ratio : mem->dirty_param.dirty_ratio;
> >> +	case MEM_CGROUP_DIRTY_BYTES:
> >> +		return root ? vm_dirty_bytes : mem->dirty_param.dirty_bytes;
> >> +	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
> >> +		return root ? dirty_background_ratio :
> >> +			mem->dirty_param.dirty_background_ratio;
> >> +	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
> >> +		return root ? dirty_background_bytes :
> >> +			mem->dirty_param.dirty_background_bytes;
> >> +	default:
> >> +		BUG();
> >> +	}
> >> +}
> >> +
> >> +static int
> >> +mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
> >> +{
> >> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> >> +	int type = cft->private;
> >> +
> >> +	if (cgrp->parent == NULL)
> >> +		return -EINVAL;
> >> +	if ((type == MEM_CGROUP_DIRTY_RATIO ||
> >> +	     type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)
> >> +		return -EINVAL;
> >> +	switch (type) {
> >> +	case MEM_CGROUP_DIRTY_RATIO:
> >> +		memcg->dirty_param.dirty_ratio = val;
> >> +		memcg->dirty_param.dirty_bytes = 0;
> >> +		break;
> >> +	case MEM_CGROUP_DIRTY_BYTES:
> >> +		memcg->dirty_param.dirty_bytes = val;
> >> +		memcg->dirty_param.dirty_ratio  = 0;
> >> +		break;
> >> +	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
> >> +		memcg->dirty_param.dirty_background_ratio = val;
> >> +		memcg->dirty_param.dirty_background_bytes = 0;
> >> +		break;
> >> +	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
> >> +		memcg->dirty_param.dirty_background_bytes = val;
> >> +		memcg->dirty_param.dirty_background_ratio = 0;
> >> +		break;
> >
> >
> > Curious....is this same behavior as vm_dirty_ratio ?
> 
> I think this is same behavior as vm_dirty_ratio.  When vm_dirty_ratio is
> changed then dirty_ratio_handler() will set vm_dirty_bytes=0.  When
> vm_dirty_bytes is written dirty_bytes_handler() will set
> vm_dirty_ratio=0.  So I think that the per-memcg dirty memory parameters
> mimic the behavior of vm_dirty_ratio, vm_dirty_bytes and the other
> global dirty parameters.
> 
> Am I missing your question?

mmh... looking at the code it seems the same behaviour, but in
Documentation/sysctl/vm.txt we say a different thing (i.e., for
dirty_bytes):

"If dirty_bytes is written, dirty_ratio becomes a function of its value
(dirty_bytes / the amount of dirtyable system memory)."

However, in dirty_bytes_handler()/dirty_ratio_handler() we actually set
the counterpart value to 0.

I think we should clarify the documentation.

Signed-off-by: Andrea Righi <arighi@develer.com>
---
 Documentation/sysctl/vm.txt |   12 ++++++++----
 1 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index b606c2c..30289fa 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -80,8 +80,10 @@ dirty_background_bytes
 Contains the amount of dirty memory at which the pdflush background writeback
 daemon will start writeback.
 
-If dirty_background_bytes is written, dirty_background_ratio becomes a function
-of its value (dirty_background_bytes / the amount of dirtyable system memory).
+Note: dirty_background_bytes is the counterpart of dirty_background_ratio. Only
+one of them may be specified at a time. When one sysctl is written it is
+immediately taken into account to evaluate the dirty memory limits and the
+other appears as 0 when read.
 
 ==============================================================
 
@@ -97,8 +99,10 @@ dirty_bytes
 Contains the amount of dirty memory at which a process generating disk writes
 will itself start writeback.
 
-If dirty_bytes is written, dirty_ratio becomes a function of its value
-(dirty_bytes / the amount of dirtyable system memory).
+Note: dirty_bytes is the counterpart of dirty_ratio. Only one of them may be
+specified at a time. When one sysctl is written it is immediately taken into
+account to evaluate the dirty memory limits and the other appears as 0 when
+read.
 
 Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any
 value lower than this limit will be ignored and the old configuration will be

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* Re: [PATCH 07/10] memcg: add dirty limits to mem_cgroup
  2010-10-04  6:58 ` [PATCH 07/10] memcg: add dirty limits to mem_cgroup Greg Thelen
  2010-10-05  7:07   ` KAMEZAWA Hiroyuki
@ 2010-10-05  9:43   ` Andrea Righi
  2010-10-05 19:00     ` Greg Thelen
  1 sibling, 1 reply; 96+ messages in thread
From: Andrea Righi @ 2010-10-05  9:43 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura

On Sun, Oct 03, 2010 at 11:58:02PM -0700, Greg Thelen wrote:
> Extend mem_cgroup to contain dirty page limits.  Also add routines
> allowing the kernel to query the dirty usage of a memcg.
> 
> These interfaces are not used by the kernel yet.  A subsequent commit
> will add kernel calls to utilize these new routines.

A small note below.

> 
> Signed-off-by: Greg Thelen <gthelen@google.com>
> Signed-off-by: Andrea Righi <arighi@develer.com>
> ---
>  include/linux/memcontrol.h |   44 +++++++++++
>  mm/memcontrol.c            |  180 +++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 223 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 6303da1..dc8952d 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -19,6 +19,7 @@
>  
>  #ifndef _LINUX_MEMCONTROL_H
>  #define _LINUX_MEMCONTROL_H
> +#include <linux/writeback.h>
>  #include <linux/cgroup.h>
>  struct mem_cgroup;
>  struct page_cgroup;
> @@ -33,6 +34,30 @@ enum mem_cgroup_write_page_stat_item {
>  	MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
>  };
>  
> +/* Cgroup memory statistics items exported to the kernel */
> +enum mem_cgroup_read_page_stat_item {
> +	MEMCG_NR_DIRTYABLE_PAGES,
> +	MEMCG_NR_RECLAIM_PAGES,
> +	MEMCG_NR_WRITEBACK,
> +	MEMCG_NR_DIRTY_WRITEBACK_PAGES,
> +};
> +
> +/* Dirty memory parameters */
> +struct vm_dirty_param {
> +	int dirty_ratio;
> +	int dirty_background_ratio;
> +	unsigned long dirty_bytes;
> +	unsigned long dirty_background_bytes;
> +};
> +
> +static inline void get_global_vm_dirty_param(struct vm_dirty_param *param)
> +{
> +	param->dirty_ratio = vm_dirty_ratio;
> +	param->dirty_bytes = vm_dirty_bytes;
> +	param->dirty_background_ratio = dirty_background_ratio;
> +	param->dirty_background_bytes = dirty_background_bytes;
> +}
> +
>  extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
>  					struct list_head *dst,
>  					unsigned long *scanned, int order,
> @@ -145,6 +170,10 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
>  	mem_cgroup_update_page_stat(page, idx, -1);
>  }
>  
> +bool mem_cgroup_has_dirty_limit(void);
> +void get_vm_dirty_param(struct vm_dirty_param *param);
> +s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item);
> +
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  						gfp_t gfp_mask);
>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
> @@ -326,6 +355,21 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
>  {
>  }
>  
> +static inline bool mem_cgroup_has_dirty_limit(void)
> +{
> +	return false;
> +}
> +
> +static inline void get_vm_dirty_param(struct vm_dirty_param *param)
> +{
> +	get_global_vm_dirty_param(param);
> +}
> +
> +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item)
> +{
> +	return -ENOSYS;
> +}
> +
>  static inline
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  					    gfp_t gfp_mask)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f40839f..6ec2625 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -233,6 +233,10 @@ struct mem_cgroup {
>  	atomic_t	refcnt;
>  
>  	unsigned int	swappiness;
> +
> +	/* control memory cgroup dirty pages */
> +	struct vm_dirty_param dirty_param;
> +
>  	/* OOM-Killer disable */
>  	int		oom_kill_disable;
>  
> @@ -1132,6 +1136,172 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
>  	return swappiness;
>  }
>  
> +/*
> + * Returns a snapshot of the current dirty limits which is not synchronized with
> + * the routines that change the dirty limits.  If this routine races with an
> + * update to the dirty bytes/ratio value, then the caller must handle the case
> + * where both dirty_[background_]_ratio and _bytes are set.
> + */
> +static void __mem_cgroup_get_dirty_param(struct vm_dirty_param *param,
> +					 struct mem_cgroup *mem)
> +{
> +	if (mem && !mem_cgroup_is_root(mem)) {
> +		param->dirty_ratio = mem->dirty_param.dirty_ratio;
> +		param->dirty_bytes = mem->dirty_param.dirty_bytes;
> +		param->dirty_background_ratio =
> +			mem->dirty_param.dirty_background_ratio;
> +		param->dirty_background_bytes =
> +			mem->dirty_param.dirty_background_bytes;
> +	} else {
> +		get_global_vm_dirty_param(param);
> +	}
> +}
> +
> +/*
> + * Get dirty memory parameters of the current memcg or global values (if memory
> + * cgroups are disabled or querying the root cgroup).
> + */
> +void get_vm_dirty_param(struct vm_dirty_param *param)
> +{
> +	struct mem_cgroup *memcg;
> +
> +	if (mem_cgroup_disabled()) {
> +		get_global_vm_dirty_param(param);
> +		return;
> +	}
> +
> +	/*
> +	 * It's possible that "current" may be moved to other cgroup while we
> +	 * access cgroup. But precise check is meaningless because the task can
> +	 * be moved after our access and writeback tends to take long time.  At
> +	 * least, "memcg" will not be freed under rcu_read_lock().
> +	 */
> +	rcu_read_lock();
> +	memcg = mem_cgroup_from_task(current);
> +	__mem_cgroup_get_dirty_param(param, memcg);
> +	rcu_read_unlock();
> +}
> +
> +/*
> + * Check if current memcg has local dirty limits.  Return true if the current
> + * memory cgroup has local dirty memory settings.
> + */
> +bool mem_cgroup_has_dirty_limit(void)
> +{
> +	struct mem_cgroup *mem;
> +
> +	if (mem_cgroup_disabled())
> +		return false;
> +
> +	mem = mem_cgroup_from_task(current);
> +	return mem && !mem_cgroup_is_root(mem);
> +}

We only check the pointer without dereferencing it, so this is probably
ok, but maybe this is safer:

bool mem_cgroup_has_dirty_limit(void)
{
	struct mem_cgroup *mem;
	bool ret;

	if (mem_cgroup_disabled())
		return false;

	rcu_read_lock();
	mem = mem_cgroup_from_task(current);
	ret = mem && !mem_cgroup_is_root(mem);
	rcu_read_unlock();

	return ret;
}

rcu_read_lock() should be held when calling mem_cgroup_from_task();
otherwise lockdep could report it as an error.

Thanks,
-Andrea

> +
> +static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
> +{
> +	if (!do_swap_account)
> +		return nr_swap_pages > 0;
> +	return !memcg->memsw_is_minimum &&
> +		(res_counter_read_u64(&memcg->memsw, RES_LIMIT) > 0);
> +}
> +
> +static s64 mem_cgroup_get_local_page_stat(struct mem_cgroup *mem,
> +				enum mem_cgroup_read_page_stat_item item)
> +{
> +	s64 ret;
> +
> +	switch (item) {
> +	case MEMCG_NR_DIRTYABLE_PAGES:
> +		ret = mem_cgroup_read_stat(mem, LRU_ACTIVE_FILE) +
> +			mem_cgroup_read_stat(mem, LRU_INACTIVE_FILE);
> +		if (mem_cgroup_can_swap(mem))
> +			ret += mem_cgroup_read_stat(mem, LRU_ACTIVE_ANON) +
> +				mem_cgroup_read_stat(mem, LRU_INACTIVE_ANON);
> +		break;
> +	case MEMCG_NR_RECLAIM_PAGES:
> +		ret = mem_cgroup_read_stat(mem,	MEM_CGROUP_STAT_FILE_DIRTY) +
> +			mem_cgroup_read_stat(mem,
> +					     MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
> +		break;
> +	case MEMCG_NR_WRITEBACK:
> +		ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_WRITEBACK);
> +		break;
> +	case MEMCG_NR_DIRTY_WRITEBACK_PAGES:
> +		ret = mem_cgroup_read_stat(mem,
> +					   MEM_CGROUP_STAT_FILE_WRITEBACK) +
> +			mem_cgroup_read_stat(mem,
> +					     MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
> +		break;
> +	default:
> +		BUG();
> +		break;
> +	}
> +	return ret;
> +}
> +
> +static unsigned long long
> +memcg_get_hierarchical_free_pages(struct mem_cgroup *mem)
> +{
> +	struct cgroup *cgroup;
> +	unsigned long long min_free, free;
> +
> +	min_free = res_counter_read_u64(&mem->res, RES_LIMIT) -
> +		res_counter_read_u64(&mem->res, RES_USAGE);
> +	cgroup = mem->css.cgroup;
> +	if (!mem->use_hierarchy)
> +		goto out;
> +
> +	while (cgroup->parent) {
> +		cgroup = cgroup->parent;
> +		mem = mem_cgroup_from_cont(cgroup);
> +		if (!mem->use_hierarchy)
> +			break;
> +		free = res_counter_read_u64(&mem->res, RES_LIMIT) -
> +			res_counter_read_u64(&mem->res, RES_USAGE);
> +		min_free = min(min_free, free);
> +	}
> +out:
> +	/* Translate free memory in pages */
> +	return min_free >> PAGE_SHIFT;
> +}
> +
> +/*
> + * mem_cgroup_page_stat() - get memory cgroup file cache statistics
> + * @item:      memory statistic item exported to the kernel
> + *
> + * Return the accounted statistic value.
> + */
> +s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item)
> +{
> +	struct mem_cgroup *mem;
> +	struct mem_cgroup *iter;
> +	s64 value;
> +
> +	rcu_read_lock();
> +	mem = mem_cgroup_from_task(current);
> +	if (mem && !mem_cgroup_is_root(mem)) {
> +		/*
> +		 * If we're looking for dirtyable pages we need to evaluate
> +		 * free pages depending on the limit and usage of the parents
> +		 * first of all.
> +		 */
> +		if (item == MEMCG_NR_DIRTYABLE_PAGES)
> +			value = memcg_get_hierarchical_free_pages(mem);
> +		else
> +			value = 0;
> +		/*
> +		 * Recursively evaluate page statistics against all cgroup
> +		 * under hierarchy tree
> +		 */
> +		for_each_mem_cgroup_tree(iter, mem)
> +			value += mem_cgroup_get_local_page_stat(iter, item);
> +	} else
> +		value = -EINVAL;
> +	rcu_read_unlock();
> +
> +	return value;
> +}
> +
>  static void mem_cgroup_start_move(struct mem_cgroup *mem)
>  {
>  	int cpu;
> @@ -4444,8 +4614,16 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  	spin_lock_init(&mem->reclaim_param_lock);
>  	INIT_LIST_HEAD(&mem->oom_notify);
>  
> -	if (parent)
> +	if (parent) {
>  		mem->swappiness = get_swappiness(parent);
> +		__mem_cgroup_get_dirty_param(&mem->dirty_param, parent);
> +	} else {
> +		/*
> +		 * The root cgroup dirty_param field is not used, instead,
> +		 * system-wide dirty limits are used.
> +		 */
> +	}
> +
>  	atomic_set(&mem->refcnt, 1);
>  	mem->move_charge_at_immigrate = 0;
>  	mutex_init(&mem->thresholds_lock);
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 03/10] memcg: create extensible page stat update routines
  2010-10-04  6:57 ` [PATCH 03/10] memcg: create extensible page stat update routines Greg Thelen
  2010-10-04 13:48   ` Ciju Rajan K
  2010-10-05  6:51   ` KAMEZAWA Hiroyuki
@ 2010-10-05 15:42   ` Minchan Kim
  2010-10-05 19:59     ` Greg Thelen
  2010-10-06 16:19   ` Balbir Singh
  3 siblings, 1 reply; 96+ messages in thread
From: Minchan Kim @ 2010-10-05 15:42 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura

On Sun, Oct 03, 2010 at 11:57:58PM -0700, Greg Thelen wrote:
> Replace usage of the mem_cgroup_update_file_mapped() memcg
> statistic update routine with two new routines:
> * mem_cgroup_inc_page_stat()
> * mem_cgroup_dec_page_stat()
> 
> As before, only the file_mapped statistic is managed.  However,
> these more general interfaces allow for new statistics to be
> more easily added.  New statistics are added with memcg dirty
> page accounting.
> 
> Signed-off-by: Greg Thelen <gthelen@google.com>
> Signed-off-by: Andrea Righi <arighi@develer.com>
> ---
>  include/linux/memcontrol.h |   31 ++++++++++++++++++++++++++++---
>  mm/memcontrol.c            |   17 ++++++++---------
>  mm/rmap.c                  |    4 ++--
>  3 files changed, 38 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 159a076..7c7bec4 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -25,6 +25,11 @@ struct page_cgroup;
>  struct page;
>  struct mm_struct;
>  
> +/* Stats that can be updated by kernel. */
> +enum mem_cgroup_write_page_stat_item {
> +	MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
> +};
> +

mem_cgroup_"write"_page_stat_item?
Does "write" make sense for abstracting page state generally?

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()
  2010-10-04  6:57 ` [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup() Greg Thelen
  2010-10-05  6:54   ` KAMEZAWA Hiroyuki
@ 2010-10-05 16:03   ` Minchan Kim
  2010-10-05 23:26     ` Greg Thelen
  2010-10-12  5:39   ` [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup() Balbir Singh
  2 siblings, 1 reply; 96+ messages in thread
From: Minchan Kim @ 2010-10-05 16:03 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura

On Sun, Oct 03, 2010 at 11:57:59PM -0700, Greg Thelen wrote:
> If pages are being migrated from a memcg, then updates to that
> memcg's page statistics are protected by grabbing a bit spin lock
> using lock_page_cgroup().  In an upcoming commit memcg dirty page
> accounting will be updating memcg page accounting (specifically:
> num writeback pages) from softirq.  Avoid a deadlocking nested
> spin lock attempt by disabling interrupts on the local processor
> when grabbing the page_cgroup bit_spin_lock in lock_page_cgroup().
> This avoids the following deadlock:
>       CPU 0             CPU 1
>                     inc_file_mapped
>                     rcu_read_lock
>   start move
>   synchronize_rcu
>                     lock_page_cgroup
>                       softirq
>                       test_clear_page_writeback
>                       mem_cgroup_dec_page_stat(NR_WRITEBACK)
>                       rcu_read_lock
>                       lock_page_cgroup   /* deadlock */
>                       unlock_page_cgroup
>                       rcu_read_unlock
>                     unlock_page_cgroup
>                     rcu_read_unlock
> 
> By disabling interrupts in lock_page_cgroup, nested calls
> are avoided.  The softirq would be delayed until after inc_file_mapped
> enables interrupts when calling unlock_page_cgroup().
> 
> The normal, fast path, of memcg page stat updates typically
> does not need to call lock_page_cgroup(), so this change does
> not affect the performance of the common case page accounting.
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>
> ---
>  include/linux/page_cgroup.h |    8 +++++-
>  mm/memcontrol.c             |   51 +++++++++++++++++++++++++-----------------
>  2 files changed, 36 insertions(+), 23 deletions(-)
> 
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index b59c298..872f6b1 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -117,14 +117,18 @@ static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
>  	return page_zonenum(pc->page);
>  }
>  
> -static inline void lock_page_cgroup(struct page_cgroup *pc)
> +static inline void lock_page_cgroup(struct page_cgroup *pc,
> +				    unsigned long *flags)
>  {
> +	local_irq_save(*flags);
>  	bit_spin_lock(PCG_LOCK, &pc->flags);
>  }

Hmm. Let me ask questions. 

1. Why do you add a new irq-disable region in a general function?
I think __do_fault is one of the fast paths.

Could you disable softirq using _local_bh_disable_ in your context
rather than in a general function?
How many users do you expect to need the irq lock to update page state?
What if they don't need to disable irq?

We could pass an argument indicating whether the irq lock is needed or not,
but that seems to make the code very ugly.
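
Just to illustrate what I mean (this extra parameter is invented, not a
real proposal):

/* every caller would have to know whether a softirq user can race */
static inline void lock_page_cgroup(struct page_cgroup *pc, bool need_irq,
				    unsigned long *flags)
{
	if (need_irq)
		local_irq_save(*flags);
	bit_spin_lock(PCG_LOCK, &pc->flags);
}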

2. So could you solve the problem in your design?
I mean, could you update page state outside of softirq?
(I haven't looked at all of your patches. Sorry if I am missing something.)

3. Normally, we have updated page state without disabling irq.
Why does memcg need it?

I hope we avoid adding irq-disable regions as far as possible.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 05/10] memcg: add dirty page accounting infrastructure
  2010-10-04  6:58 ` [PATCH 05/10] memcg: add dirty page accounting infrastructure Greg Thelen
  2010-10-05  7:22   ` KAMEZAWA Hiroyuki
@ 2010-10-05 16:09   ` Minchan Kim
  2010-10-05 20:06     ` Greg Thelen
  1 sibling, 1 reply; 96+ messages in thread
From: Minchan Kim @ 2010-10-05 16:09 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura

On Sun, Oct 03, 2010 at 11:58:00PM -0700, Greg Thelen wrote:
> Add memcg routines to track dirty, writeback, and unstable_NFS pages.
> These routines are not yet used by the kernel to count such pages.
> A later change adds kernel calls to these new routines.
> 
> Signed-off-by: Greg Thelen <gthelen@google.com>
> Signed-off-by: Andrea Righi <arighi@develer.com>
> ---
>  include/linux/memcontrol.h |    3 +
>  mm/memcontrol.c            |   89 ++++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 84 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 7c7bec4..6303da1 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -28,6 +28,9 @@ struct mm_struct;
>  /* Stats that can be updated by kernel. */
>  enum mem_cgroup_write_page_stat_item {
>  	MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
> +	MEMCG_NR_FILE_DIRTY, /* # of dirty pages in page cache */
> +	MEMCG_NR_FILE_WRITEBACK, /* # of pages under writeback */
> +	MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
>  };
>  
>  extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 267d774..f40839f 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -85,10 +85,13 @@ enum mem_cgroup_stat_index {
>  	 */
>  	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
>  	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
> -	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
>  	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
>  	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
>  	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> +	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
> +	MEM_CGROUP_STAT_FILE_DIRTY,	/* # of dirty pages in page cache */
> +	MEM_CGROUP_STAT_FILE_WRITEBACK,		/* # of pages under writeback */
> +	MEM_CGROUP_STAT_FILE_UNSTABLE_NFS,	/* # of NFS unstable pages */
>  	MEM_CGROUP_STAT_DATA, /* end of data requires synchronization */
>  	/* incremented at every  pagein/pageout */
>  	MEM_CGROUP_EVENTS = MEM_CGROUP_STAT_DATA,
> @@ -1626,6 +1629,48 @@ void mem_cgroup_update_page_stat(struct page *page,
>  			ClearPageCgroupFileMapped(pc);
>  		idx = MEM_CGROUP_STAT_FILE_MAPPED;
>  		break;
> +
> +	case MEMCG_NR_FILE_DIRTY:
> +		/* Use Test{Set,Clear} to only un/charge the memcg once. */
> +		if (val > 0) {
> +			if (TestSetPageCgroupFileDirty(pc))
> +				/* already set */

Nitpick. 
The comment doesn't give any useful information.
It looks like redundant. 

> +				val = 0;
> +		} else {
> +			if (!TestClearPageCgroupFileDirty(pc))
> +				/* already cleared */

Ditto

> +				val = 0;
> +		}
> +		idx = MEM_CGROUP_STAT_FILE_DIRTY;
> +		break;
> +
> +	case MEMCG_NR_FILE_WRITEBACK:
> +		/*
> +		 * This counter is adjusted while holding the mapping's
> +		 * tree_lock.  Therefore there is no race between settings and
> +		 * clearing of this flag.
> +		 */
> +		if (val > 0)
> +			SetPageCgroupFileWriteback(pc);
> +		else
> +			ClearPageCgroupFileWriteback(pc);
> +		idx = MEM_CGROUP_STAT_FILE_WRITEBACK;
> +		break;
> +
> +	case MEMCG_NR_FILE_UNSTABLE_NFS:
> +		/* Use Test{Set,Clear} to only un/charge the memcg once. */
> +		if (val > 0) {
> +			if (TestSetPageCgroupFileUnstableNFS(pc))
> +				/* already set */

Ditto 

> +				val = 0;
> +		} else {
> +			if (!TestClearPageCgroupFileUnstableNFS(pc))
> +				/* already cleared */

Ditto 

> +				val = 0;
> +		}
> +		idx = MEM_CGROUP_STAT_FILE_UNSTABLE_NFS;
> +		break;
> +
>  	default:
>  		BUG();
>  	}
> @@ -2133,6 +2178,16 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
>  	memcg_check_events(mem, pc->page);
>  }
>  
> +static void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
> +					      struct mem_cgroup *to,
> +					      enum mem_cgroup_stat_index idx)
> +{
> +	preempt_disable();
> +	__this_cpu_dec(from->stat->count[idx]);
> +	__this_cpu_inc(to->stat->count[idx]);
> +	preempt_enable();
> +}
> +
>  /**
>   * __mem_cgroup_move_account - move account of the page
>   * @pc:	page_cgroup of the page.
> @@ -2159,13 +2214,18 @@ static void __mem_cgroup_move_account(struct page_cgroup *pc,
>  	VM_BUG_ON(!PageCgroupUsed(pc));
>  	VM_BUG_ON(pc->mem_cgroup != from);
>  
> -	if (PageCgroupFileMapped(pc)) {
> -		/* Update mapped_file data for mem_cgroup */
> -		preempt_disable();
> -		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -		preempt_enable();
> -	}
> +	if (PageCgroupFileMapped(pc))
> +		mem_cgroup_move_account_page_stat(from, to,
> +					MEM_CGROUP_STAT_FILE_MAPPED);
> +	if (PageCgroupFileDirty(pc))
> +		mem_cgroup_move_account_page_stat(from, to,
> +					MEM_CGROUP_STAT_FILE_DIRTY);
> +	if (PageCgroupFileWriteback(pc))
> +		mem_cgroup_move_account_page_stat(from, to,
> +					MEM_CGROUP_STAT_FILE_WRITEBACK);
> +	if (PageCgroupFileUnstableNFS(pc))
> +		mem_cgroup_move_account_page_stat(from, to,
> +					MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
>  	mem_cgroup_charge_statistics(from, pc, false);
>  	if (uncharge)
>  		/* This is not "cancel", but cancel_charge does all we need. */
> @@ -3545,6 +3605,9 @@ enum {
>  	MCS_PGPGIN,
>  	MCS_PGPGOUT,
>  	MCS_SWAP,
> +	MCS_FILE_DIRTY,
> +	MCS_WRITEBACK,
> +	MCS_UNSTABLE_NFS,
>  	MCS_INACTIVE_ANON,
>  	MCS_ACTIVE_ANON,
>  	MCS_INACTIVE_FILE,
> @@ -3567,6 +3630,9 @@ struct {
>  	{"pgpgin", "total_pgpgin"},
>  	{"pgpgout", "total_pgpgout"},
>  	{"swap", "total_swap"},
> +	{"dirty", "total_dirty"},
> +	{"writeback", "total_writeback"},
> +	{"nfs", "total_nfs"},
>  	{"inactive_anon", "total_inactive_anon"},
>  	{"active_anon", "total_active_anon"},
>  	{"inactive_file", "total_inactive_file"},
> @@ -3596,6 +3662,13 @@ mem_cgroup_get_local_stat(struct mem_cgroup *mem, struct mcs_total_stat *s)
>  		s->stat[MCS_SWAP] += val * PAGE_SIZE;
>  	}
>  
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
> +	s->stat[MCS_FILE_DIRTY] += val * PAGE_SIZE;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_WRITEBACK);
> +	s->stat[MCS_WRITEBACK] += val * PAGE_SIZE;
> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
> +	s->stat[MCS_UNSTABLE_NFS] += val * PAGE_SIZE;
> +
>  	/* per zone stat */
>  	val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
>  	s->stat[MCS_INACTIVE_ANON] += val * PAGE_SIZE;
> -- 
> 1.7.1
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 08/10] memcg: add cgroupfs interface to memcg dirty limits
  2010-10-05  9:18       ` Andrea Righi
@ 2010-10-05 18:31         ` David Rientjes
  2010-10-06 18:34         ` Greg Thelen
  1 sibling, 0 replies; 96+ messages in thread
From: David Rientjes @ 2010-10-05 18:31 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Greg Thelen, KAMEZAWA Hiroyuki, Andrew Morton, linux-kernel,
	linux-mm, containers, Balbir Singh, Daisuke Nishimura

On Tue, 5 Oct 2010, Andrea Righi wrote:

> mmh... looking at the code it seems the same behaviour, but in
> Documentation/sysctl/vm.txt we say a different thing (i.e., for
> dirty_bytes):
> 
> "If dirty_bytes is written, dirty_ratio becomes a function of its value
> (dirty_bytes / the amount of dirtyable system memory)."
> 
> However, in dirty_bytes_handler()/dirty_ratio_handler() we actually set
> the counterpart value to 0.
> 
> I think we should clarify the documentation.
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>

Acked-by: David Rientjes <rientjes@google.com>

Thanks for cc'ing me on this, Andrea.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 07/10] memcg: add dirty limits to mem_cgroup
  2010-10-05  9:43   ` Andrea Righi
@ 2010-10-05 19:00     ` Greg Thelen
  2010-10-07  0:13       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 96+ messages in thread
From: Greg Thelen @ 2010-10-05 19:00 UTC (permalink / raw)
  To: Andrea Righi
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura

Andrea Righi <arighi@develer.com> writes:

> On Sun, Oct 03, 2010 at 11:58:02PM -0700, Greg Thelen wrote:
>> Extend mem_cgroup to contain dirty page limits.  Also add routines
>> allowing the kernel to query the dirty usage of a memcg.
>> 
>> These interfaces are not used by the kernel yet.  A subsequent commit
>> will add kernel calls to utilize these new routines.
>
> A small note below.
>
>> 
>> Signed-off-by: Greg Thelen <gthelen@google.com>
>> Signed-off-by: Andrea Righi <arighi@develer.com>
>> ---
>>  include/linux/memcontrol.h |   44 +++++++++++
>>  mm/memcontrol.c            |  180 +++++++++++++++++++++++++++++++++++++++++++-
>>  2 files changed, 223 insertions(+), 1 deletions(-)
>> 
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 6303da1..dc8952d 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -19,6 +19,7 @@
>>  
>>  #ifndef _LINUX_MEMCONTROL_H
>>  #define _LINUX_MEMCONTROL_H
>> +#include <linux/writeback.h>
>>  #include <linux/cgroup.h>
>>  struct mem_cgroup;
>>  struct page_cgroup;
>> @@ -33,6 +34,30 @@ enum mem_cgroup_write_page_stat_item {
>>  	MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
>>  };
>>  
>> +/* Cgroup memory statistics items exported to the kernel */
>> +enum mem_cgroup_read_page_stat_item {
>> +	MEMCG_NR_DIRTYABLE_PAGES,
>> +	MEMCG_NR_RECLAIM_PAGES,
>> +	MEMCG_NR_WRITEBACK,
>> +	MEMCG_NR_DIRTY_WRITEBACK_PAGES,
>> +};
>> +
>> +/* Dirty memory parameters */
>> +struct vm_dirty_param {
>> +	int dirty_ratio;
>> +	int dirty_background_ratio;
>> +	unsigned long dirty_bytes;
>> +	unsigned long dirty_background_bytes;
>> +};
>> +
>> +static inline void get_global_vm_dirty_param(struct vm_dirty_param *param)
>> +{
>> +	param->dirty_ratio = vm_dirty_ratio;
>> +	param->dirty_bytes = vm_dirty_bytes;
>> +	param->dirty_background_ratio = dirty_background_ratio;
>> +	param->dirty_background_bytes = dirty_background_bytes;
>> +}
>> +
>>  extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
>>  					struct list_head *dst,
>>  					unsigned long *scanned, int order,
>> @@ -145,6 +170,10 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
>>  	mem_cgroup_update_page_stat(page, idx, -1);
>>  }
>>  
>> +bool mem_cgroup_has_dirty_limit(void);
>> +void get_vm_dirty_param(struct vm_dirty_param *param);
>> +s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item);
>> +
>>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>>  						gfp_t gfp_mask);
>>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
>> @@ -326,6 +355,21 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
>>  {
>>  }
>>  
>> +static inline bool mem_cgroup_has_dirty_limit(void)
>> +{
>> +	return false;
>> +}
>> +
>> +static inline void get_vm_dirty_param(struct vm_dirty_param *param)
>> +{
>> +	get_global_vm_dirty_param(param);
>> +}
>> +
>> +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item)
>> +{
>> +	return -ENOSYS;
>> +}
>> +
>>  static inline
>>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>>  					    gfp_t gfp_mask)
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index f40839f..6ec2625 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -233,6 +233,10 @@ struct mem_cgroup {
>>  	atomic_t	refcnt;
>>  
>>  	unsigned int	swappiness;
>> +
>> +	/* control memory cgroup dirty pages */
>> +	struct vm_dirty_param dirty_param;
>> +
>>  	/* OOM-Killer disable */
>>  	int		oom_kill_disable;
>>  
>> @@ -1132,6 +1136,172 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
>>  	return swappiness;
>>  }
>>  
>> +/*
>> + * Returns a snapshot of the current dirty limits which is not synchronized with
>> + * the routines that change the dirty limits.  If this routine races with an
>> + * update to the dirty bytes/ratio value, then the caller must handle the case
>> + * where both dirty_[background_]_ratio and _bytes are set.
>> + */
>> +static void __mem_cgroup_get_dirty_param(struct vm_dirty_param *param,
>> +					 struct mem_cgroup *mem)
>> +{
>> +	if (mem && !mem_cgroup_is_root(mem)) {
>> +		param->dirty_ratio = mem->dirty_param.dirty_ratio;
>> +		param->dirty_bytes = mem->dirty_param.dirty_bytes;
>> +		param->dirty_background_ratio =
>> +			mem->dirty_param.dirty_background_ratio;
>> +		param->dirty_background_bytes =
>> +			mem->dirty_param.dirty_background_bytes;
>> +	} else {
>> +		get_global_vm_dirty_param(param);
>> +	}
>> +}
>> +
>> +/*
>> + * Get dirty memory parameters of the current memcg or global values (if memory
>> + * cgroups are disabled or querying the root cgroup).
>> + */
>> +void get_vm_dirty_param(struct vm_dirty_param *param)
>> +{
>> +	struct mem_cgroup *memcg;
>> +
>> +	if (mem_cgroup_disabled()) {
>> +		get_global_vm_dirty_param(param);
>> +		return;
>> +	}
>> +
>> +	/*
>> +	 * It's possible that "current" may be moved to other cgroup while we
>> +	 * access cgroup. But precise check is meaningless because the task can
>> +	 * be moved after our access and writeback tends to take long time.  At
>> +	 * least, "memcg" will not be freed under rcu_read_lock().
>> +	 */
>> +	rcu_read_lock();
>> +	memcg = mem_cgroup_from_task(current);
>> +	__mem_cgroup_get_dirty_param(param, memcg);
>> +	rcu_read_unlock();
>> +}
>> +
>> +/*
>> + * Check if current memcg has local dirty limits.  Return true if the current
>> + * memory cgroup has local dirty memory settings.
>> + */
>> +bool mem_cgroup_has_dirty_limit(void)
>> +{
>> +	struct mem_cgroup *mem;
>> +
>> +	if (mem_cgroup_disabled())
>> +		return false;
>> +
>> +	mem = mem_cgroup_from_task(current);
>> +	return mem && !mem_cgroup_is_root(mem);
>> +}
>
> We only check the pointer without dereferencing it, so this is probably
> ok, but maybe this is safer:
>
> bool mem_cgroup_has_dirty_limit(void)
> {
> 	struct mem_cgroup *mem;
> 	bool ret;
>
> 	if (mem_cgroup_disabled())
> 		return false;
>
> 	rcu_read_lock();
> 	mem = mem_cgroup_from_task(current);
> 	ret = mem && !mem_cgroup_is_root(mem);
> 	rcu_read_unlock();
>
> 	return ret;
> }
>
> rcu_read_lock() should be held in mem_cgroup_from_task(), otherwise
> lockdep could detect this as an error.
>
> Thanks,
> -Andrea

Good suggestion.  I agree that lockdep might catch this.  There are some
unrelated debug_locks failures (even without my patches) that I worked
around to get lockdep to complain about this one.  I applied your
suggested fix and lockdep was happy.  I will incorporate this fix into
the next revision of the patch series.

>> +
>> +static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
>> +{
>> +	if (!do_swap_account)
>> +		return nr_swap_pages > 0;
>> +	return !memcg->memsw_is_minimum &&
>> +		(res_counter_read_u64(&memcg->memsw, RES_LIMIT) > 0);
>> +}
>> +
>> +static s64 mem_cgroup_get_local_page_stat(struct mem_cgroup *mem,
>> +				enum mem_cgroup_read_page_stat_item item)
>> +{
>> +	s64 ret;
>> +
>> +	switch (item) {
>> +	case MEMCG_NR_DIRTYABLE_PAGES:
>> +		ret = mem_cgroup_read_stat(mem, LRU_ACTIVE_FILE) +
>> +			mem_cgroup_read_stat(mem, LRU_INACTIVE_FILE);
>> +		if (mem_cgroup_can_swap(mem))
>> +			ret += mem_cgroup_read_stat(mem, LRU_ACTIVE_ANON) +
>> +				mem_cgroup_read_stat(mem, LRU_INACTIVE_ANON);
>> +		break;
>> +	case MEMCG_NR_RECLAIM_PAGES:
>> +		ret = mem_cgroup_read_stat(mem,	MEM_CGROUP_STAT_FILE_DIRTY) +
>> +			mem_cgroup_read_stat(mem,
>> +					     MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
>> +		break;
>> +	case MEMCG_NR_WRITEBACK:
>> +		ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_WRITEBACK);
>> +		break;
>> +	case MEMCG_NR_DIRTY_WRITEBACK_PAGES:
>> +		ret = mem_cgroup_read_stat(mem,
>> +					   MEM_CGROUP_STAT_FILE_WRITEBACK) +
>> +			mem_cgroup_read_stat(mem,
>> +					     MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
>> +		break;
>> +	default:
>> +		BUG();
>> +		break;
>> +	}
>> +	return ret;
>> +}
>> +
>> +static unsigned long long
>> +memcg_get_hierarchical_free_pages(struct mem_cgroup *mem)
>> +{
>> +	struct cgroup *cgroup;
>> +	unsigned long long min_free, free;
>> +
>> +	min_free = res_counter_read_u64(&mem->res, RES_LIMIT) -
>> +		res_counter_read_u64(&mem->res, RES_USAGE);
>> +	cgroup = mem->css.cgroup;
>> +	if (!mem->use_hierarchy)
>> +		goto out;
>> +
>> +	while (cgroup->parent) {
>> +		cgroup = cgroup->parent;
>> +		mem = mem_cgroup_from_cont(cgroup);
>> +		if (!mem->use_hierarchy)
>> +			break;
>> +		free = res_counter_read_u64(&mem->res, RES_LIMIT) -
>> +			res_counter_read_u64(&mem->res, RES_USAGE);
>> +		min_free = min(min_free, free);
>> +	}
>> +out:
>> +	/* Translate free memory in pages */
>> +	return min_free >> PAGE_SHIFT;
>> +}
>> +
>> +/*
>> + * mem_cgroup_page_stat() - get memory cgroup file cache statistics
>> + * @item:      memory statistic item exported to the kernel
>> + *
>> + * Return the accounted statistic value.
>> + */
>> +s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item)
>> +{
>> +	struct mem_cgroup *mem;
>> +	struct mem_cgroup *iter;
>> +	s64 value;
>> +
>> +	rcu_read_lock();
>> +	mem = mem_cgroup_from_task(current);
>> +	if (mem && !mem_cgroup_is_root(mem)) {
>> +		/*
>> +		 * If we're looking for dirtyable pages we need to evaluate
>> +		 * free pages depending on the limit and usage of the parents
>> +		 * first of all.
>> +		 */
>> +		if (item == MEMCG_NR_DIRTYABLE_PAGES)
>> +			value = memcg_get_hierarchical_free_pages(mem);
>> +		else
>> +			value = 0;
>> +		/*
>> +		 * Recursively evaluate page statistics against all cgroup
>> +		 * under hierarchy tree
>> +		 */
>> +		for_each_mem_cgroup_tree(iter, mem)
>> +			value += mem_cgroup_get_local_page_stat(iter, item);
>> +	} else
>> +		value = -EINVAL;
>> +	rcu_read_unlock();
>> +
>> +	return value;
>> +}
>> +
>>  static void mem_cgroup_start_move(struct mem_cgroup *mem)
>>  {
>>  	int cpu;
>> @@ -4444,8 +4614,16 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>>  	spin_lock_init(&mem->reclaim_param_lock);
>>  	INIT_LIST_HEAD(&mem->oom_notify);
>>  
>> -	if (parent)
>> +	if (parent) {
>>  		mem->swappiness = get_swappiness(parent);
>> +		__mem_cgroup_get_dirty_param(&mem->dirty_param, parent);
>> +	} else {
>> +		/*
>> +		 * The root cgroup dirty_param field is not used, instead,
>> +		 * system-wide dirty limits are used.
>> +		 */
>> +	}
>> +
>>  	atomic_set(&mem->refcnt, 1);
>>  	mem->move_charge_at_immigrate = 0;
>>  	mutex_init(&mem->thresholds_lock);
>> -- 
>> 1.7.1
>> 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 03/10] memcg: create extensible page stat update routines
  2010-10-05 15:42   ` Minchan Kim
@ 2010-10-05 19:59     ` Greg Thelen
  2010-10-05 23:57       ` Minchan Kim
  0 siblings, 1 reply; 96+ messages in thread
From: Greg Thelen @ 2010-10-05 19:59 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura

Minchan Kim <minchan.kim@gmail.com> writes:

> On Sun, Oct 03, 2010 at 11:57:58PM -0700, Greg Thelen wrote:
>> Replace usage of the mem_cgroup_update_file_mapped() memcg
>> statistic update routine with two new routines:
>> * mem_cgroup_inc_page_stat()
>> * mem_cgroup_dec_page_stat()
>> 
>> As before, only the file_mapped statistic is managed.  However,
>> these more general interfaces allow for new statistics to be
>> more easily added.  New statistics are added with memcg dirty
>> page accounting.
>> 
>> Signed-off-by: Greg Thelen <gthelen@google.com>
>> Signed-off-by: Andrea Righi <arighi@develer.com>
>> ---
>>  include/linux/memcontrol.h |   31 ++++++++++++++++++++++++++++---
>>  mm/memcontrol.c            |   17 ++++++++---------
>>  mm/rmap.c                  |    4 ++--
>>  3 files changed, 38 insertions(+), 14 deletions(-)
>> 
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 159a076..7c7bec4 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -25,6 +25,11 @@ struct page_cgroup;
>>  struct page;
>>  struct mm_struct;
>>  
>> +/* Stats that can be updated by kernel. */
>> +enum mem_cgroup_write_page_stat_item {
>> +	MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
>> +};
>> +
>
> mem_cgroup_"write"_page_stat_item?
> Does "write" make sense to abstract page_state generally?

First I will summarize the portion of the design relevant to this
comment:

This patch series introduces two sets of memcg statistics.
a) the writable set of statistics the kernel updates when pages change
   state (example: when a page becomes dirty) using:
     mem_cgroup_inc_page_stat(struct page *page,
     				enum mem_cgroup_write_page_stat_item idx)
     mem_cgroup_dec_page_stat(struct page *page,
     				enum mem_cgroup_write_page_stat_item idx)

b) the read-only set of statistics the kernel queries to measure the
   amount of dirty memory used by the current cgroup using:
     s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item)

   This read-only set of statistics is a set of higher-level conceptual
   counters.  For example, MEMCG_NR_DIRTYABLE_PAGES is the sum of the
   counts of pages in various states (active + inactive).  mem_cgroup
   exports this value as a higher level counter rather than individual
   counters (active & inactive) to minimize the number of calls into
   mem_cgroup_page_stat().  This avoids extra cgroup tree iterations
   with for_each_mem_cgroup_tree().

Notice that each of the two sets of statistics are addressed by a
different type, mem_cgroup_{read vs write}_page_stat_item.
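
To make the split concrete, here is a rough sketch of a caller of each
interface (the example_* wrappers are made up; the memcg calls and enum
values are the ones from this series):

/* writer side: a page changes state and the kernel charges the memcg */
static void example_account_dirty(struct page *page)
{
	mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_DIRTY);
}

/* reader side: page writeback asks for a derived, higher level value */
static s64 example_memcg_reclaimable(void)
{
	return mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
}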

This particular patch (memcg: create extensible page stat update
routines) introduces part of this design.  A later patch I emailed
(memcg: add dirty limits to mem_cgroup) added
mem_cgroup_read_page_stat_item.


I think the code would read better if I renamed 
enum mem_cgroup_write_page_stat_item to 
enum mem_cgroup_update_page_stat_item.

Would this address your concern?

--
Greg

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 05/10] memcg: add dirty page accounting infrastructure
  2010-10-05 16:09   ` Minchan Kim
@ 2010-10-05 20:06     ` Greg Thelen
  0 siblings, 0 replies; 96+ messages in thread
From: Greg Thelen @ 2010-10-05 20:06 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura

Minchan Kim <minchan.kim@gmail.com> writes:

> On Sun, Oct 03, 2010 at 11:58:00PM -0700, Greg Thelen wrote:
>> Add memcg routines to track dirty, writeback, and unstable_NFS pages.
>> These routines are not yet used by the kernel to count such pages.
>> A later change adds kernel calls to these new routines.
>> 
>> Signed-off-by: Greg Thelen <gthelen@google.com>
>> Signed-off-by: Andrea Righi <arighi@develer.com>
>> ---
>>  include/linux/memcontrol.h |    3 +
>>  mm/memcontrol.c            |   89 ++++++++++++++++++++++++++++++++++++++++----
>>  2 files changed, 84 insertions(+), 8 deletions(-)
>> 
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 7c7bec4..6303da1 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -28,6 +28,9 @@ struct mm_struct;
>>  /* Stats that can be updated by kernel. */
>>  enum mem_cgroup_write_page_stat_item {
>>  	MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
>> +	MEMCG_NR_FILE_DIRTY, /* # of dirty pages in page cache */
>> +	MEMCG_NR_FILE_WRITEBACK, /* # of pages under writeback */
>> +	MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
>>  };
>>  
>>  extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 267d774..f40839f 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -85,10 +85,13 @@ enum mem_cgroup_stat_index {
>>  	 */
>>  	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
>>  	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
>> -	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
>>  	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
>>  	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
>>  	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
>> +	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
>> +	MEM_CGROUP_STAT_FILE_DIRTY,	/* # of dirty pages in page cache */
>> +	MEM_CGROUP_STAT_FILE_WRITEBACK,		/* # of pages under writeback */
>> +	MEM_CGROUP_STAT_FILE_UNSTABLE_NFS,	/* # of NFS unstable pages */
>>  	MEM_CGROUP_STAT_DATA, /* end of data requires synchronization */
>>  	/* incremented at every  pagein/pageout */
>>  	MEM_CGROUP_EVENTS = MEM_CGROUP_STAT_DATA,
>> @@ -1626,6 +1629,48 @@ void mem_cgroup_update_page_stat(struct page *page,
>>  			ClearPageCgroupFileMapped(pc);
>>  		idx = MEM_CGROUP_STAT_FILE_MAPPED;
>>  		break;
>> +
>> +	case MEMCG_NR_FILE_DIRTY:
>> +		/* Use Test{Set,Clear} to only un/charge the memcg once. */
>> +		if (val > 0) {
>> +			if (TestSetPageCgroupFileDirty(pc))
>> +				/* already set */
>
> Nitpick. 
> The comment doesn't give any useful information.
> It looks like redundant. 

I agree.  I removed the four redundant comments you referred to.  Thanks
for the feedback.

>> +				val = 0;
>> +		} else {
>> +			if (!TestClearPageCgroupFileDirty(pc))
>> +				/* already cleared */
>
> Ditto
>
>> +				val = 0;
>> +		}
>> +		idx = MEM_CGROUP_STAT_FILE_DIRTY;
>> +		break;
>> +
>> +	case MEMCG_NR_FILE_WRITEBACK:
>> +		/*
>> +		 * This counter is adjusted while holding the mapping's
>> +		 * tree_lock.  Therefore there is no race between settings and
>> +		 * clearing of this flag.
>> +		 */
>> +		if (val > 0)
>> +			SetPageCgroupFileWriteback(pc);
>> +		else
>> +			ClearPageCgroupFileWriteback(pc);
>> +		idx = MEM_CGROUP_STAT_FILE_WRITEBACK;
>> +		break;
>> +
>> +	case MEMCG_NR_FILE_UNSTABLE_NFS:
>> +		/* Use Test{Set,Clear} to only un/charge the memcg once. */
>> +		if (val > 0) {
>> +			if (TestSetPageCgroupFileUnstableNFS(pc))
>> +				/* already set */
>
> Ditto 
>
>> +				val = 0;
>> +		} else {
>> +			if (!TestClearPageCgroupFileUnstableNFS(pc))
>> +				/* already cleared */
>
> Ditto 
>
>> +				val = 0;
>> +		}
>> +		idx = MEM_CGROUP_STAT_FILE_UNSTABLE_NFS;
>> +		break;
>> +
>>  	default:
>>  		BUG();
>>  	}
>> @@ -2133,6 +2178,16 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
>>  	memcg_check_events(mem, pc->page);
>>  }
>>  
>> +static void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
>> +					      struct mem_cgroup *to,
>> +					      enum mem_cgroup_stat_index idx)
>> +{
>> +	preempt_disable();
>> +	__this_cpu_dec(from->stat->count[idx]);
>> +	__this_cpu_inc(to->stat->count[idx]);
>> +	preempt_enable();
>> +}
>> +
>>  /**
>>   * __mem_cgroup_move_account - move account of the page
>>   * @pc:	page_cgroup of the page.
>> @@ -2159,13 +2214,18 @@ static void __mem_cgroup_move_account(struct page_cgroup *pc,
>>  	VM_BUG_ON(!PageCgroupUsed(pc));
>>  	VM_BUG_ON(pc->mem_cgroup != from);
>>  
>> -	if (PageCgroupFileMapped(pc)) {
>> -		/* Update mapped_file data for mem_cgroup */
>> -		preempt_disable();
>> -		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
>> -		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
>> -		preempt_enable();
>> -	}
>> +	if (PageCgroupFileMapped(pc))
>> +		mem_cgroup_move_account_page_stat(from, to,
>> +					MEM_CGROUP_STAT_FILE_MAPPED);
>> +	if (PageCgroupFileDirty(pc))
>> +		mem_cgroup_move_account_page_stat(from, to,
>> +					MEM_CGROUP_STAT_FILE_DIRTY);
>> +	if (PageCgroupFileWriteback(pc))
>> +		mem_cgroup_move_account_page_stat(from, to,
>> +					MEM_CGROUP_STAT_FILE_WRITEBACK);
>> +	if (PageCgroupFileUnstableNFS(pc))
>> +		mem_cgroup_move_account_page_stat(from, to,
>> +					MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
>>  	mem_cgroup_charge_statistics(from, pc, false);
>>  	if (uncharge)
>>  		/* This is not "cancel", but cancel_charge does all we need. */
>> @@ -3545,6 +3605,9 @@ enum {
>>  	MCS_PGPGIN,
>>  	MCS_PGPGOUT,
>>  	MCS_SWAP,
>> +	MCS_FILE_DIRTY,
>> +	MCS_WRITEBACK,
>> +	MCS_UNSTABLE_NFS,
>>  	MCS_INACTIVE_ANON,
>>  	MCS_ACTIVE_ANON,
>>  	MCS_INACTIVE_FILE,
>> @@ -3567,6 +3630,9 @@ struct {
>>  	{"pgpgin", "total_pgpgin"},
>>  	{"pgpgout", "total_pgpgout"},
>>  	{"swap", "total_swap"},
>> +	{"dirty", "total_dirty"},
>> +	{"writeback", "total_writeback"},
>> +	{"nfs", "total_nfs"},
>>  	{"inactive_anon", "total_inactive_anon"},
>>  	{"active_anon", "total_active_anon"},
>>  	{"inactive_file", "total_inactive_file"},
>> @@ -3596,6 +3662,13 @@ mem_cgroup_get_local_stat(struct mem_cgroup *mem, struct mcs_total_stat *s)
>>  		s->stat[MCS_SWAP] += val * PAGE_SIZE;
>>  	}
>>  
>> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
>> +	s->stat[MCS_FILE_DIRTY] += val * PAGE_SIZE;
>> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_WRITEBACK);
>> +	s->stat[MCS_WRITEBACK] += val * PAGE_SIZE;
>> +	val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
>> +	s->stat[MCS_UNSTABLE_NFS] += val * PAGE_SIZE;
>> +
>>  	/* per zone stat */
>>  	val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
>>  	s->stat[MCS_INACTIVE_ANON] += val * PAGE_SIZE;
>> -- 
>> 1.7.1
>> 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 00/10] memcg: per cgroup dirty page accounting
  2010-10-04  6:57 [PATCH 00/10] memcg: per cgroup dirty page accounting Greg Thelen
                   ` (11 preceding siblings ...)
  2010-10-05  4:50 ` Balbir Singh
@ 2010-10-05 22:15 ` Andrea Righi
  2010-10-06  3:23 ` Balbir Singh
  2010-10-18  5:56 ` KAMEZAWA Hiroyuki
  14 siblings, 0 replies; 96+ messages in thread
From: Andrea Righi @ 2010-10-05 22:15 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Balbir Singh,
	KAMEZAWA Hiroyuki, Daisuke Nishimura

On Sun, Oct 03, 2010 at 11:57:55PM -0700, Greg Thelen wrote:
> This patch set provides the ability for each cgroup to have independent dirty
> page limits.
> 
> Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim)
> page cache used by a cgroup.  So, in case of multiple cgroup writers, they will
> not be able to consume more than their designated share of dirty pages and will
> be forced to perform write-out if they cross that limit.
> 
> These patches were developed and tested on mmotm 2010-09-28-16-13.  The patches
> are based on a series proposed by Andrea Righi in Mar 2010.
> 
> Overview:
> - Add page_cgroup flags to record when pages are dirty, in writeback, or nfs
>   unstable.
> - Extend mem_cgroup to record the total number of pages in each of the 
>   interesting dirty states (dirty, writeback, unstable_nfs).  
> - Add dirty parameters similar to the system-wide  /proc/sys/vm/dirty_*
>   limits to mem_cgroup.  The mem_cgroup dirty parameters are accessible
>   via cgroupfs control files.
> - Consider both system and per-memcg dirty limits in page writeback when
>   deciding to queue background writeback or block for foreground writeback.
> 
> Known shortcomings:
> - When a cgroup dirty limit is exceeded, then bdi writeback is employed to
>   writeback dirty inodes.  Bdi writeback considers inodes from any cgroup, not
>   just inodes contributing dirty pages to the cgroup exceeding its limit.  
> 
> Performance measurements:
> - kernel builds are unaffected unless run with a small dirty limit.
> - all data collected with CONFIG_CGROUP_MEM_RES_CTLR=y.
> - dd has three data points (in secs) for three data sizes (100M, 200M, and 1G).  
>   As expected, dd slows when it exceed its cgroup dirty limit.
> 
>                kernel_build          dd
> mmotm             2:37        0.18, 0.38, 1.65
>   root_memcg
> 
> mmotm             2:37        0.18, 0.35, 1.66
>   non-root_memcg
> 
> mmotm+patches     2:37        0.18, 0.35, 1.68
>   root_memcg
> 
> mmotm+patches     2:37        0.19, 0.35, 1.69
>   non-root_memcg
> 
> mmotm+patches     2:37        0.19, 2.34, 22.82
>   non-root_memcg
>   150 MiB memcg dirty limit
> 
> mmotm+patches     3:58        1.71, 3.38, 17.33
>   non-root_memcg
>   1 MiB memcg dirty limit

Hi Greg,

the patchset seems to work fine on my box.

I also ran a pretty simple test to directly verify the effectiveness of
the dirty memory limit, using a dd running on a non-root memcg:

  dd if=/dev/zero of=tmpfile bs=1M count=512

and monitoring the max of the "dirty" value in cgroup/memory.stat:

Here the results:
  dd in non-root memcg (  4 MiB memcg dirty limit): dirty max=4227072
  dd in non-root memcg (  8 MiB memcg dirty limit): dirty max=8454144
  dd in non-root memcg ( 16 MiB memcg dirty limit): dirty max=15179776
  dd in non-root memcg ( 32 MiB memcg dirty limit): dirty max=32235520
  dd in non-root memcg ( 64 MiB memcg dirty limit): dirty max=64245760
  dd in non-root memcg (128 MiB memcg dirty limit): dirty max=121028608
  dd in non-root memcg (256 MiB memcg dirty limit): dirty max=232865792
  dd in non-root memcg (512 MiB memcg dirty limit): dirty max=445194240

-Andrea

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()
  2010-10-05 16:03   ` Minchan Kim
@ 2010-10-05 23:26     ` Greg Thelen
  2010-10-06  0:15       ` Minchan Kim
  0 siblings, 1 reply; 96+ messages in thread
From: Greg Thelen @ 2010-10-05 23:26 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura

Minchan Kim <minchan.kim@gmail.com> writes:

> On Sun, Oct 03, 2010 at 11:57:59PM -0700, Greg Thelen wrote:
>> If pages are being migrated from a memcg, then updates to that
>> memcg's page statistics are protected by grabbing a bit spin lock
>> using lock_page_cgroup().  In an upcoming commit memcg dirty page
>> accounting will be updating memcg page accounting (specifically:
>> num writeback pages) from softirq.  Avoid a deadlocking nested
>> spin lock attempt by disabling interrupts on the local processor
>> when grabbing the page_cgroup bit_spin_lock in lock_page_cgroup().
>> This avoids the following deadlock:
>>       CPU 0             CPU 1
>>                     inc_file_mapped
>>                     rcu_read_lock
>>   start move
>>   synchronize_rcu
>>                     lock_page_cgroup
>>                       softirq
>>                       test_clear_page_writeback
>>                       mem_cgroup_dec_page_stat(NR_WRITEBACK)
>>                       rcu_read_lock
>>                       lock_page_cgroup   /* deadlock */
>>                       unlock_page_cgroup
>>                       rcu_read_unlock
>>                     unlock_page_cgroup
>>                     rcu_read_unlock
>> 
>> By disabling interrupts in lock_page_cgroup, nested calls
>> are avoided.  The softirq would be delayed until after inc_file_mapped
>> enables interrupts when calling unlock_page_cgroup().
>> 
>> The normal, fast path, of memcg page stat updates typically
>> does not need to call lock_page_cgroup(), so this change does
>> not affect the performance of the common case page accounting.
>> 
>> Signed-off-by: Andrea Righi <arighi@develer.com>
>> Signed-off-by: Greg Thelen <gthelen@google.com>
>> ---
>>  include/linux/page_cgroup.h |    8 +++++-
>>  mm/memcontrol.c             |   51 +++++++++++++++++++++++++-----------------
>>  2 files changed, 36 insertions(+), 23 deletions(-)
>> 
>> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
>> index b59c298..872f6b1 100644
>> --- a/include/linux/page_cgroup.h
>> +++ b/include/linux/page_cgroup.h
>> @@ -117,14 +117,18 @@ static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
>>  	return page_zonenum(pc->page);
>>  }
>>  
>> -static inline void lock_page_cgroup(struct page_cgroup *pc)
>> +static inline void lock_page_cgroup(struct page_cgroup *pc,
>> +				    unsigned long *flags)
>>  {
>> +	local_irq_save(*flags);
>>  	bit_spin_lock(PCG_LOCK, &pc->flags);
>>  }
>
> Hmm. Let me ask questions. 
>
> 1. Why do you add a new irq-disable region in a general function?
> I think __do_fault is one of the fast paths.

This is true.  I used pft to measure the cost of this extra locking
code.  This pft workload exercises this memcg call stack:
	lock_page_cgroup+0x39/0x5b
	__mem_cgroup_commit_charge+0x2c/0x98
	mem_cgroup_charge_common+0x66/0x76
	mem_cgroup_newpage_charge+0x40/0x4f
	handle_mm_fault+0x2e3/0x869
	do_page_fault+0x286/0x29b
	page_fault+0x1f/0x30

I ran 100 iterations of "pft -m 8g -t 16 -a" and focused on the
flt/cpu/s.

First I established a performance baseline using upstream mmotm locking
(not disabling interrupts).
	100 samples: mean 51930.16383  stddev 2.032% (1055.40818272)

Then I introduced this patch, which disabled interrupts in
lock_page_cgroup():
	100 samples: mean 52174.17434  stddev 1.306% (681.14442646)

Then I replaced this patch's usage of local_irq_save/restore() with
local_bh_disable/enable().
	100 samples: mean 51810.58591  stddev 1.892% (980.340335322)

The proposed patch (#2) actually improves allocation performance by
0.47% when compared to the baseline (#1).  However, I believe that this
is in the statistical noise.  This particular workload does not seem to
be affected by this patch.

> Could you disable softirq using _local_bh_disable_ in your context
> rather than in a general function?

lock_page_cgroup() is only used by mem_cgroup in memcontrol.c.

local_bh_disable() should also work instead of my proposed patch, which
used local_irq_save/restore().  local_bh_disable() will not disable all
interrupts so it should have less impact.  But I think that usage of
local_bh_disable() is still something that has to happen in the general
lock_page_cgroup() function.  The softirq can occur at an arbitrary time
and processor with the possibility of interrupting anyone who does not
have interrupts or softirq disabled.  Therefore the softirq could
interrupt code that has used lock_page_cgroup(), unless
lock_page_cgroup() explicitly (as proposed) disables interrupts (or
softirq).  If (as you suggest) some calls to lock_page_cgroup() did not
disable softirq, then a deadlock is possible because the softirq may
interrupt the holder of the page cgroup spinlock and the softirq routine
that also wants the spinlock would spin forever.

Is there a preference between local_bh_disable() and local_irq_save()?
Currently the patch uses local_irq_save().  However I think it could work
by local_bh_disable(), which might have less system impact.
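
For reference, the local_bh_disable() variant I measured above looks
roughly like this (sketch only, just the lock/unlock pair):

static inline void lock_page_cgroup(struct page_cgroup *pc)
{
	/* keep softirq (e.g. end of writeback) from preempting the holder */
	local_bh_disable();
	bit_spin_lock(PCG_LOCK, &pc->flags);
}

static inline void unlock_page_cgroup(struct page_cgroup *pc)
{
	bit_spin_unlock(PCG_LOCK, &pc->flags);
	local_bh_enable();
}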

> How many users do you expect to need the irq lock to update page state?
> What if they don't need to disable irq?

Are you asking how many cases need to disable irq to update page state?
Because there exists some code (the writeback memcg counter update) that
locks the spinlock in softirq, it must not be allowed to interrupt
any holders of the spinlock.  Therefore any code that locked the
page_cgroup spinlock must disable interrupts (or softirq) to prevent
being preempted by a softirq that will attempt to lock the same
spinlock.

> We could pass an argument indicating whether the irq lock is needed or not,
> but that seems to make the code very ugly.

This would be ugly and I do not think it would avoid the deadlock
because the softirq for the writeback may occur for a particular page at
any time.  Anyone who might be interrupted by this softirq must either:
a) not hold the page_cgroup spinlock
or
b) disable interrupts (or softirq) to avoid being preempted by code that
   may want the spinlock.

> 2. So could you solve the problem in your design?
> I mean, could you update page state outside of softirq?
> (I haven't looked at all of your patches. Sorry if I am missing something.)

The writeback statistics are normally updated for non-memcg in
test_clear_page_writeback().  Here is an example call stack (innermost
last):
	system_call_fastpath+0x16/0x1b
	sys_exit_group+0x17/0x1b
	do_group_exit+0x7d/0xa8
	do_exit+0x1fb/0x705
	exit_mm+0x129/0x136
	mmput+0x48/0xb9
	exit_mmap+0x96/0xe9
	unmap_vmas+0x52e/0x788
	page_remove_rmap+0x69/0x6d
	mem_cgroup_update_page_stat+0x191/0x1af
		<INTERRUPT>
		call_function_single_interrupt+0x13/0x20
		smp_call_function_single_interrupt+0x25/0x27
		irq_exit+0x4a/0x8c
		do_softirq+0x3d/0x85
		call_softirq+0x1c/0x3e
		__do_softirq+0xed/0x1e3
		blk_done_softirq+0x72/0x82
		scsi_softirq_done+0x10a/0x113
		scsi_finish_command+0xe8/0xf1
		scsi_io_completion+0x1b0/0x42c
		blk_end_request+0x10/0x12
		blk_end_bidi_request+0x1f/0x5d
		blk_update_bidi_request+0x20/0x6f
		blk_update_request+0x1a1/0x360
		req_bio_endio+0x96/0xb6
		bio_endio+0x31/0x33
		mpage_end_io_write+0x66/0x7d
		end_page_writeback+0x29/0x43
		test_clear_page_writeback+0xb6/0xef
		mem_cgroup_update_page_stat+0xb2/0x1af

Given that test_clear_page_writeback() is where the non-memcg stats are
updated, it seems like the most natural place to update
memcg writeback stats.  Theoretically we could introduce some sort of
work queue of pages that need writeback stat updates.
test_clear_page_writeback() would enqueue to-do work items to this list.
A worker thread (not running in softirq) would process this list and
apply the changes to the mem_cgroup.  This seems very complex and will
likely introduce a longer code path that will introduce even more
overhead.
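
Just to sketch what that alternative would look like (all of the names
below are invented to show the shape of it, not proposed code):

struct memcg_wb_update {
	struct list_head list;
	struct page *page;
};

static LIST_HEAD(memcg_wb_list);
static DEFINE_SPINLOCK(memcg_wb_lock);

/* called from test_clear_page_writeback() in softirq context */
static void memcg_defer_writeback_dec(struct page *page)
{
	struct memcg_wb_update *u = kmalloc(sizeof(*u), GFP_ATOMIC);

	if (!u)
		return;	/* the writeback decrement is silently lost */
	u->page = page;	/* the page would also need to be pinned */
	spin_lock(&memcg_wb_lock);
	list_add_tail(&u->list, &memcg_wb_list);
	spin_unlock(&memcg_wb_lock);
}

/*
 * A worker in process context would then take memcg_wb_lock with
 * spin_lock_bh(), drain the list, and call
 * mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_WRITEBACK) per entry.
 */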

> 3. Normally, we have updated page state without disabling irq.
> Why does memcg need it?

Non-memcg writeback stats updates do disable interrupts by using
spin_lock_irqsave().  See upstream test_clear_page_writeback() for
an example.

Memcg must determine the cgroup associated with the page to adjust that
cgroup's page counter.  Example: when a page writeback completes, the
associated mem_cgroup writeback page counter is decremented.  In memcg
this is complicated by the ability to migrate pages between cgroups.
When a page migration is in progress then locking is needed to ensure
that page's associated cgroup does not change until after the statistic
update is complete.  This migration race is already solved efficiently
in mmotm with mem_cgroup_stealed(), which safely avoids many
unneeded locking calls.  This proposed patch integrates with the
mem_cgroup_stealed() solution.
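
For context, the update path in this series has roughly this shape
(paraphrased from mem_cgroup_update_page_stat(), not the exact code; the
page_cgroup lock is only taken when a charge move may be in progress):

	rcu_read_lock();
	mem = pc->mem_cgroup;
	if (mem && PageCgroupUsed(pc)) {
		/* pc->mem_cgroup is only unstable while a move is in flight */
		if (unlikely(mem_cgroup_stealed(mem))) {
			lock_page_cgroup(pc, &flags);	/* irqs off with this patch */
			locked = true;
			mem = pc->mem_cgroup;		/* re-read under the lock */
		}
		/* ... adjust the per-cpu counter for "mem" ... */
	}
	if (locked)
		unlock_page_cgroup(pc, flags);
	rcu_read_unlock();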

> I hope we avoid adding irq-disable regions as far as possible.

I also do not like this, but do not see a better way.  We could use
local_bh_disable(), but I think it needs to be uniformly applied by
adding it to lock_page_cgroup().

--
Greg

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 03/10] memcg: create extensible page stat update routines
  2010-10-05 19:59     ` Greg Thelen
@ 2010-10-05 23:57       ` Minchan Kim
  2010-10-06  0:48         ` Greg Thelen
  0 siblings, 1 reply; 96+ messages in thread
From: Minchan Kim @ 2010-10-05 23:57 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura

On Wed, Oct 6, 2010 at 4:59 AM, Greg Thelen <gthelen@google.com> wrote:
> Minchan Kim <minchan.kim@gmail.com> writes:
>
>> On Sun, Oct 03, 2010 at 11:57:58PM -0700, Greg Thelen wrote:
>>> Replace usage of the mem_cgroup_update_file_mapped() memcg
>>> statistic update routine with two new routines:
>>> * mem_cgroup_inc_page_stat()
>>> * mem_cgroup_dec_page_stat()
>>>
>>> As before, only the file_mapped statistic is managed.  However,
>>> these more general interfaces allow for new statistics to be
>>> more easily added.  New statistics are added with memcg dirty
>>> page accounting.
>>>
>>> Signed-off-by: Greg Thelen <gthelen@google.com>
>>> Signed-off-by: Andrea Righi <arighi@develer.com>
>>> ---
>>>  include/linux/memcontrol.h |   31 ++++++++++++++++++++++++++++---
>>>  mm/memcontrol.c            |   17 ++++++++---------
>>>  mm/rmap.c                  |    4 ++--
>>>  3 files changed, 38 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>>> index 159a076..7c7bec4 100644
>>> --- a/include/linux/memcontrol.h
>>> +++ b/include/linux/memcontrol.h
>>> @@ -25,6 +25,11 @@ struct page_cgroup;
>>>  struct page;
>>>  struct mm_struct;
>>>
>>> +/* Stats that can be updated by kernel. */
>>> +enum mem_cgroup_write_page_stat_item {
>>> +    MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
>>> +};
>>> +
>>
>> mem_cgroup_"write"_page_stat_item?
>> Does "write" make sense to abstract page_state generally?
>
> First I will summarize the portion of the design relevant to this
> comment:
>
> This patch series introduces two sets of memcg statistics.
> a) the writable set of statistics the kernel updates when pages change
>   state (example: when a page becomes dirty) using:
>     mem_cgroup_inc_page_stat(struct page *page,
>                                enum mem_cgroup_write_page_stat_item idx)
>     mem_cgroup_dec_page_stat(struct page *page,
>                                enum mem_cgroup_write_page_stat_item idx)
>
> b) the read-only set of statistics the kernel queries to measure the
>   amount of dirty memory used by the current cgroup using:
>     s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item)
>
>   This read-only set of statistics is a set of higher-level conceptual
>   counters.  For example, MEMCG_NR_DIRTYABLE_PAGES is the sum of the
>   counts of pages in various states (active + inactive).  mem_cgroup
>   exports this value as a higher level counter rather than individual
>   counters (active & inactive) to minimize the number of calls into
>   mem_cgroup_page_stat().  This avoids extra cgroup tree iterations
>   with for_each_mem_cgroup_tree().
>
> Notice that each of the two sets of statistics are addressed by a
> different type, mem_cgroup_{read vs write}_page_stat_item.
>
> This particular patch (memcg: create extensible page stat update
> routines) introduces part of this design.  A later patch I emailed
> (memcg: add dirty limits to mem_cgroup) added
> mem_cgroup_read_page_stat_item.
>
>
> I think the code would read better if I renamed
> enum mem_cgroup_write_page_stat_item to
> enum mem_cgroup_update_page_stat_item.
>
> Would this address your concern?

Thanks for the kind explanation.
I understand your concept.

I think you treat update and query as completely different levels of
abstraction, but you could use similar terms.
Even the terms (write vs. read) make it more confusing for me.

How about renaming them as follows?

1. mem_cgroup_write_page_stat_item -> mem_cgroup_page_stat_item
2. mem_cgroup_read_page_stat_item -> mem_cgroup_nr_pages_item
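
In other words, something like this (the members are the ones from your
series; only the enum names change):

enum mem_cgroup_page_stat_item {
	MEMCG_NR_FILE_MAPPED,
	MEMCG_NR_FILE_DIRTY,
	MEMCG_NR_FILE_WRITEBACK,
	MEMCG_NR_FILE_UNSTABLE_NFS,
};

enum mem_cgroup_nr_pages_item {
	MEMCG_NR_DIRTYABLE_PAGES,
	MEMCG_NR_RECLAIM_PAGES,
	MEMCG_NR_WRITEBACK,
	MEMCG_NR_DIRTY_WRITEBACK_PAGES,
};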

At least that makes the code easier for me to understand.
But it's just my preference. If others think your naming is more
desirable, I am not strongly against it.

Thanks, Greg.

>
> --
> Greg
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()
  2010-10-05 23:26     ` Greg Thelen
@ 2010-10-06  0:15       ` Minchan Kim
  2010-10-07  0:35         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 96+ messages in thread
From: Minchan Kim @ 2010-10-06  0:15 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura

On Wed, Oct 6, 2010 at 8:26 AM, Greg Thelen <gthelen@google.com> wrote:
> Minchan Kim <minchan.kim@gmail.com> writes:
>
>> On Sun, Oct 03, 2010 at 11:57:59PM -0700, Greg Thelen wrote:
>>> If pages are being migrated from a memcg, then updates to that
>>> memcg's page statistics are protected by grabbing a bit spin lock
>>> using lock_page_cgroup().  In an upcoming commit memcg dirty page
>>> accounting will be updating memcg page accounting (specifically:
>>> num writeback pages) from softirq.  Avoid a deadlocking nested
>>> spin lock attempt by disabling interrupts on the local processor
>>> when grabbing the page_cgroup bit_spin_lock in lock_page_cgroup().
>>> This avoids the following deadlock:
>>>       CPU 0             CPU 1
>>>                     inc_file_mapped
>>>                     rcu_read_lock
>>>   start move
>>>   synchronize_rcu
>>>                     lock_page_cgroup
>>>                       softirq
>>>                       test_clear_page_writeback
>>>                       mem_cgroup_dec_page_stat(NR_WRITEBACK)
>>>                       rcu_read_lock
>>>                       lock_page_cgroup   /* deadlock */
>>>                       unlock_page_cgroup
>>>                       rcu_read_unlock
>>>                     unlock_page_cgroup
>>>                     rcu_read_unlock
>>>
>>> By disabling interrupts in lock_page_cgroup, nested calls
>>> are avoided.  The softirq would be delayed until after inc_file_mapped
>>> enables interrupts when calling unlock_page_cgroup().
>>>
>>> The normal, fast path, of memcg page stat updates typically
>>> does not need to call lock_page_cgroup(), so this change does
>>> not affect the performance of the common case page accounting.
>>>
>>> Signed-off-by: Andrea Righi <arighi@develer.com>
>>> Signed-off-by: Greg Thelen <gthelen@google.com>
>>> ---
>>>  include/linux/page_cgroup.h |    8 +++++-
>>>  mm/memcontrol.c             |   51 +++++++++++++++++++++++++-----------------
>>>  2 files changed, 36 insertions(+), 23 deletions(-)
>>>
>>> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
>>> index b59c298..872f6b1 100644
>>> --- a/include/linux/page_cgroup.h
>>> +++ b/include/linux/page_cgroup.h
>>> @@ -117,14 +117,18 @@ static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
>>>      return page_zonenum(pc->page);
>>>  }
>>>
>>> -static inline void lock_page_cgroup(struct page_cgroup *pc)
>>> +static inline void lock_page_cgroup(struct page_cgroup *pc,
>>> +                                unsigned long *flags)
>>>  {
>>> +    local_irq_save(*flags);
>>>      bit_spin_lock(PCG_LOCK, &pc->flags);
>>>  }
>>
>> Hmm. Let me ask questions.
>>
>> 1. Why do you add a new irq-disable region in a general function?
>> I think __do_fault is one of the fast paths.
>
> This is true.  I used pft to measure the cost of this extra locking
> code.  This pft workload exercises this memcg call stack:
>        lock_page_cgroup+0x39/0x5b
>        __mem_cgroup_commit_charge+0x2c/0x98
>        mem_cgroup_charge_common+0x66/0x76
>        mem_cgroup_newpage_charge+0x40/0x4f
>        handle_mm_fault+0x2e3/0x869
>        do_page_fault+0x286/0x29b
>        page_fault+0x1f/0x30
>
> I ran 100 iterations of "pft -m 8g -t 16 -a" and focused on the
> flt/cpu/s.
>
> First I established a performance baseline using upstream mmotm locking
> (not disabling interrupts).
>        100 samples: mean 51930.16383  stddev 2.032% (1055.40818272)
>
> Then I introduced this patch, which disabled interrupts in
> lock_page_cgroup():
>        100 samples: mean 52174.17434  stddev 1.306% (681.14442646)
>
> Then I replaced this patch's usage of local_irq_save/restore() with
> local_bh_disable/enable().
>        100 samples: mean 51810.58591  stddev 1.892% (980.340335322)
>
> The proposed patch (#2) actually improves allocation performance by
> 0.47% when compared to the baseline (#1).  However, I believe that this
> is in the statistical noise.  This particular workload does not seem to
> be affected by this patch.

Yes. But disabling irqs has an interrupt latency problem in addition to
the instruction overhead.
I have a concern about interrupt latency.
I have seen too many irq-disabled regions make irq handler latency
very long on embedded systems.
For example, irq handler latency is an important factor for ARM perf to
capture the program counter.
That's because ARM perf doesn't use an NMI handler.

>
>> Could you disable softirq using _local_bh_disable_ in your context
>> rather than in a general function?
>
> lock_page_cgroup() is only used by mem_cgroup in memcontrol.c.
>
> local_bh_disable() should also work instead of my proposed patch, which
> used local_irq_save/restore().  local_bh_disable() will not disable all
> interrupts so it should have less impact.  But I think that usage of
> local_bh_disable() is still something that has to happen in the general
> lock_page_cgroup() function.  The softirq can occur at an arbitrary time
> and processor with the possibility of interrupting anyone who does not
> have interrupts or softirq disabled.  Therefore the softirq could
> interrupt code that has used lock_page_cgroup(), unless
> lock_page_cgroup() explicitly (as proposed) disables interrupts (or
> softirq).  If (as you suggest) some calls to lock_page_cgroup() did not
> disable softirq, then a deadlock is possible because the softirq may
> interrupt the holder of the page cgroup spinlock and the softirq routine
> that also wants the spinlock would spin forever.
>
> Is there a preference between local_bh_disable() and local_irq_save()?
> Currently the patch uses local_irq_save().  However I think it could work
> by local_bh_disable(), which might have less system impact.

If many users need to update page stat in interrupt handlers in the future,
local_irq_save would be a good candidate. Otherwise, local_bh_disable doesn't
affect the system, as you said. We could add a comment like the following:

/*
 * NOTE:
 * If some user wants to update page stat in an interrupt handler,
 * we should consider local_irq_save instead of local_bh_disable.
 */

>
>> How many users do you expect to need the irq lock to update page state?
>> What if they don't need to disable irq?
>
> Are you asking how many cases need to disable irq to update page state?
> Because there exists some code (the writeback memcg counter update) that
> locks the spinlock in softirq, it must not be allowed to interrupt
> any holders of the spinlock.  Therefore any code that locked the
> page_cgroup spinlock must disable interrupts (or softirq) to prevent
> being preempted by a softirq that will attempt to lock the same
> spinlock.
>
>> We could pass an argument indicating whether the irq lock is needed or not,
>> but that seems to make the code very ugly.
>
> This would be ugly and I do not think it would avoid the deadlock
> because the softirq for the writeback may occur for a particular page at
> any time.  Anyone who might be interrupted by this softirq must either:
> a) not hold the page_cgroup spinlock
> or
> b) disable interrupts (or softirq) to avoid being preempted by code that
>   may want the spinlock.
>
>> 2. So could you solve the problem in your design?
>> I mean, could you update page state outside of softirq?
>> (I haven't looked at all of your patches. Sorry if I am missing something.)
>
> The writeback statistics are normally updated for non-memcg in
> test_clear_page_writeback().  Here is an example call stack (innermost
> last):
>        system_call_fastpath+0x16/0x1b
>        sys_exit_group+0x17/0x1b
>        do_group_exit+0x7d/0xa8
>        do_exit+0x1fb/0x705
>        exit_mm+0x129/0x136
>        mmput+0x48/0xb9
>        exit_mmap+0x96/0xe9
>        unmap_vmas+0x52e/0x788
>        page_remove_rmap+0x69/0x6d
>        mem_cgroup_update_page_stat+0x191/0x1af
>                <INTERRUPT>
>                call_function_single_interrupt+0x13/0x20
>                smp_call_function_single_interrupt+0x25/0x27
>                irq_exit+0x4a/0x8c
>                do_softirq+0x3d/0x85
>                call_softirq+0x1c/0x3e
>                __do_softirq+0xed/0x1e3
>                blk_done_softirq+0x72/0x82
>                scsi_softirq_done+0x10a/0x113
>                scsi_finish_command+0xe8/0xf1
>                scsi_io_completion+0x1b0/0x42c
>                blk_end_request+0x10/0x12
>                blk_end_bidi_request+0x1f/0x5d
>                blk_update_bidi_request+0x20/0x6f
>                blk_update_request+0x1a1/0x360
>                req_bio_endio+0x96/0xb6
>                bio_endio+0x31/0x33
>                mpage_end_io_write+0x66/0x7d
>                end_page_writeback+0x29/0x43
>                test_clear_page_writeback+0xb6/0xef
>                mem_cgroup_update_page_stat+0xb2/0x1af
>
> Given that test_clear_page_writeback() is where the non-memcg stats are
> updated, it seems like the most natural place to update
> memcg writeback stats.  Theoretically we could introduce some sort of
> work queue of pages that need writeback stat updates.
> test_clear_page_writeback() would enqueue to-do work items to this list.
> A worker thread (not running in softirq) would process this list and
> apply the changes to the mem_cgroup.  This seems very complex and would
> likely add a longer code path that introduces even more
> overhead.
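>
> For illustration (a sketch, not the exact hunk from this series; the
> helper name below is hypothetical, and MEMCG_NR_FILE_WRITEBACK is the
> stat item this series adds), the idea is simply that the memcg counter
> is decremented in the same softirq-time path as the global counter:
>
> /* sketch: called from test_clear_page_writeback() when PG_writeback clears */
> static void memcg_account_end_writeback(struct page *page)
> {
> 	mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_WRITEBACK);
> 	dec_zone_page_state(page, NR_WRITEBACK);
> }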

Agreed.

>
>> 3. Normally, we have updated page state without disabling irq.
>> Why does memcg need it?
>
> Non-memcg writeback stats updates do disable interrupts by using
> spin_lock_irqsave().  See upstream test_clear_page_writeback() for
> an example.
>
> Memcg must determine the cgroup associated with the page to adjust that
> cgroup's page counter.  Example: when a page writeback completes, the
> associated mem_cgroup writeback page counter is decremented.  In memcg
> this is complicated by the ability to migrate pages between cgroups.
> When a page migration is in progress then locking is needed to ensure
> that page's associated cgroup does not change until after the statistic
> update is complete.  This migration race is already solved efficiently
> in mmotm with mem_cgroup_stealed(), which safely avoids many
> unneeded locking calls.  This proposed patch integrates with the
> mem_cgroup_stealed() solution.
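>
> Roughly, the mmotm pattern this integrates with looks like the following
> (a condensed sketch of mem_cgroup_update_page_stat(), not a verbatim
> copy; the full context is visible in the patch 03/10 diff later in this
> thread):
>
> 	rcu_read_lock();
> 	mem = pc->mem_cgroup;
> 	if (unlikely(!mem || !PageCgroupUsed(pc)))
> 		goto out;
> 	/* is pc->mem_cgroup unstable because of a pending move? */
> 	if (unlikely(mem_cgroup_stealed(mem))) {
> 		/* take the lock so pc->mem_cgroup cannot change under us */
> 		lock_page_cgroup(pc);
> 		need_unlock = true;
> 		mem = pc->mem_cgroup;
> 		if (!mem || !PageCgroupUsed(pc))
> 			goto out;
> 	}
> 	/* ... update the per-memcg counter here ... */
> out:
> 	if (unlikely(need_unlock))
> 		unlock_page_cgroup(pc);
> 	rcu_read_unlock();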
>
>> I hope we avoid adding irq-disabled regions as far as possible.
>
> I also do not like this, but do not see a better way.  We could use
> local_bh_disable(), but I think it needs to be uniformly applied by
> adding it to lock_page_cgroup().


First of all, we could add your patch as it is, and I don't expect any
regression reports about interrupt latency.
That's because many embedded developers don't use mmotm and tend not to
report VM regressions.
Most of them don't even use memcg.  Hmm...

I'll pass the decision to the maintainers, Kame and Balbir.
Thanks for the detailed explanation.

>
> --
> Greg
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 10/10] memcg: check memcg dirty limits in page writeback
  2010-10-04  6:58 ` [PATCH 10/10] memcg: check memcg dirty limits in page writeback Greg Thelen
  2010-10-05  7:29   ` KAMEZAWA Hiroyuki
@ 2010-10-06  0:32   ` Minchan Kim
  1 sibling, 0 replies; 96+ messages in thread
From: Minchan Kim @ 2010-10-06  0:32 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura

On Mon, Oct 4, 2010 at 3:58 PM, Greg Thelen <gthelen@google.com> wrote:
> If the current process is in a non-root memcg, then
> global_dirty_limits() will consider the memcg dirty limit.
> This allows different cgroups to have distinct dirty limits
> which trigger direct and background writeback at different
> levels.
>
> Signed-off-by: Andrea Righi <arighi@develer.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>
> ---
>  mm/page-writeback.c |   87 ++++++++++++++++++++++++++++++++++++++++++---------
>  1 files changed, 72 insertions(+), 15 deletions(-)
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index a0bb3e2..c1db336 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -180,7 +180,7 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
>  * Returns the number of pages that can currently be freed and used
>  * by the kernel for direct mappings.
>  */
> -static unsigned long determine_dirtyable_memory(void)
> +static unsigned long get_global_dirtyable_memory(void)
>  {
>        unsigned long x;
>
> @@ -192,6 +192,58 @@ static unsigned long determine_dirtyable_memory(void)
>        return x + 1;   /* Ensure that we never return 0 */
>  }
>

Just a nitpick.
You seem to like the get_xxx naming.
But I think it's redundant and just makes the function name longer
without any benefit.
Many places in the kernel don't use get_xxx naming.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 01/10] memcg: add page_cgroup flags for dirty page tracking
  2010-10-04  6:57 ` [PATCH 01/10] memcg: add page_cgroup flags for dirty page tracking Greg Thelen
  2010-10-05  6:20   ` KAMEZAWA Hiroyuki
@ 2010-10-06  0:37   ` Daisuke Nishimura
  2010-10-06 11:07   ` Balbir Singh
  2 siblings, 0 replies; 96+ messages in thread
From: Daisuke Nishimura @ 2010-10-06  0:37 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura

On Sun,  3 Oct 2010 23:57:56 -0700
Greg Thelen <gthelen@google.com> wrote:

> Add additional flags to page_cgroup to track dirty pages
> within a mem_cgroup.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: Andrea Righi <arighi@develer.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>

Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

Thanks,
Daisuke Nishimura.

> ---
>  include/linux/page_cgroup.h |   23 +++++++++++++++++++++++
>  1 files changed, 23 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index 5bb13b3..b59c298 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -40,6 +40,9 @@ enum {
>  	PCG_USED, /* this object is in use. */
>  	PCG_ACCT_LRU, /* page has been accounted for */
>  	PCG_FILE_MAPPED, /* page is accounted as "mapped" */
> +	PCG_FILE_DIRTY, /* page is dirty */
> +	PCG_FILE_WRITEBACK, /* page is under writeback */
> +	PCG_FILE_UNSTABLE_NFS, /* page is NFS unstable */
>  	PCG_MIGRATION, /* under page migration */
>  };
>  
> @@ -59,6 +62,10 @@ static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
>  static inline int TestClearPageCgroup##uname(struct page_cgroup *pc)	\
>  	{ return test_and_clear_bit(PCG_##lname, &pc->flags);  }
>  
> +#define TESTSETPCGFLAG(uname, lname)			\
> +static inline int TestSetPageCgroup##uname(struct page_cgroup *pc)	\
> +	{ return test_and_set_bit(PCG_##lname, &pc->flags);  }
> +
>  TESTPCGFLAG(Locked, LOCK)
>  
>  /* Cache flag is set only once (at allocation) */
> @@ -80,6 +87,22 @@ SETPCGFLAG(FileMapped, FILE_MAPPED)
>  CLEARPCGFLAG(FileMapped, FILE_MAPPED)
>  TESTPCGFLAG(FileMapped, FILE_MAPPED)
>  
> +SETPCGFLAG(FileDirty, FILE_DIRTY)
> +CLEARPCGFLAG(FileDirty, FILE_DIRTY)
> +TESTPCGFLAG(FileDirty, FILE_DIRTY)
> +TESTCLEARPCGFLAG(FileDirty, FILE_DIRTY)
> +TESTSETPCGFLAG(FileDirty, FILE_DIRTY)
> +
> +SETPCGFLAG(FileWriteback, FILE_WRITEBACK)
> +CLEARPCGFLAG(FileWriteback, FILE_WRITEBACK)
> +TESTPCGFLAG(FileWriteback, FILE_WRITEBACK)
> +
> +SETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
> +CLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
> +TESTPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
> +TESTCLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
> +TESTSETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
> +
>  SETPCGFLAG(Migration, MIGRATION)
>  CLEARPCGFLAG(Migration, MIGRATION)
>  TESTPCGFLAG(Migration, MIGRATION)
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 03/10] memcg: create extensible page stat update routines
  2010-10-05 23:57       ` Minchan Kim
@ 2010-10-06  0:48         ` Greg Thelen
  0 siblings, 0 replies; 96+ messages in thread
From: Greg Thelen @ 2010-10-06  0:48 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura

Minchan Kim <minchan.kim@gmail.com> writes:

> On Wed, Oct 6, 2010 at 4:59 AM, Greg Thelen <gthelen@google.com> wrote:
>> Minchan Kim <minchan.kim@gmail.com> writes:
>>
>>> On Sun, Oct 03, 2010 at 11:57:58PM -0700, Greg Thelen wrote:
>>>> Replace usage of the mem_cgroup_update_file_mapped() memcg
>>>> statistic update routine with two new routines:
>>>> * mem_cgroup_inc_page_stat()
>>>> * mem_cgroup_dec_page_stat()
>>>>
>>>> As before, only the file_mapped statistic is managed.  However,
>>>> these more general interfaces allow for new statistics to be
>>>> more easily added.  New statistics are added with memcg dirty
>>>> page accounting.
>>>>
>>>> Signed-off-by: Greg Thelen <gthelen@google.com>
>>>> Signed-off-by: Andrea Righi <arighi@develer.com>
>>>> ---
>>>>  include/linux/memcontrol.h |   31 ++++++++++++++++++++++++++++---
>>>>  mm/memcontrol.c            |   17 ++++++++---------
>>>>  mm/rmap.c                  |    4 ++--
>>>>  3 files changed, 38 insertions(+), 14 deletions(-)
>>>>
>>>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>>>> index 159a076..7c7bec4 100644
>>>> --- a/include/linux/memcontrol.h
>>>> +++ b/include/linux/memcontrol.h
>>>> @@ -25,6 +25,11 @@ struct page_cgroup;
>>>>  struct page;
>>>>  struct mm_struct;
>>>>
>>>> +/* Stats that can be updated by kernel. */
>>>> +enum mem_cgroup_write_page_stat_item {
>>>> +    MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
>>>> +};
>>>> +
>>>
>>> mem_cgrou_"write"_page_stat_item?
>>> Does "write" make sense to abstract page_state generally?
>>
>> First I will summarize the portion of the design relevant to this
>> comment:
>>
>> This patch series introduces two sets of memcg statistics.
>> a) the writable set of statistics the kernel updates when pages change
>>   state (example: when a page becomes dirty) using:
>>     mem_cgroup_inc_page_stat(struct page *page,
>>                                enum mem_cgroup_write_page_stat_item idx)
>>     mem_cgroup_dec_page_stat(struct page *page,
>>                                enum mem_cgroup_write_page_stat_item idx)
>>
>> b) the read-only set of statistics the kernel queries to measure the
>>   amount of dirty memory used by the current cgroup using:
>>     s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item)
>>
>>   This read-only set of statistics is a set of higher-level conceptual
>>   counters.  For example, MEMCG_NR_DIRTYABLE_PAGES is the sum of the
>>   counts of pages in various states (active + inactive).  mem_cgroup
>>   exports this value as a higher level counter rather than individual
>>   counters (active & inactive) to minimize the number of calls into
>>   mem_cgroup_page_stat().  This avoids extra cgroup tree iterations
>>   via for_each_mem_cgroup_tree().
>>
>> Notice that each of the two sets of statistics are addressed by a
>> different type, mem_cgroup_{read vs write}_page_stat_item.
>>
>> This particular patch (memcg: create extensible page stat update
>> routines) introduces part of this design.  A later patch I emailed
>> (memcg: add dirty limits to mem_cgroup) added
>> mem_cgroup_read_page_stat_item.
>>
>>
>> I think the code would read better if I renamed
>> enum mem_cgroup_write_page_stat_item to
>> enum mem_cgroup_update_page_stat_item.
>>
>> Would this address your concern?
>
> Thanks for the kind explanation.
> I understand your concept.
>
> I think you are making update and query into completely different levels
> of abstraction, yet you are using similar terms.
> Even the terms (write vs. read) make it more confusing for me.
>
> How about the following renaming?
>
> 1. mem_cgroup_write_page_stat_item -> mem_cgroup_page_stat_item
> 2. mem_cgroup_read_page_stat_item -> mem_cgroup_nr_pages_item
>
> At least that makes the code easier for me to understand.
> But it's just my preference.  If others think your naming is more
> desirable, I am not strongly against it.

I think your suggestion is good.  I will include it in the next revision
of the patch series.
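
For concreteness, the renamed enums would look something like this (a
sketch only; the exact members may differ in V2, and the dirty/writeback
items are the ones introduced later in this series):

/* stats the kernel updates via mem_cgroup_{inc,dec}_page_stat() */
enum mem_cgroup_page_stat_item {
	MEMCG_NR_FILE_MAPPED,		/* # of pages charged as file rss */
	MEMCG_NR_FILE_DIRTY,		/* # of dirty pages */
	MEMCG_NR_FILE_WRITEBACK,	/* # of pages under writeback */
	MEMCG_NR_FILE_UNSTABLE_NFS,	/* # of NFS unstable pages */
};

/* higher-level read-only values queried via mem_cgroup_page_stat() */
enum mem_cgroup_nr_pages_item {
	MEMCG_NR_DIRTYABLE_PAGES,
	MEMCG_NR_RECLAIM_PAGES,
	MEMCG_NR_WRITEBACK,
	MEMCG_NR_DIRTY_WRITEBACK_PAGES,
};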

> Thanks, Greg.
>
>>
>> --
>> Greg
>>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 02/10] memcg: document cgroup dirty memory interfaces
  2010-10-04  6:57 ` [PATCH 02/10] memcg: document cgroup dirty memory interfaces Greg Thelen
  2010-10-05  6:48   ` KAMEZAWA Hiroyuki
@ 2010-10-06  0:49   ` Daisuke Nishimura
  2010-10-06 11:12   ` Balbir Singh
  2 siblings, 0 replies; 96+ messages in thread
From: Daisuke Nishimura @ 2010-10-06  0:49 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura

On Sun,  3 Oct 2010 23:57:57 -0700
Greg Thelen <gthelen@google.com> wrote:

> Document cgroup dirty memory interfaces and statistics.
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>

I think you will change "nfs" to "nfs_unstable", but anyway,

Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

Thanks
Daisuke Nishimura.

> ---
>  Documentation/cgroups/memory.txt |   37 +++++++++++++++++++++++++++++++++++++
>  1 files changed, 37 insertions(+), 0 deletions(-)
> 
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index 7781857..eab65e2 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -385,6 +385,10 @@ mapped_file	- # of bytes of mapped file (includes tmpfs/shmem)
>  pgpgin		- # of pages paged in (equivalent to # of charging events).
>  pgpgout		- # of pages paged out (equivalent to # of uncharging events).
>  swap		- # of bytes of swap usage
> +dirty		- # of bytes that are waiting to get written back to the disk.
> +writeback	- # of bytes that are actively being written back to the disk.
> +nfs		- # of bytes sent to the NFS server, but not yet committed to
> +		the actual storage.
>  inactive_anon	- # of bytes of anonymous memory and swap cache memory on
>  		LRU list.
>  active_anon	- # of bytes of anonymous and swap cache memory on active
> @@ -453,6 +457,39 @@ memory under it will be reclaimed.
>  You can reset failcnt by writing 0 to failcnt file.
>  # echo 0 > .../memory.failcnt
>  
> +5.5 dirty memory
> +
> +Control the maximum amount of dirty pages a cgroup can have at any given time.
> +
> +Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim)
> +page cache used by a cgroup.  So, in case of multiple cgroup writers, they will
> +not be able to consume more than their designated share of dirty pages and will
> +be forced to perform write-out if they cross that limit.
> +
> +The interface is equivalent to the procfs interface: /proc/sys/vm/dirty_*.  It
> +is possible to configure a limit to trigger both a direct writeback or a
> +background writeback performed by per-bdi flusher threads.  The root cgroup
> +memory.dirty_* control files are read-only and match the contents of
> +the /proc/sys/vm/dirty_* files.
> +
> +Per-cgroup dirty limits can be set using the following files in the cgroupfs:
> +
> +- memory.dirty_ratio: the amount of dirty memory (expressed as a percentage of
> +  cgroup memory) at which a process generating dirty pages will itself start
> +  writing out dirty data.
> +
> +- memory.dirty_bytes: the amount of dirty memory (expressed in bytes) in the
> +  cgroup at which a process generating dirty pages will start itself writing out
> +  dirty data.
> +
> +- memory.dirty_background_ratio: the amount of dirty memory of the cgroup
> +  (expressed as a percentage of cgroup memory) at which background writeback
> +  kernel threads will start writing out dirty data.
> +
> +- memory.dirty_background_bytes: the amount of dirty memory (expressed in bytes)
> +  in the cgroup at which background writeback kernel threads will start writing
> +  out dirty data.
> +
>  6. Hierarchy support
>  
>  The memory controller supports a deep hierarchy and hierarchical accounting.
> -- 
> 1.7.1
> 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 00/10] memcg: per cgroup dirty page accounting
  2010-10-04  6:57 [PATCH 00/10] memcg: per cgroup dirty page accounting Greg Thelen
                   ` (12 preceding siblings ...)
  2010-10-05 22:15 ` Andrea Righi
@ 2010-10-06  3:23 ` Balbir Singh
  2010-10-18  5:56 ` KAMEZAWA Hiroyuki
  14 siblings, 0 replies; 96+ messages in thread
From: Balbir Singh @ 2010-10-06  3:23 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	KAMEZAWA Hiroyuki, Daisuke Nishimura

* Greg Thelen <gthelen@google.com> [2010-10-03 23:57:55]:

> This patch set provides the ability for each cgroup to have independent dirty
> page limits.
> 
> Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim)
> page cache used by a cgroup.  So, in case of multiple cgroup writers, they will
> not be able to consume more than their designated share of dirty pages and will
> be forced to perform write-out if they cross that limit.
> 
> These patches were developed and tested on mmotm 2010-09-28-16-13.  The patches
> are based on a series proposed by Andrea Righi in Mar 2010.
> 
> Overview:
> - Add page_cgroup flags to record when pages are dirty, in writeback, or nfs
>   unstable.
> - Extend mem_cgroup to record the total number of pages in each of the 
>   interesting dirty states (dirty, writeback, unstable_nfs).  
> - Add dirty parameters similar to the system-wide  /proc/sys/vm/dirty_*
>   limits to mem_cgroup.  The mem_cgroup dirty parameters are accessible
>   via cgroupfs control files.
> - Consider both system and per-memcg dirty limits in page writeback when
>   deciding to queue background writeback or block for foreground writeback.
> 
> Known shortcomings:
> - When a cgroup dirty limit is exceeded, then bdi writeback is employed to
>   writeback dirty inodes.  Bdi writeback considers inodes from any cgroup, not
>   just inodes contributing dirty pages to the cgroup exceeding its limit.  
> 
> Performance measurements:
> - kernel builds are unaffected unless run with a small dirty limit.
> - all data collected with CONFIG_CGROUP_MEM_RES_CTLR=y.
> - dd has three data points (in secs) for three data sizes (100M, 200M, and 1G).  
>   As expected, dd slows when it exceed its cgroup dirty limit.
> 
>                kernel_build          dd
> mmotm             2:37        0.18, 0.38, 1.65
>   root_memcg
> 
> mmotm             2:37        0.18, 0.35, 1.66
>   non-root_memcg
> 
> mmotm+patches     2:37        0.18, 0.35, 1.68
>   root_memcg
> 
> mmotm+patches     2:37        0.19, 0.35, 1.69
>   non-root_memcg
> 
> mmotm+patches     2:37        0.19, 2.34, 22.82
>   non-root_memcg
>   150 MiB memcg dirty limit
> 
> mmotm+patches     3:58        1.71, 3.38, 17.33
>   non-root_memcg
>   1 MiB memcg dirty limit
>

Greg, could you please try the parallel page fault test?  Could you
look at commit 0c3e73e84fe3f64cf1c2e8bb4e91e8901cbcdc38 and
569b846df54ffb2827b83ce3244c5f032394cba4 for examples. 

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 01/10] memcg: add page_cgroup flags for dirty page tracking
  2010-10-04  6:57 ` [PATCH 01/10] memcg: add page_cgroup flags for dirty page tracking Greg Thelen
  2010-10-05  6:20   ` KAMEZAWA Hiroyuki
  2010-10-06  0:37   ` Daisuke Nishimura
@ 2010-10-06 11:07   ` Balbir Singh
  2 siblings, 0 replies; 96+ messages in thread
From: Balbir Singh @ 2010-10-06 11:07 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	KAMEZAWA Hiroyuki, Daisuke Nishimura

* Greg Thelen <gthelen@google.com> [2010-10-03 23:57:56]:

> Add additional flags to page_cgroup to track dirty pages
> within a mem_cgroup.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: Andrea Righi <arighi@develer.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>
> ---
>  include/linux/page_cgroup.h |   23 +++++++++++++++++++++++
>  1 files changed, 23 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index 5bb13b3..b59c298 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -40,6 +40,9 @@ enum {
>  	PCG_USED, /* this object is in use. */
>  	PCG_ACCT_LRU, /* page has been accounted for */
>  	PCG_FILE_MAPPED, /* page is accounted as "mapped" */
> +	PCG_FILE_DIRTY, /* page is dirty */
> +	PCG_FILE_WRITEBACK, /* page is under writeback */
> +	PCG_FILE_UNSTABLE_NFS, /* page is NFS unstable */
>  	PCG_MIGRATION, /* under page migration */
>  };
> 
> @@ -59,6 +62,10 @@ static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
>  static inline int TestClearPageCgroup##uname(struct page_cgroup *pc)	\
>  	{ return test_and_clear_bit(PCG_##lname, &pc->flags);  }
> 
> +#define TESTSETPCGFLAG(uname, lname)			\
> +static inline int TestSetPageCgroup##uname(struct page_cgroup *pc)	\
> +	{ return test_and_set_bit(PCG_##lname, &pc->flags);  }
> +
>  TESTPCGFLAG(Locked, LOCK)
> 
>  /* Cache flag is set only once (at allocation) */
> @@ -80,6 +87,22 @@ SETPCGFLAG(FileMapped, FILE_MAPPED)
>  CLEARPCGFLAG(FileMapped, FILE_MAPPED)
>  TESTPCGFLAG(FileMapped, FILE_MAPPED)
> 
> +SETPCGFLAG(FileDirty, FILE_DIRTY)
> +CLEARPCGFLAG(FileDirty, FILE_DIRTY)
> +TESTPCGFLAG(FileDirty, FILE_DIRTY)
> +TESTCLEARPCGFLAG(FileDirty, FILE_DIRTY)
> +TESTSETPCGFLAG(FileDirty, FILE_DIRTY)
> +
> +SETPCGFLAG(FileWriteback, FILE_WRITEBACK)
> +CLEARPCGFLAG(FileWriteback, FILE_WRITEBACK)
> +TESTPCGFLAG(FileWriteback, FILE_WRITEBACK)
> +
> +SETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
> +CLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
> +TESTPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
> +TESTCLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
> +TESTSETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
> +
>  SETPCGFLAG(Migration, MIGRATION)
>  CLEARPCGFLAG(Migration, MIGRATION)
>  TESTPCGFLAG(Migration, MIGRATION)

Looks good to me


Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
 

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 02/10] memcg: document cgroup dirty memory interfaces
  2010-10-04  6:57 ` [PATCH 02/10] memcg: document cgroup dirty memory interfaces Greg Thelen
  2010-10-05  6:48   ` KAMEZAWA Hiroyuki
  2010-10-06  0:49   ` Daisuke Nishimura
@ 2010-10-06 11:12   ` Balbir Singh
  2 siblings, 0 replies; 96+ messages in thread
From: Balbir Singh @ 2010-10-06 11:12 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	KAMEZAWA Hiroyuki, Daisuke Nishimura

* Greg Thelen <gthelen@google.com> [2010-10-03 23:57:57]:

> Document cgroup dirty memory interfaces and statistics.
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>


Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
 

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 08/10] memcg: add cgroupfs interface to memcg dirty limits
  2010-10-04  6:58 ` [PATCH 08/10] memcg: add cgroupfs interface to memcg dirty limits Greg Thelen
  2010-10-05  7:13   ` KAMEZAWA Hiroyuki
@ 2010-10-06 13:30   ` Balbir Singh
  2010-10-06 13:32     ` Balbir Singh
  2010-10-07  6:23   ` Ciju Rajan K
  2 siblings, 1 reply; 96+ messages in thread
From: Balbir Singh @ 2010-10-06 13:30 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	KAMEZAWA Hiroyuki, Daisuke Nishimura

* Greg Thelen <gthelen@google.com> [2010-10-03 23:58:03]:

> Add cgroupfs interface to memcg dirty page limits:
>   Direct write-out is controlled with:
>   - memory.dirty_ratio
>   - memory.dirty_bytes
> 
>   Background write-out is controlled with:
>   - memory.dirty_background_ratio
>   - memory.dirty_background_bytes
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>
> ---

The added interface is not uniform with the rest of our write
operations. Does the patch below help? I did a quick compile and run
test.


Make writes to memcg dirty tunables more uniform

From: Balbir Singh <balbir@linux.vnet.ibm.com>

Today we support the 'M', 'm', 'k', 'K', 'g' and 'G' suffixes for
general memcg writes. This patch provides the same functionality
for dirty tunables.
---

 mm/memcontrol.c |   47 +++++++++++++++++++++++++++++++++++++----------
 1 files changed, 37 insertions(+), 10 deletions(-)


diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2d45a0a..3c360e6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4323,6 +4323,41 @@ static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
 }
 
 static int
+mem_cgroup_dirty_write_string(struct cgroup *cgrp, struct cftype *cft,
+				const char *buffer)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	int type = cft->private;
+	int ret = -EINVAL;
+	unsigned long long val;
+
+	if (cgrp->parent == NULL)
+		return ret;
+
+	switch (type) {
+	case MEM_CGROUP_DIRTY_BYTES:
+		/* This function does all necessary parse...reuse it */
+		ret = res_counter_memparse_write_strategy(buffer, &val);
+		if (ret)
+			break;
+		memcg->dirty_param.dirty_bytes = val;
+		memcg->dirty_param.dirty_ratio  = 0;
+		break;
+	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
+		ret = res_counter_memparse_write_strategy(buffer, &val);
+		if (ret)
+			break;
+		memcg->dirty_param.dirty_background_bytes = val;
+		memcg->dirty_param.dirty_background_ratio = 0;
+		break;
+	default:
+		BUG();
+		break;
+	}
+	return ret;
+}
+
+static int
 mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
@@ -4338,18 +4373,10 @@ mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
 		memcg->dirty_param.dirty_ratio = val;
 		memcg->dirty_param.dirty_bytes = 0;
 		break;
-	case MEM_CGROUP_DIRTY_BYTES:
-		memcg->dirty_param.dirty_bytes = val;
-		memcg->dirty_param.dirty_ratio  = 0;
-		break;
 	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
 		memcg->dirty_param.dirty_background_ratio = val;
 		memcg->dirty_param.dirty_background_bytes = 0;
 		break;
-	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
-		memcg->dirty_param.dirty_background_bytes = val;
-		memcg->dirty_param.dirty_background_ratio = 0;
-		break;
 	default:
 		BUG();
 		break;
@@ -4429,7 +4456,7 @@ static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "dirty_bytes",
 		.read_u64 = mem_cgroup_dirty_read,
-		.write_u64 = mem_cgroup_dirty_write,
+		.write_string = mem_cgroup_dirty_write_string,
 		.private = MEM_CGROUP_DIRTY_BYTES,
 	},
 	{
@@ -4441,7 +4468,7 @@ static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "dirty_background_bytes",
 		.read_u64 = mem_cgroup_dirty_read,
-		.write_u64 = mem_cgroup_dirty_write,
+		.write_string = mem_cgroup_dirty_write_string,
 		.private = MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
 	},
 };

-- 
	Three Cheers,
	Balbir

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* Re: [PATCH 08/10] memcg: add cgroupfs interface to memcg dirty limits
  2010-10-06 13:30   ` Balbir Singh
@ 2010-10-06 13:32     ` Balbir Singh
  2010-10-06 16:21       ` Greg Thelen
  0 siblings, 1 reply; 96+ messages in thread
From: Balbir Singh @ 2010-10-06 13:32 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	KAMEZAWA Hiroyuki, Daisuke Nishimura

* Balbir Singh <balbir@linux.vnet.ibm.com> [2010-10-06 19:00:24]:

> * Greg Thelen <gthelen@google.com> [2010-10-03 23:58:03]:
> 
> > Add cgroupfs interface to memcg dirty page limits:
> >   Direct write-out is controlled with:
> >   - memory.dirty_ratio
> >   - memory.dirty_bytes
> > 
> >   Background write-out is controlled with:
> >   - memory.dirty_background_ratio
> >   - memory.dirty_background_bytes
> > 
> > Signed-off-by: Andrea Righi <arighi@develer.com>
> > Signed-off-by: Greg Thelen <gthelen@google.com>
> > ---
> 
> The added interface is not uniform with the rest of our write
> operations. Does the patch below help? I did a quick compile and run
> test.

Here is a version with my Signed-off-by.


Make writes to memcg dirty tunables more uniform

From: Balbir Singh <balbir@linux.vnet.ibm.com>

Today we support the 'M', 'm', 'k', 'K', 'g' and 'G' suffixes for
general memcg writes. This patch provides the same functionality
for dirty tunables.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 mm/memcontrol.c |   47 +++++++++++++++++++++++++++++++++++++----------
 1 files changed, 37 insertions(+), 10 deletions(-)


diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2d45a0a..116fecd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4323,6 +4323,41 @@ static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
 }
 
 static int
+mem_cgroup_dirty_write_string(struct cgroup *cgrp, struct cftype *cft,
+				const char *buffer)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	int type = cft->private;
+	int ret = -EINVAL;
+	unsigned long long val;
+
+	if (cgrp->parent == NULL)
+		return ret;
+
+	switch (type) {
+	case MEM_CGROUP_DIRTY_BYTES:
+		/* This function does all necessary parse...reuse it */
+		ret = res_counter_memparse_write_strategy(buffer, &val);
+		if (ret)
+			break;
+		memcg->dirty_param.dirty_bytes = val;
+		memcg->dirty_param.dirty_ratio  = 0;
+		break;
+	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
+		ret = res_counter_memparse_write_strategy(buffer, &val);
+		if (ret)
+			break;
+		memcg->dirty_param.dirty_background_bytes = val;
+		memcg->dirty_param.dirty_background_ratio = 0;
+		break;
+	default:
+		BUG();
+		break;
+	}
+	return ret;
+}
+
+static int
 mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
@@ -4338,18 +4373,10 @@ mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
 		memcg->dirty_param.dirty_ratio = val;
 		memcg->dirty_param.dirty_bytes = 0;
 		break;
-	case MEM_CGROUP_DIRTY_BYTES:
-		memcg->dirty_param.dirty_bytes = val;
-		memcg->dirty_param.dirty_ratio  = 0;
-		break;
 	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
 		memcg->dirty_param.dirty_background_ratio = val;
 		memcg->dirty_param.dirty_background_bytes = 0;
 		break;
-	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
-		memcg->dirty_param.dirty_background_bytes = val;
-		memcg->dirty_param.dirty_background_ratio = 0;
-		break;
 	default:
 		BUG();
 		break;
@@ -4429,7 +4456,7 @@ static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "dirty_bytes",
 		.read_u64 = mem_cgroup_dirty_read,
-		.write_u64 = mem_cgroup_dirty_write,
+		.write_string = mem_cgroup_dirty_write_string,
 		.private = MEM_CGROUP_DIRTY_BYTES,
 	},
 	{
@@ -4441,7 +4468,7 @@ static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "dirty_background_bytes",
 		.read_u64 = mem_cgroup_dirty_read,
-		.write_u64 = mem_cgroup_dirty_write,
+		.write_string = mem_cgroup_dirty_write_string,
 		.private = MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
 	},
 };

-- 
	Three Cheers,
	Balbir

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* Re: [PATCH 03/10] memcg: create extensible page stat update routines
  2010-10-04  6:57 ` [PATCH 03/10] memcg: create extensible page stat update routines Greg Thelen
                     ` (2 preceding siblings ...)
  2010-10-05 15:42   ` Minchan Kim
@ 2010-10-06 16:19   ` Balbir Singh
  3 siblings, 0 replies; 96+ messages in thread
From: Balbir Singh @ 2010-10-06 16:19 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	KAMEZAWA Hiroyuki, Daisuke Nishimura

* Greg Thelen <gthelen@google.com> [2010-10-03 23:57:58]:

> Replace usage of the mem_cgroup_update_file_mapped() memcg
> statistic update routine with two new routines:
> * mem_cgroup_inc_page_stat()
> * mem_cgroup_dec_page_stat()
> 
> As before, only the file_mapped statistic is managed.  However,
> these more general interfaces allow for new statistics to be
> more easily added.  New statistics are added with memcg dirty
> page accounting.
> 
> Signed-off-by: Greg Thelen <gthelen@google.com>
> Signed-off-by: Andrea Righi <arighi@develer.com>
> ---
>  include/linux/memcontrol.h |   31 ++++++++++++++++++++++++++++---
>  mm/memcontrol.c            |   17 ++++++++---------
>  mm/rmap.c                  |    4 ++--
>  3 files changed, 38 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 159a076..7c7bec4 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -25,6 +25,11 @@ struct page_cgroup;
>  struct page;
>  struct mm_struct;
> 
> +/* Stats that can be updated by kernel. */
> +enum mem_cgroup_write_page_stat_item {
> +	MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
> +};
> +
>  extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
>  					struct list_head *dst,
>  					unsigned long *scanned, int order,
> @@ -121,7 +126,22 @@ static inline bool mem_cgroup_disabled(void)
>  	return false;
>  }
> 
> -void mem_cgroup_update_file_mapped(struct page *page, int val);
> +void mem_cgroup_update_page_stat(struct page *page,
> +				 enum mem_cgroup_write_page_stat_item idx,
> +				 int val);
> +
> +static inline void mem_cgroup_inc_page_stat(struct page *page,
> +				enum mem_cgroup_write_page_stat_item idx)
> +{
> +	mem_cgroup_update_page_stat(page, idx, 1);
> +}
> +
> +static inline void mem_cgroup_dec_page_stat(struct page *page,
> +				enum mem_cgroup_write_page_stat_item idx)
> +{
> +	mem_cgroup_update_page_stat(page, idx, -1);
> +}
> +
>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  						gfp_t gfp_mask);
>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
> @@ -293,8 +313,13 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
>  {
>  }
> 
> -static inline void mem_cgroup_update_file_mapped(struct page *page,
> -							int val)
> +static inline void mem_cgroup_inc_page_stat(struct page *page,
> +				enum mem_cgroup_write_page_stat_item idx)
> +{
> +}
> +
> +static inline void mem_cgroup_dec_page_stat(struct page *page,
> +				enum mem_cgroup_write_page_stat_item idx)
>  {
>  }
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 512cb12..f4259f4 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1592,7 +1592,9 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
>   * possibility of race condition. If there is, we take a lock.
>   */
> 
> -static void mem_cgroup_update_file_stat(struct page *page, int idx, int val)
> +void mem_cgroup_update_page_stat(struct page *page,
> +				 enum mem_cgroup_write_page_stat_item idx,
> +				 int val)
>  {
>  	struct mem_cgroup *mem;
>  	struct page_cgroup *pc = lookup_page_cgroup(page);
> @@ -1615,30 +1617,27 @@ static void mem_cgroup_update_file_stat(struct page *page, int idx, int val)
>  			goto out;
>  	}
> 
> -	this_cpu_add(mem->stat->count[idx], val);
> -
>  	switch (idx) {
> -	case MEM_CGROUP_STAT_FILE_MAPPED:
> +	case MEMCG_NR_FILE_MAPPED:
>  		if (val > 0)
>  			SetPageCgroupFileMapped(pc);
>  		else if (!page_mapped(page))
>  			ClearPageCgroupFileMapped(pc);
> +		idx = MEM_CGROUP_STAT_FILE_MAPPED;
>  		break;
>  	default:
>  		BUG();
>  	}
> 
> +	this_cpu_add(mem->stat->count[idx], val);
> +
>  out:
>  	if (unlikely(need_unlock))
>  		unlock_page_cgroup(pc);
>  	rcu_read_unlock();
>  	return;
>  }
> -
> -void mem_cgroup_update_file_mapped(struct page *page, int val)
> -{
> -	mem_cgroup_update_file_stat(page, MEM_CGROUP_STAT_FILE_MAPPED, val);
> -}
> +EXPORT_SYMBOL(mem_cgroup_update_page_stat);
> 
>  /*
>   * size of first charge trial. "32" comes from vmscan.c's magic value.
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 8734312..779c0db 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -912,7 +912,7 @@ void page_add_file_rmap(struct page *page)
>  {
>  	if (atomic_inc_and_test(&page->_mapcount)) {
>  		__inc_zone_page_state(page, NR_FILE_MAPPED);
> -		mem_cgroup_update_file_mapped(page, 1);
> +		mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
>  	}
>  }
> 
> @@ -950,7 +950,7 @@ void page_remove_rmap(struct page *page)
>  		__dec_zone_page_state(page, NR_ANON_PAGES);
>  	} else {
>  		__dec_zone_page_state(page, NR_FILE_MAPPED);
> -		mem_cgroup_update_file_mapped(page, -1);
> +		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
>  	}
>  	/*
>  	 * It would be tidy to reset the PageAnon mapping here,


Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
 

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 08/10] memcg: add cgroupfs interface to memcg dirty limits
  2010-10-06 13:32     ` Balbir Singh
@ 2010-10-06 16:21       ` Greg Thelen
  2010-10-06 16:24         ` Balbir Singh
  0 siblings, 1 reply; 96+ messages in thread
From: Greg Thelen @ 2010-10-06 16:21 UTC (permalink / raw)
  To: balbir
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	KAMEZAWA Hiroyuki, Daisuke Nishimura

Balbir Singh <balbir@linux.vnet.ibm.com> writes:

> * Balbir Singh <balbir@linux.vnet.ibm.com> [2010-10-06 19:00:24]:
>
>> * Greg Thelen <gthelen@google.com> [2010-10-03 23:58:03]:
>> 
>> > Add cgroupfs interface to memcg dirty page limits:
>> >   Direct write-out is controlled with:
>> >   - memory.dirty_ratio
>> >   - memory.dirty_bytes
>> > 
>> >   Background write-out is controlled with:
>> >   - memory.dirty_background_ratio
>> >   - memory.dirty_background_bytes
>> > 
>> > Signed-off-by: Andrea Righi <arighi@develer.com>
>> > Signed-off-by: Greg Thelen <gthelen@google.com>
>> > ---
>> 
>> The added interface is not uniform with the rest of our write
>> operations. Does the patch below help? I did a quick compile and run
>> test.
> here is a version with my signed-off-by
>
>
> Make writes to memcg dirty tunables more uniform
>
> From: Balbir Singh <balbir@linux.vnet.ibm.com>
>
> Today we support the 'M', 'm', 'k', 'K', 'g' and 'G' suffixes for
> general memcg writes. This patch provides the same functionality
> for dirty tunables.
>
> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
> ---
>
>  mm/memcontrol.c |   47 +++++++++++++++++++++++++++++++++++++----------
>  1 files changed, 37 insertions(+), 10 deletions(-)
>
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2d45a0a..116fecd 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4323,6 +4323,41 @@ static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
>  }
>  
>  static int
> +mem_cgroup_dirty_write_string(struct cgroup *cgrp, struct cftype *cft,
> +				const char *buffer)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +	int type = cft->private;
> +	int ret = -EINVAL;
> +	unsigned long long val;
> +
> +	if (cgrp->parent == NULL)
> +		return ret;
> +
> +	switch (type) {
> +	case MEM_CGROUP_DIRTY_BYTES:
> +		/* This function does all necessary parse...reuse it */
> +		ret = res_counter_memparse_write_strategy(buffer, &val);
> +		if (ret)
> +			break;
> +		memcg->dirty_param.dirty_bytes = val;
> +		memcg->dirty_param.dirty_ratio  = 0;
> +		break;
> +	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
> +		ret = res_counter_memparse_write_strategy(buffer, &val);
> +		if (ret)
> +			break;
> +		memcg->dirty_param.dirty_background_bytes = val;
> +		memcg->dirty_param.dirty_background_ratio = 0;
> +		break;
> +	default:
> +		BUG();
> +		break;
> +	}
> +	return ret;
> +}
> +
> +static int
>  mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> @@ -4338,18 +4373,10 @@ mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
>  		memcg->dirty_param.dirty_ratio = val;
>  		memcg->dirty_param.dirty_bytes = 0;
>  		break;
> -	case MEM_CGROUP_DIRTY_BYTES:
> -		memcg->dirty_param.dirty_bytes = val;
> -		memcg->dirty_param.dirty_ratio  = 0;
> -		break;
>  	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
>  		memcg->dirty_param.dirty_background_ratio = val;
>  		memcg->dirty_param.dirty_background_bytes = 0;
>  		break;
> -	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
> -		memcg->dirty_param.dirty_background_bytes = val;
> -		memcg->dirty_param.dirty_background_ratio = 0;
> -		break;
>  	default:
>  		BUG();
>  		break;
> @@ -4429,7 +4456,7 @@ static struct cftype mem_cgroup_files[] = {
>  	{
>  		.name = "dirty_bytes",
>  		.read_u64 = mem_cgroup_dirty_read,
> -		.write_u64 = mem_cgroup_dirty_write,
> +		.write_string = mem_cgroup_dirty_write_string,
>  		.private = MEM_CGROUP_DIRTY_BYTES,
>  	},
>  	{
> @@ -4441,7 +4468,7 @@ static struct cftype mem_cgroup_files[] = {
>  	{
>  		.name = "dirty_background_bytes",
>  		.read_u64 = mem_cgroup_dirty_read,
> -		.write_u64 = mem_cgroup_dirty_write,
> +		.write_string = mem_cgroup_dirty_write_string,
>  		.private = MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
>  	},
>  };

Looks good to me.  I am currently gathering performance data on the memcg
series.  It should be done in an hour or so.  I'll then repost V2 of the
memcg dirty limits series.  I'll integrate this patch into the series,
unless there's objection.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 08/10] memcg: add cgroupfs interface to memcg dirty limits
  2010-10-06 16:21       ` Greg Thelen
@ 2010-10-06 16:24         ` Balbir Singh
  0 siblings, 0 replies; 96+ messages in thread
From: Balbir Singh @ 2010-10-06 16:24 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	KAMEZAWA Hiroyuki, Daisuke Nishimura

* Greg Thelen <gthelen@google.com> [2010-10-06 09:21:55]:

> Looks good to me.  I am currently gather performance data on the memcg
> series.  It should be done in an hour or so.  I'll then repost V2 of the
> memcg dirty limits series.  I'll integrate this patch into the series,
> unless there's objection.
>

Please go ahead and incorporate it. Thanks! 

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 08/10] memcg: add cgroupfs interface to memcg dirty limits
  2010-10-05  9:18       ` Andrea Righi
  2010-10-05 18:31         ` David Rientjes
@ 2010-10-06 18:34         ` Greg Thelen
  2010-10-06 20:54           ` Andrea Righi
  1 sibling, 1 reply; 96+ messages in thread
From: Greg Thelen @ 2010-10-06 18:34 UTC (permalink / raw)
  To: Andrea Righi
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, linux-kernel, linux-mm,
	containers, Balbir Singh, Daisuke Nishimura, David Rientjes

Andrea Righi <arighi@develer.com> writes:

> On Tue, Oct 05, 2010 at 12:33:15AM -0700, Greg Thelen wrote:
>> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:
>> 
>> > On Sun,  3 Oct 2010 23:58:03 -0700
>> > Greg Thelen <gthelen@google.com> wrote:
>> >
>> >> Add cgroupfs interface to memcg dirty page limits:
>> >>   Direct write-out is controlled with:
>> >>   - memory.dirty_ratio
>> >>   - memory.dirty_bytes
>> >> 
>> >>   Background write-out is controlled with:
>> >>   - memory.dirty_background_ratio
>> >>   - memory.dirty_background_bytes
>> >> 
>> >> Signed-off-by: Andrea Righi <arighi@develer.com>
>> >> Signed-off-by: Greg Thelen <gthelen@google.com>
>> >
>> > Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> >
>> > a question below.
>> >
>> >
>> >> ---
>> >>  mm/memcontrol.c |   89 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >>  1 files changed, 89 insertions(+), 0 deletions(-)
>> >> 
>> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> >> index 6ec2625..2d45a0a 100644
>> >> --- a/mm/memcontrol.c
>> >> +++ b/mm/memcontrol.c
>> >> @@ -100,6 +100,13 @@ enum mem_cgroup_stat_index {
>> >>  	MEM_CGROUP_STAT_NSTATS,
>> >>  };
>> >>  
>> >> +enum {
>> >> +	MEM_CGROUP_DIRTY_RATIO,
>> >> +	MEM_CGROUP_DIRTY_BYTES,
>> >> +	MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
>> >> +	MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
>> >> +};
>> >> +
>> >>  struct mem_cgroup_stat_cpu {
>> >>  	s64 count[MEM_CGROUP_STAT_NSTATS];
>> >>  };
>> >> @@ -4292,6 +4299,64 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
>> >>  	return 0;
>> >>  }
>> >>  
>> >> +static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
>> >> +{
>> >> +	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
>> >> +	bool root;
>> >> +
>> >> +	root = mem_cgroup_is_root(mem);
>> >> +
>> >> +	switch (cft->private) {
>> >> +	case MEM_CGROUP_DIRTY_RATIO:
>> >> +		return root ? vm_dirty_ratio : mem->dirty_param.dirty_ratio;
>> >> +	case MEM_CGROUP_DIRTY_BYTES:
>> >> +		return root ? vm_dirty_bytes : mem->dirty_param.dirty_bytes;
>> >> +	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
>> >> +		return root ? dirty_background_ratio :
>> >> +			mem->dirty_param.dirty_background_ratio;
>> >> +	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
>> >> +		return root ? dirty_background_bytes :
>> >> +			mem->dirty_param.dirty_background_bytes;
>> >> +	default:
>> >> +		BUG();
>> >> +	}
>> >> +}
>> >> +
>> >> +static int
>> >> +mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
>> >> +{
>> >> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
>> >> +	int type = cft->private;
>> >> +
>> >> +	if (cgrp->parent == NULL)
>> >> +		return -EINVAL;
>> >> +	if ((type == MEM_CGROUP_DIRTY_RATIO ||
>> >> +	     type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)
>> >> +		return -EINVAL;
>> >> +	switch (type) {
>> >> +	case MEM_CGROUP_DIRTY_RATIO:
>> >> +		memcg->dirty_param.dirty_ratio = val;
>> >> +		memcg->dirty_param.dirty_bytes = 0;
>> >> +		break;
>> >> +	case MEM_CGROUP_DIRTY_BYTES:
>> >> +		memcg->dirty_param.dirty_bytes = val;
>> >> +		memcg->dirty_param.dirty_ratio  = 0;
>> >> +		break;
>> >> +	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
>> >> +		memcg->dirty_param.dirty_background_ratio = val;
>> >> +		memcg->dirty_param.dirty_background_bytes = 0;
>> >> +		break;
>> >> +	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
>> >> +		memcg->dirty_param.dirty_background_bytes = val;
>> >> +		memcg->dirty_param.dirty_background_ratio = 0;
>> >> +		break;
>> >
>> >
>> > Curious....is this same behavior as vm_dirty_ratio ?
>> 
>> I think this is the same behavior as vm_dirty_ratio.  When vm_dirty_ratio is
>> changed then dirty_ratio_handler() will set vm_dirty_bytes=0.  When
>> vm_dirty_bytes is written dirty_bytes_handler() will set
>> vm_dirty_ratio=0.  So I think that the per-memcg dirty memory parameters
>> mimic the behavior of vm_dirty_ratio, vm_dirty_bytes and the other
>> global dirty parameters.
>> 
>> Am I missing your question?
>
> mmh... looking at the code it seems the same behaviour, but in
> Documentation/sysctl/vm.txt we say a different thing (i.e., for
> dirty_bytes):
>
> "If dirty_bytes is written, dirty_ratio becomes a function of its value
> (dirty_bytes / the amount of dirtyable system memory)."
>
> However, in dirty_bytes_handler()/dirty_ratio_handler() we actually set
> the counterpart value as 0.
>
> I think we should clarify the documentation.
>
> Signed-off-by: Andrea Righi <arighi@develer.com>

Reviewed-by: Greg Thelen <gthelen@google.com>

This documentation change is general cleanup that is independent of the
memcg patch series shown on the subject.
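
For reference, the existing sysctl handler implements the "counterpart
reads back as 0" behavior roughly as follows (a condensed paraphrase of
mm/page-writeback.c, with the completion-period rescaling elided; not
part of this patch series):

int dirty_ratio_handler(struct ctl_table *table, int write,
		void __user *buffer, size_t *lenp, loff_t *ppos)
{
	int old_ratio = vm_dirty_ratio;
	int ret;

	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
	if (ret == 0 && write && vm_dirty_ratio != old_ratio) {
		/* the real handler also rescales the writeback completion period here */
		vm_dirty_bytes = 0;	/* the counterpart now reads back as 0 */
	}
	return ret;
}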

> ---
>  Documentation/sysctl/vm.txt |   12 ++++++++----
>  1 files changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index b606c2c..30289fa 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -80,8 +80,10 @@ dirty_background_bytes
>  Contains the amount of dirty memory at which the pdflush background writeback
>  daemon will start writeback.
>  
> -If dirty_background_bytes is written, dirty_background_ratio becomes a function
> -of its value (dirty_background_bytes / the amount of dirtyable system memory).
> +Note: dirty_background_bytes is the counterpart of dirty_background_ratio. Only
> +one of them may be specified at a time. When one sysctl is written it is
> +immediately taken into account to evaluate the dirty memory limits and the
> +other appears as 0 when read.
>  
>  ==============================================================
>  
> @@ -97,8 +99,10 @@ dirty_bytes
>  Contains the amount of dirty memory at which a process generating disk writes
>  will itself start writeback.
>  
> -If dirty_bytes is written, dirty_ratio becomes a function of its value
> -(dirty_bytes / the amount of dirtyable system memory).
> +Note: dirty_bytes is the counterpart of dirty_ratio. Only one of them may be
> +specified at a time. When one sysctl is written it is immediately taken into
> +account to evaluate the dirty memory limits and the other appears as 0 when
> +read.
>  
>  Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any
>  value lower than this limit will be ignored and the old configuration will be

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 08/10] memcg: add cgroupfs interface to memcg dirty limits
  2010-10-06 18:34         ` Greg Thelen
@ 2010-10-06 20:54           ` Andrea Righi
  0 siblings, 0 replies; 96+ messages in thread
From: Andrea Righi @ 2010-10-06 20:54 UTC (permalink / raw)
  To: Greg Thelen
  Cc: KAMEZAWA Hiroyuki, Andrew Morton, linux-kernel, linux-mm,
	containers, Balbir Singh, Daisuke Nishimura, David Rientjes

On Wed, Oct 06, 2010 at 11:34:16AM -0700, Greg Thelen wrote:
> Andrea Righi <arighi@develer.com> writes:
> 
> > On Tue, Oct 05, 2010 at 12:33:15AM -0700, Greg Thelen wrote:
> >> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:
> >> 
> >> > On Sun,  3 Oct 2010 23:58:03 -0700
> >> > Greg Thelen <gthelen@google.com> wrote:
> >> >
> >> >> Add cgroupfs interface to memcg dirty page limits:
> >> >>   Direct write-out is controlled with:
> >> >>   - memory.dirty_ratio
> >> >>   - memory.dirty_bytes
> >> >> 
> >> >>   Background write-out is controlled with:
> >> >>   - memory.dirty_background_ratio
> >> >>   - memory.dirty_background_bytes
> >> >> 
> >> >> Signed-off-by: Andrea Righi <arighi@develer.com>
> >> >> Signed-off-by: Greg Thelen <gthelen@google.com>
> >> >
> >> > Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >> >
> >> > a question below.
> >> >
> >> >
> >> >> ---
> >> >>  mm/memcontrol.c |   89 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> >>  1 files changed, 89 insertions(+), 0 deletions(-)
> >> >> 
> >> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> >> index 6ec2625..2d45a0a 100644
> >> >> --- a/mm/memcontrol.c
> >> >> +++ b/mm/memcontrol.c
> >> >> @@ -100,6 +100,13 @@ enum mem_cgroup_stat_index {
> >> >>  	MEM_CGROUP_STAT_NSTATS,
> >> >>  };
> >> >>  
> >> >> +enum {
> >> >> +	MEM_CGROUP_DIRTY_RATIO,
> >> >> +	MEM_CGROUP_DIRTY_BYTES,
> >> >> +	MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> >> >> +	MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> >> >> +};
> >> >> +
> >> >>  struct mem_cgroup_stat_cpu {
> >> >>  	s64 count[MEM_CGROUP_STAT_NSTATS];
> >> >>  };
> >> >> @@ -4292,6 +4299,64 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
> >> >>  	return 0;
> >> >>  }
> >> >>  
> >> >> +static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
> >> >> +{
> >> >> +	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> >> >> +	bool root;
> >> >> +
> >> >> +	root = mem_cgroup_is_root(mem);
> >> >> +
> >> >> +	switch (cft->private) {
> >> >> +	case MEM_CGROUP_DIRTY_RATIO:
> >> >> +		return root ? vm_dirty_ratio : mem->dirty_param.dirty_ratio;
> >> >> +	case MEM_CGROUP_DIRTY_BYTES:
> >> >> +		return root ? vm_dirty_bytes : mem->dirty_param.dirty_bytes;
> >> >> +	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
> >> >> +		return root ? dirty_background_ratio :
> >> >> +			mem->dirty_param.dirty_background_ratio;
> >> >> +	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
> >> >> +		return root ? dirty_background_bytes :
> >> >> +			mem->dirty_param.dirty_background_bytes;
> >> >> +	default:
> >> >> +		BUG();
> >> >> +	}
> >> >> +}
> >> >> +
> >> >> +static int
> >> >> +mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
> >> >> +{
> >> >> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> >> >> +	int type = cft->private;
> >> >> +
> >> >> +	if (cgrp->parent == NULL)
> >> >> +		return -EINVAL;
> >> >> +	if ((type == MEM_CGROUP_DIRTY_RATIO ||
> >> >> +	     type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)
> >> >> +		return -EINVAL;
> >> >> +	switch (type) {
> >> >> +	case MEM_CGROUP_DIRTY_RATIO:
> >> >> +		memcg->dirty_param.dirty_ratio = val;
> >> >> +		memcg->dirty_param.dirty_bytes = 0;
> >> >> +		break;
> >> >> +	case MEM_CGROUP_DIRTY_BYTES:
> >> >> +		memcg->dirty_param.dirty_bytes = val;
> >> >> +		memcg->dirty_param.dirty_ratio  = 0;
> >> >> +		break;
> >> >> +	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
> >> >> +		memcg->dirty_param.dirty_background_ratio = val;
> >> >> +		memcg->dirty_param.dirty_background_bytes = 0;
> >> >> +		break;
> >> >> +	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
> >> >> +		memcg->dirty_param.dirty_background_bytes = val;
> >> >> +		memcg->dirty_param.dirty_background_ratio = 0;
> >> >> +		break;
> >> >
> >> >
> >> > Curious....is this same behavior as vm_dirty_ratio ?
> >> 
> >> I think this is the same behavior as vm_dirty_ratio.  When vm_dirty_ratio is
> >> changed then dirty_ratio_handler() will set vm_dirty_bytes=0.  When
> >> vm_dirty_bytes is written dirty_bytes_handler() will set
> >> vm_dirty_ratio=0.  So I think that the per-memcg dirty memory parameters
> >> mimic the behavior of vm_dirty_ratio, vm_dirty_bytes and the other
> >> global dirty parameters.
> >> 
> >> Am I missing your question?
> >
> > mmh... looking at the code it seems the same behaviour, but in
> > Documentation/sysctl/vm.txt we say a different thing (i.e., for
> > dirty_bytes):
> >
> > "If dirty_bytes is written, dirty_ratio becomes a function of its value
> > (dirty_bytes / the amount of dirtyable system memory)."
> >
> > However, in dirty_bytes_handler()/dirty_ratio_handler() we actually set
> > the counterpart value as 0.
> >
> > I think we should clarify the documentation.
> >
> > Signed-off-by: Andrea Righi <arighi@develer.com>
> 
> Reviewed-by: Greg Thelen <gthelen@google.com>
> 
> This documentation change is general cleanup that is independent of the
> memcg patch series shown on the subject.

Thanks Greg. I'll resend it as an independent patch.

-Andrea

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 07/10] memcg: add dirty limits to mem_cgroup
  2010-10-05 19:00     ` Greg Thelen
@ 2010-10-07  0:13       ` KAMEZAWA Hiroyuki
  2010-10-07  0:27         ` Greg Thelen
  0 siblings, 1 reply; 96+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-07  0:13 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrea Righi, Andrew Morton, linux-kernel, linux-mm, containers,
	Balbir Singh, Daisuke Nishimura

On Tue, 05 Oct 2010 12:00:17 -0700
Greg Thelen <gthelen@google.com> wrote:

> Andrea Righi <arighi@develer.com> writes:
> 
> > On Sun, Oct 03, 2010 at 11:58:02PM -0700, Greg Thelen wrote:
> >> Extend mem_cgroup to contain dirty page limits.  Also add routines
> >> allowing the kernel to query the dirty usage of a memcg.
> >> 
> >> These interfaces are not used by the kernel yet.  A subsequent commit
> >> will add kernel calls to utilize these new routines.
> >
> > A small note below.
> >
> >> 
> >> Signed-off-by: Greg Thelen <gthelen@google.com>
> >> Signed-off-by: Andrea Righi <arighi@develer.com>
> >> ---
> >>  include/linux/memcontrol.h |   44 +++++++++++
> >>  mm/memcontrol.c            |  180 +++++++++++++++++++++++++++++++++++++++++++-
> >>  2 files changed, 223 insertions(+), 1 deletions(-)
> >> 
> >> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> >> index 6303da1..dc8952d 100644
> >> --- a/include/linux/memcontrol.h
> >> +++ b/include/linux/memcontrol.h
> >> @@ -19,6 +19,7 @@
> >>  
> >>  #ifndef _LINUX_MEMCONTROL_H
> >>  #define _LINUX_MEMCONTROL_H
> >> +#include <linux/writeback.h>
> >>  #include <linux/cgroup.h>
> >>  struct mem_cgroup;
> >>  struct page_cgroup;
> >> @@ -33,6 +34,30 @@ enum mem_cgroup_write_page_stat_item {
> >>  	MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
> >>  };
> >>  
> >> +/* Cgroup memory statistics items exported to the kernel */
> >> +enum mem_cgroup_read_page_stat_item {
> >> +	MEMCG_NR_DIRTYABLE_PAGES,
> >> +	MEMCG_NR_RECLAIM_PAGES,
> >> +	MEMCG_NR_WRITEBACK,
> >> +	MEMCG_NR_DIRTY_WRITEBACK_PAGES,
> >> +};
> >> +
> >> +/* Dirty memory parameters */
> >> +struct vm_dirty_param {
> >> +	int dirty_ratio;
> >> +	int dirty_background_ratio;
> >> +	unsigned long dirty_bytes;
> >> +	unsigned long dirty_background_bytes;
> >> +};
> >> +
> >> +static inline void get_global_vm_dirty_param(struct vm_dirty_param *param)
> >> +{
> >> +	param->dirty_ratio = vm_dirty_ratio;
> >> +	param->dirty_bytes = vm_dirty_bytes;
> >> +	param->dirty_background_ratio = dirty_background_ratio;
> >> +	param->dirty_background_bytes = dirty_background_bytes;
> >> +}
> >> +
> >>  extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
> >>  					struct list_head *dst,
> >>  					unsigned long *scanned, int order,
> >> @@ -145,6 +170,10 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
> >>  	mem_cgroup_update_page_stat(page, idx, -1);
> >>  }
> >>  
> >> +bool mem_cgroup_has_dirty_limit(void);
> >> +void get_vm_dirty_param(struct vm_dirty_param *param);
> >> +s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item);
> >> +
> >>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> >>  						gfp_t gfp_mask);
> >>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
> >> @@ -326,6 +355,21 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
> >>  {
> >>  }
> >>  
> >> +static inline bool mem_cgroup_has_dirty_limit(void)
> >> +{
> >> +	return false;
> >> +}
> >> +
> >> +static inline void get_vm_dirty_param(struct vm_dirty_param *param)
> >> +{
> >> +	get_global_vm_dirty_param(param);
> >> +}
> >> +
> >> +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item)
> >> +{
> >> +	return -ENOSYS;
> >> +}
> >> +
> >>  static inline
> >>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> >>  					    gfp_t gfp_mask)
> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> index f40839f..6ec2625 100644
> >> --- a/mm/memcontrol.c
> >> +++ b/mm/memcontrol.c
> >> @@ -233,6 +233,10 @@ struct mem_cgroup {
> >>  	atomic_t	refcnt;
> >>  
> >>  	unsigned int	swappiness;
> >> +
> >> +	/* control memory cgroup dirty pages */
> >> +	struct vm_dirty_param dirty_param;
> >> +
> >>  	/* OOM-Killer disable */
> >>  	int		oom_kill_disable;
> >>  
> >> @@ -1132,6 +1136,172 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
> >>  	return swappiness;
> >>  }
> >>  
> >> +/*
> >> + * Returns a snapshot of the current dirty limits which is not synchronized with
> >> + * the routines that change the dirty limits.  If this routine races with an
> >> + * update to the dirty bytes/ratio value, then the caller must handle the case
> >> + * where both dirty_[background_]_ratio and _bytes are set.
> >> + */
> >> +static void __mem_cgroup_get_dirty_param(struct vm_dirty_param *param,
> >> +					 struct mem_cgroup *mem)
> >> +{
> >> +	if (mem && !mem_cgroup_is_root(mem)) {
> >> +		param->dirty_ratio = mem->dirty_param.dirty_ratio;
> >> +		param->dirty_bytes = mem->dirty_param.dirty_bytes;
> >> +		param->dirty_background_ratio =
> >> +			mem->dirty_param.dirty_background_ratio;
> >> +		param->dirty_background_bytes =
> >> +			mem->dirty_param.dirty_background_bytes;
> >> +	} else {
> >> +		get_global_vm_dirty_param(param);
> >> +	}
> >> +}
> >> +
> >> +/*
> >> + * Get dirty memory parameters of the current memcg or global values (if memory
> >> + * cgroups are disabled or querying the root cgroup).
> >> + */
> >> +void get_vm_dirty_param(struct vm_dirty_param *param)
> >> +{
> >> +	struct mem_cgroup *memcg;
> >> +
> >> +	if (mem_cgroup_disabled()) {
> >> +		get_global_vm_dirty_param(param);
> >> +		return;
> >> +	}
> >> +
> >> +	/*
> >> +	 * It's possible that "current" may be moved to other cgroup while we
> >> +	 * access cgroup. But precise check is meaningless because the task can
> >> +	 * be moved after our access and writeback tends to take long time.  At
> >> +	 * least, "memcg" will not be freed under rcu_read_lock().
> >> +	 */
> >> +	rcu_read_lock();
> >> +	memcg = mem_cgroup_from_task(current);
> >> +	__mem_cgroup_get_dirty_param(param, memcg);
> >> +	rcu_read_unlock();
> >> +}
> >> +
> >> +/*
> >> + * Check if current memcg has local dirty limits.  Return true if the current
> >> + * memory cgroup has local dirty memory settings.
> >> + */
> >> +bool mem_cgroup_has_dirty_limit(void)
> >> +{
> >> +	struct mem_cgroup *mem;
> >> +
> >> +	if (mem_cgroup_disabled())
> >> +		return false;
> >> +
> >> +	mem = mem_cgroup_from_task(current);
> >> +	return mem && !mem_cgroup_is_root(mem);
> >> +}
> >
> > We only check the pointer without dereferencing it, so this is probably
> > ok, but maybe this is safer:
> >
> > bool mem_cgroup_has_dirty_limit(void)
> > {
> > 	struct mem_cgroup *mem;
> > 	bool ret;
> >
> > 	if (mem_cgroup_disabled())
> > 		return false;
> >
> > 	rcu_read_lock();
> > 	mem = mem_cgroup_from_task(current);
> > 	ret = mem && !mem_cgroup_is_root(mem);
> > 	rcu_read_unlock();
> >
> > 	return ret;
> > }
> >
> > rcu_read_lock() should be held in mem_cgroup_from_task(), otherwise
> > lockdep could detect this as an error.
> >
> > Thanks,
> > -Andrea
> 
> Good suggestion.  I agree that lockdep might catch this.  There are some
> unrelated debug_locks failures (even without my patches) that I worked
> around to get lockdep to complain about this one.  I applied your
> suggested fix and lockdep was happy.  I will incorporate this fix into
> the next revision of the patch series.
> 

Hmm, considering other parts, shouldn't we define mem_cgroup_from_task
as macro ?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 07/10] memcg: add dirty limits to mem_cgroup
  2010-10-07  0:13       ` KAMEZAWA Hiroyuki
@ 2010-10-07  0:27         ` Greg Thelen
  2010-10-07  0:48           ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 96+ messages in thread
From: Greg Thelen @ 2010-10-07  0:27 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrea Righi, Andrew Morton, linux-kernel, linux-mm, containers,
	Balbir Singh, Daisuke Nishimura

KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:

> On Tue, 05 Oct 2010 12:00:17 -0700
> Greg Thelen <gthelen@google.com> wrote:
>
>> Andrea Righi <arighi@develer.com> writes:
>> 
>> > On Sun, Oct 03, 2010 at 11:58:02PM -0700, Greg Thelen wrote:
>> >> Extend mem_cgroup to contain dirty page limits.  Also add routines
>> >> allowing the kernel to query the dirty usage of a memcg.
>> >> 
>> >> These interfaces not used by the kernel yet.  A subsequent commit
>> >> will add kernel calls to utilize these new routines.
>> >
>> > A small note below.
>> >
>> >> 
>> >> Signed-off-by: Greg Thelen <gthelen@google.com>
>> >> Signed-off-by: Andrea Righi <arighi@develer.com>
>> >> ---
>> >>  include/linux/memcontrol.h |   44 +++++++++++
>> >>  mm/memcontrol.c            |  180 +++++++++++++++++++++++++++++++++++++++++++-
>> >>  2 files changed, 223 insertions(+), 1 deletions(-)
>> >> 
>> >> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> >> index 6303da1..dc8952d 100644
>> >> --- a/include/linux/memcontrol.h
>> >> +++ b/include/linux/memcontrol.h
>> >> @@ -19,6 +19,7 @@
>> >>  
>> >>  #ifndef _LINUX_MEMCONTROL_H
>> >>  #define _LINUX_MEMCONTROL_H
>> >> +#include <linux/writeback.h>
>> >>  #include <linux/cgroup.h>
>> >>  struct mem_cgroup;
>> >>  struct page_cgroup;
>> >> @@ -33,6 +34,30 @@ enum mem_cgroup_write_page_stat_item {
>> >>  	MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
>> >>  };
>> >>  
>> >> +/* Cgroup memory statistics items exported to the kernel */
>> >> +enum mem_cgroup_read_page_stat_item {
>> >> +	MEMCG_NR_DIRTYABLE_PAGES,
>> >> +	MEMCG_NR_RECLAIM_PAGES,
>> >> +	MEMCG_NR_WRITEBACK,
>> >> +	MEMCG_NR_DIRTY_WRITEBACK_PAGES,
>> >> +};
>> >> +
>> >> +/* Dirty memory parameters */
>> >> +struct vm_dirty_param {
>> >> +	int dirty_ratio;
>> >> +	int dirty_background_ratio;
>> >> +	unsigned long dirty_bytes;
>> >> +	unsigned long dirty_background_bytes;
>> >> +};
>> >> +
>> >> +static inline void get_global_vm_dirty_param(struct vm_dirty_param *param)
>> >> +{
>> >> +	param->dirty_ratio = vm_dirty_ratio;
>> >> +	param->dirty_bytes = vm_dirty_bytes;
>> >> +	param->dirty_background_ratio = dirty_background_ratio;
>> >> +	param->dirty_background_bytes = dirty_background_bytes;
>> >> +}
>> >> +
>> >>  extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
>> >>  					struct list_head *dst,
>> >>  					unsigned long *scanned, int order,
>> >> @@ -145,6 +170,10 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
>> >>  	mem_cgroup_update_page_stat(page, idx, -1);
>> >>  }
>> >>  
>> >> +bool mem_cgroup_has_dirty_limit(void);
>> >> +void get_vm_dirty_param(struct vm_dirty_param *param);
>> >> +s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item);
>> >> +
>> >>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>> >>  						gfp_t gfp_mask);
>> >>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
>> >> @@ -326,6 +355,21 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
>> >>  {
>> >>  }
>> >>  
>> >> +static inline bool mem_cgroup_has_dirty_limit(void)
>> >> +{
>> >> +	return false;
>> >> +}
>> >> +
>> >> +static inline void get_vm_dirty_param(struct vm_dirty_param *param)
>> >> +{
>> >> +	get_global_vm_dirty_param(param);
>> >> +}
>> >> +
>> >> +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item)
>> >> +{
>> >> +	return -ENOSYS;
>> >> +}
>> >> +
>> >>  static inline
>> >>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>> >>  					    gfp_t gfp_mask)
>> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> >> index f40839f..6ec2625 100644
>> >> --- a/mm/memcontrol.c
>> >> +++ b/mm/memcontrol.c
>> >> @@ -233,6 +233,10 @@ struct mem_cgroup {
>> >>  	atomic_t	refcnt;
>> >>  
>> >>  	unsigned int	swappiness;
>> >> +
>> >> +	/* control memory cgroup dirty pages */
>> >> +	struct vm_dirty_param dirty_param;
>> >> +
>> >>  	/* OOM-Killer disable */
>> >>  	int		oom_kill_disable;
>> >>  
>> >> @@ -1132,6 +1136,172 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
>> >>  	return swappiness;
>> >>  }
>> >>  
>> >> +/*
>> >> + * Returns a snapshot of the current dirty limits which is not synchronized with
>> >> + * the routines that change the dirty limits.  If this routine races with an
>> >> + * update to the dirty bytes/ratio value, then the caller must handle the case
>> >> + * where both dirty_[background_]_ratio and _bytes are set.
>> >> + */
>> >> +static void __mem_cgroup_get_dirty_param(struct vm_dirty_param *param,
>> >> +					 struct mem_cgroup *mem)
>> >> +{
>> >> +	if (mem && !mem_cgroup_is_root(mem)) {
>> >> +		param->dirty_ratio = mem->dirty_param.dirty_ratio;
>> >> +		param->dirty_bytes = mem->dirty_param.dirty_bytes;
>> >> +		param->dirty_background_ratio =
>> >> +			mem->dirty_param.dirty_background_ratio;
>> >> +		param->dirty_background_bytes =
>> >> +			mem->dirty_param.dirty_background_bytes;
>> >> +	} else {
>> >> +		get_global_vm_dirty_param(param);
>> >> +	}
>> >> +}
>> >> +
>> >> +/*
>> >> + * Get dirty memory parameters of the current memcg or global values (if memory
>> >> + * cgroups are disabled or querying the root cgroup).
>> >> + */
>> >> +void get_vm_dirty_param(struct vm_dirty_param *param)
>> >> +{
>> >> +	struct mem_cgroup *memcg;
>> >> +
>> >> +	if (mem_cgroup_disabled()) {
>> >> +		get_global_vm_dirty_param(param);
>> >> +		return;
>> >> +	}
>> >> +
>> >> +	/*
>> >> +	 * It's possible that "current" may be moved to other cgroup while we
>> >> +	 * access cgroup. But precise check is meaningless because the task can
>> >> +	 * be moved after our access and writeback tends to take long time.  At
>> >> +	 * least, "memcg" will not be freed under rcu_read_lock().
>> >> +	 */
>> >> +	rcu_read_lock();
>> >> +	memcg = mem_cgroup_from_task(current);
>> >> +	__mem_cgroup_get_dirty_param(param, memcg);
>> >> +	rcu_read_unlock();
>> >> +}
>> >> +
>> >> +/*
>> >> + * Check if current memcg has local dirty limits.  Return true if the current
>> >> + * memory cgroup has local dirty memory settings.
>> >> + */
>> >> +bool mem_cgroup_has_dirty_limit(void)
>> >> +{
>> >> +	struct mem_cgroup *mem;
>> >> +
>> >> +	if (mem_cgroup_disabled())
>> >> +		return false;
>> >> +
>> >> +	mem = mem_cgroup_from_task(current);
>> >> +	return mem && !mem_cgroup_is_root(mem);
>> >> +}
>> >
>> > We only check the pointer without dereferencing it, so this is probably
>> > ok, but maybe this is safer:
>> >
>> > bool mem_cgroup_has_dirty_limit(void)
>> > {
>> > 	struct mem_cgroup *mem;
>> > 	bool ret;
>> >
>> > 	if (mem_cgroup_disabled())
>> > 		return false;
>> >
>> > 	rcu_read_lock();
>> > 	mem = mem_cgroup_from_task(current);
>> > 	ret = mem && !mem_cgroup_is_root(mem);
>> > 	rcu_read_unlock();
>> >
>> > 	return ret;
>> > }
>> >
>> > rcu_read_lock() should be held in mem_cgroup_from_task(), otherwise
>> > lockdep could detect this as an error.
>> >
>> > Thanks,
>> > -Andrea
>> 
>> Good suggestion.  I agree that lockdep might catch this.  There are some
>> unrelated debug_locks failures (even without my patches) that I worked
>> around to get lockdep to complain about this one.  I applied your
>> suggested fix and lockdep was happy.  I will incorporate this fix into
>> the next revision of the patch series.
>> 
>
> Hmm, considering other parts, shouldn't we define mem_cgroup_from_task
> as macro ?
>
> Thanks,
> -Kame

Is your motivation to increase performance with the same functionality?
If so, then would a 'static inline' be performance equivalent to a
preprocessor macro yet be safer to use?

Maybe it makes more sense to find a way to perform this check in
mem_cgroup_has_dirty_limit() without needing to grab the rcu lock.  I
think this lock grab is unneeded.  I am still collecting performance
data, but suspect that this may be making the code slower than it needs
to be.
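
For illustration only -- the helper already exists in mm/memcontrol.c and the
bodies below are a rough paraphrase, not code from this series -- the two
forms being compared would look something like:

#define mem_cgroup_from_task(p)						\
	container_of(task_subsys_state((p), mem_cgroup_subsys_id),	\
		     struct mem_cgroup, css)

static inline struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
{
	return container_of(task_subsys_state(p, mem_cgroup_subsys_id),
			    struct mem_cgroup, css);
}

With optimization enabled both forms should generate identical code, so the
'static inline' keeps type checking (and whatever RCU annotations sit inside
task_subsys_state()) at no extra cost compared to the macro.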

--
Greg

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()
  2010-10-06  0:15       ` Minchan Kim
@ 2010-10-07  0:35         ` KAMEZAWA Hiroyuki
  2010-10-07  1:54           ` Daisuke Nishimura
  0 siblings, 1 reply; 96+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-07  0:35 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Greg Thelen, Andrew Morton, linux-kernel, linux-mm, containers,
	Andrea Righi, Balbir Singh, Daisuke Nishimura

On Wed, 6 Oct 2010 09:15:34 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:

> First of all, we could add your patch as it is and I don't expect any
> regression report about interrupt latency.
> That's because many embedded guys doesn't use mmotm and have a
> tendency to not report regression of VM.
> Even they don't use memcg. Hmm...
> 
> I pass the decision to MAINTAINER Kame and Balbir.
> Thanks for the detail explanation.
> 

Hmm. IRQ delay is a concern. So, my option is this. What do you think?

1. remove local_irq_save()/restore() in lock/unlock_page_cgroup().
   yes, I don't like it.

2. At moving charge, do this:
	a) lock_page() or trylock_page()
	b) wait_on_page_writeback()
	c) do move_account under lock_page_cgroup().
	d) unlock_page()


Then, Writeback updates will never come from IRQ context while
lock/unlock_page_cgroup() is held by move_account(). There will be no race.

Do I miss something ?
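
Spelled out as code, option 2 amounts to roughly this ordering inside
mem_cgroup_move_account() (a concrete patch implementing it appears later in
this thread):

	lock_page(pc->page);		/* or trylock_page(), bailing out on failure */
	wait_on_page_writeback(pc->page);
	lock_page_cgroup(pc);
	/* move flags and per-memcg statistics from mc.from to mc.to */
	unlock_page_cgroup(pc);
	unlock_page(pc->page);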

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 07/10] memcg: add dirty limits to mem_cgroup
  2010-10-07  0:27         ` Greg Thelen
@ 2010-10-07  0:48           ` KAMEZAWA Hiroyuki
  2010-10-12  0:24             ` Greg Thelen
  0 siblings, 1 reply; 96+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-07  0:48 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrea Righi, Andrew Morton, linux-kernel, linux-mm, containers,
	Balbir Singh, Daisuke Nishimura

On Wed, 06 Oct 2010 17:27:13 -0700
Greg Thelen <gthelen@google.com> wrote:

> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:
> 
> > On Tue, 05 Oct 2010 12:00:17 -0700
> > Greg Thelen <gthelen@google.com> wrote:
> >
> >> Andrea Righi <arighi@develer.com> writes:
> >> 
> >> > On Sun, Oct 03, 2010 at 11:58:02PM -0700, Greg Thelen wrote:
> >> >> Extend mem_cgroup to contain dirty page limits.  Also add routines
> >> >> allowing the kernel to query the dirty usage of a memcg.
> >> >> 
> >> >> These interfaces not used by the kernel yet.  A subsequent commit
> >> >> will add kernel calls to utilize these new routines.
> >> >
> >> > A small note below.
> >> >
> >> >> 
> >> >> Signed-off-by: Greg Thelen <gthelen@google.com>
> >> >> Signed-off-by: Andrea Righi <arighi@develer.com>
> >> >> ---
> >> >>  include/linux/memcontrol.h |   44 +++++++++++
> >> >>  mm/memcontrol.c            |  180 +++++++++++++++++++++++++++++++++++++++++++-
> >> >>  2 files changed, 223 insertions(+), 1 deletions(-)
> >> >> 
> >> >> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> >> >> index 6303da1..dc8952d 100644
> >> >> --- a/include/linux/memcontrol.h
> >> >> +++ b/include/linux/memcontrol.h
> >> >> @@ -19,6 +19,7 @@
> >> >>  
> >> >>  #ifndef _LINUX_MEMCONTROL_H
> >> >>  #define _LINUX_MEMCONTROL_H
> >> >> +#include <linux/writeback.h>
> >> >>  #include <linux/cgroup.h>
> >> >>  struct mem_cgroup;
> >> >>  struct page_cgroup;
> >> >> @@ -33,6 +34,30 @@ enum mem_cgroup_write_page_stat_item {
> >> >>  	MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
> >> >>  };
> >> >>  
> >> >> +/* Cgroup memory statistics items exported to the kernel */
> >> >> +enum mem_cgroup_read_page_stat_item {
> >> >> +	MEMCG_NR_DIRTYABLE_PAGES,
> >> >> +	MEMCG_NR_RECLAIM_PAGES,
> >> >> +	MEMCG_NR_WRITEBACK,
> >> >> +	MEMCG_NR_DIRTY_WRITEBACK_PAGES,
> >> >> +};
> >> >> +
> >> >> +/* Dirty memory parameters */
> >> >> +struct vm_dirty_param {
> >> >> +	int dirty_ratio;
> >> >> +	int dirty_background_ratio;
> >> >> +	unsigned long dirty_bytes;
> >> >> +	unsigned long dirty_background_bytes;
> >> >> +};
> >> >> +
> >> >> +static inline void get_global_vm_dirty_param(struct vm_dirty_param *param)
> >> >> +{
> >> >> +	param->dirty_ratio = vm_dirty_ratio;
> >> >> +	param->dirty_bytes = vm_dirty_bytes;
> >> >> +	param->dirty_background_ratio = dirty_background_ratio;
> >> >> +	param->dirty_background_bytes = dirty_background_bytes;
> >> >> +}
> >> >> +
> >> >>  extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
> >> >>  					struct list_head *dst,
> >> >>  					unsigned long *scanned, int order,
> >> >> @@ -145,6 +170,10 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
> >> >>  	mem_cgroup_update_page_stat(page, idx, -1);
> >> >>  }
> >> >>  
> >> >> +bool mem_cgroup_has_dirty_limit(void);
> >> >> +void get_vm_dirty_param(struct vm_dirty_param *param);
> >> >> +s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item);
> >> >> +
> >> >>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> >> >>  						gfp_t gfp_mask);
> >> >>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
> >> >> @@ -326,6 +355,21 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
> >> >>  {
> >> >>  }
> >> >>  
> >> >> +static inline bool mem_cgroup_has_dirty_limit(void)
> >> >> +{
> >> >> +	return false;
> >> >> +}
> >> >> +
> >> >> +static inline void get_vm_dirty_param(struct vm_dirty_param *param)
> >> >> +{
> >> >> +	get_global_vm_dirty_param(param);
> >> >> +}
> >> >> +
> >> >> +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item)
> >> >> +{
> >> >> +	return -ENOSYS;
> >> >> +}
> >> >> +
> >> >>  static inline
> >> >>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> >> >>  					    gfp_t gfp_mask)
> >> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> >> index f40839f..6ec2625 100644
> >> >> --- a/mm/memcontrol.c
> >> >> +++ b/mm/memcontrol.c
> >> >> @@ -233,6 +233,10 @@ struct mem_cgroup {
> >> >>  	atomic_t	refcnt;
> >> >>  
> >> >>  	unsigned int	swappiness;
> >> >> +
> >> >> +	/* control memory cgroup dirty pages */
> >> >> +	struct vm_dirty_param dirty_param;
> >> >> +
> >> >>  	/* OOM-Killer disable */
> >> >>  	int		oom_kill_disable;
> >> >>  
> >> >> @@ -1132,6 +1136,172 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
> >> >>  	return swappiness;
> >> >>  }
> >> >>  
> >> >> +/*
> >> >> + * Returns a snapshot of the current dirty limits which is not synchronized with
> >> >> + * the routines that change the dirty limits.  If this routine races with an
> >> >> + * update to the dirty bytes/ratio value, then the caller must handle the case
> >> >> + * where both dirty_[background_]_ratio and _bytes are set.
> >> >> + */
> >> >> +static void __mem_cgroup_get_dirty_param(struct vm_dirty_param *param,
> >> >> +					 struct mem_cgroup *mem)
> >> >> +{
> >> >> +	if (mem && !mem_cgroup_is_root(mem)) {
> >> >> +		param->dirty_ratio = mem->dirty_param.dirty_ratio;
> >> >> +		param->dirty_bytes = mem->dirty_param.dirty_bytes;
> >> >> +		param->dirty_background_ratio =
> >> >> +			mem->dirty_param.dirty_background_ratio;
> >> >> +		param->dirty_background_bytes =
> >> >> +			mem->dirty_param.dirty_background_bytes;
> >> >> +	} else {
> >> >> +		get_global_vm_dirty_param(param);
> >> >> +	}
> >> >> +}
> >> >> +
> >> >> +/*
> >> >> + * Get dirty memory parameters of the current memcg or global values (if memory
> >> >> + * cgroups are disabled or querying the root cgroup).
> >> >> + */
> >> >> +void get_vm_dirty_param(struct vm_dirty_param *param)
> >> >> +{
> >> >> +	struct mem_cgroup *memcg;
> >> >> +
> >> >> +	if (mem_cgroup_disabled()) {
> >> >> +		get_global_vm_dirty_param(param);
> >> >> +		return;
> >> >> +	}
> >> >> +
> >> >> +	/*
> >> >> +	 * It's possible that "current" may be moved to other cgroup while we
> >> >> +	 * access cgroup. But precise check is meaningless because the task can
> >> >> +	 * be moved after our access and writeback tends to take long time.  At
> >> >> +	 * least, "memcg" will not be freed under rcu_read_lock().
> >> >> +	 */
> >> >> +	rcu_read_lock();
> >> >> +	memcg = mem_cgroup_from_task(current);
> >> >> +	__mem_cgroup_get_dirty_param(param, memcg);
> >> >> +	rcu_read_unlock();
> >> >> +}
> >> >> +
> >> >> +/*
> >> >> + * Check if current memcg has local dirty limits.  Return true if the current
> >> >> + * memory cgroup has local dirty memory settings.
> >> >> + */
> >> >> +bool mem_cgroup_has_dirty_limit(void)
> >> >> +{
> >> >> +	struct mem_cgroup *mem;
> >> >> +
> >> >> +	if (mem_cgroup_disabled())
> >> >> +		return false;
> >> >> +
> >> >> +	mem = mem_cgroup_from_task(current);
> >> >> +	return mem && !mem_cgroup_is_root(mem);
> >> >> +}
> >> >
> >> > We only check the pointer without dereferencing it, so this is probably
> >> > ok, but maybe this is safer:
> >> >
> >> > bool mem_cgroup_has_dirty_limit(void)
> >> > {
> >> > 	struct mem_cgroup *mem;
> >> > 	bool ret;
> >> >
> >> > 	if (mem_cgroup_disabled())
> >> > 		return false;
> >> >
> >> > 	rcu_read_lock();
> >> > 	mem = mem_cgroup_from_task(current);
> >> > 	ret = mem && !mem_cgroup_is_root(mem);
> >> > 	rcu_read_unlock();
> >> >
> >> > 	return ret;
> >> > }
> >> >
> >> > rcu_read_lock() should be held in mem_cgroup_from_task(), otherwise
> >> > lockdep could detect this as an error.
> >> >
> >> > Thanks,
> >> > -Andrea
> >> 
> >> Good suggestion.  I agree that lockdep might catch this.  There are some
> >> unrelated debug_locks failures (even without my patches) that I worked
> >> around to get lockdep to complain about this one.  I applied your
> >> suggested fix and lockdep was happy.  I will incorporate this fix into
> >> the next revision of the patch series.
> >> 
> >
> > Hmm, considering other parts, shouldn't we define mem_cgroup_from_task
> > as macro ?
> >
> > Thanks,
> > -Kame
> 
> Is your motivation to increase performance with the same functionality?
> If so, then would a 'static inline' be performance equivalent to a
> preprocessor macro yet be safer to use?
> 
Ah, if lockdep flags this as a bug, I think other parts will hit it, too.

like this.
> static struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
> {
>         struct mem_cgroup *mem = NULL;
> 
>         if (!mm)
>                 return NULL;
>         /*
>          * Because we have no locks, mm->owner's may be being moved to other
>          * cgroup. We use css_tryget() here even if this looks
>          * pessimistic (rather than adding locks here).
>          */
>         rcu_read_lock();
>         do {
>                 mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
>                 if (unlikely(!mem))
>                         break;
>         } while (!css_tryget(&mem->css));
>         rcu_read_unlock();
>         return mem;
> }

mem_cgroup_from_task() is designed to be used like this.
If it is defined as a macro, I think it will not be caught.


> Maybe it makes more sense to find a way to perform this check in
> mem_cgroup_has_dirty_limit() without needing to grab the rcu lock.  I
> think this lock grab is unneeded.  I am still collecting performance
> data, but suspect that this may be making the code slower than it needs
> to be.
> 

Hmm. css_set[] itself is freed by RCU... what idea do you have for removing
rcu_read_lock()?  Adding some flags?

Ah...I noticed that you should do

 mem = mem_cgroup_from_task(current->mm->owner);

to check has_dirty_limit...
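
Putting Andrea's rcu_read_lock() fix together with this mm->owner point, the
check might look roughly like the sketch below.  The early return for tasks
without an mm (kernel threads falling back to the global limits) is an
assumption of the sketch, not something spelled out in the thread:

bool mem_cgroup_has_dirty_limit(void)
{
	struct mem_cgroup *mem;
	bool ret;

	if (mem_cgroup_disabled())
		return false;
	if (!current->mm)	/* kernel thread: use global limits */
		return false;

	rcu_read_lock();
	mem = mem_cgroup_from_task(rcu_dereference(current->mm->owner));
	ret = mem && !mem_cgroup_is_root(mem);
	rcu_read_unlock();

	return ret;
}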

-Kame


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()
  2010-10-07  0:35         ` KAMEZAWA Hiroyuki
@ 2010-10-07  1:54           ` Daisuke Nishimura
  2010-10-07  2:17             ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 96+ messages in thread
From: Daisuke Nishimura @ 2010-10-07  1:54 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Minchan Kim, Greg Thelen, Andrew Morton, linux-kernel, linux-mm,
	containers, Andrea Righi, Balbir Singh, Daisuke Nishimura

On Thu, 7 Oct 2010 09:35:45 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Wed, 6 Oct 2010 09:15:34 +0900
> Minchan Kim <minchan.kim@gmail.com> wrote:
> 
> > First of all, we could add your patch as it is and I don't expect any
> > regression report about interrupt latency.
> > That's because many embedded guys doesn't use mmotm and have a
> > tendency to not report regression of VM.
> > Even they don't use memcg. Hmm...
> > 
> > I pass the decision to MAINTAINER Kame and Balbir.
> > Thanks for the detail explanation.
> > 
> 
> Hmm. IRQ delay is a concern. So, my option is this. How do you think ?
> 
> 1. remove local_irq_save()/restore() in lock/unlock_page_cgroup().
>    yes, I don't like it.
> 
> 2. At moving charge, do this:
> 	a) lock_page()/ or trylock_page()
> 	b) wait_on_page_writeback()
> 	c) do move_account under lock_page_cgroup().
> 	c) unlock_page()
> 
> 
> Then, Writeback updates will never come from IRQ context while
> lock/unlock_page_cgroup() is held by move_account(). There will be no race.
> 
Hmm, if we do that, I think we need to do it under pte_lock in
mem_cgroup_move_charge_pte_range(). But we can't do wait_on_page_writeback()
under pte_lock, right? Or we need to re-organize the current move-charge implementation.

Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()
  2010-10-07  1:54           ` Daisuke Nishimura
@ 2010-10-07  2:17             ` KAMEZAWA Hiroyuki
  2010-10-07  6:21               ` [PATCH] memcg: reduce lock time at move charge (Was " KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 96+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-07  2:17 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: Minchan Kim, Greg Thelen, Andrew Morton, linux-kernel, linux-mm,
	containers, Andrea Righi, Balbir Singh

On Thu, 7 Oct 2010 10:54:56 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Thu, 7 Oct 2010 09:35:45 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Wed, 6 Oct 2010 09:15:34 +0900
> > Minchan Kim <minchan.kim@gmail.com> wrote:
> > 
> > > First of all, we could add your patch as it is and I don't expect any
> > > regression report about interrupt latency.
> > > That's because many embedded guys doesn't use mmotm and have a
> > > tendency to not report regression of VM.
> > > Even they don't use memcg. Hmm...
> > > 
> > > I pass the decision to MAINTAINER Kame and Balbir.
> > > Thanks for the detail explanation.
> > > 
> > 
> > Hmm. IRQ delay is a concern. So, my option is this. How do you think ?
> > 
> > 1. remove local_irq_save()/restore() in lock/unlock_page_cgroup().
> >    yes, I don't like it.
> > 
> > 2. At moving charge, do this:
> > 	a) lock_page()/ or trylock_page()
> > 	b) wait_on_page_writeback()
> > 	c) do move_account under lock_page_cgroup().
> > 	c) unlock_page()
> > 
> > 
> > Then, Writeback updates will never come from IRQ context while
> > lock/unlock_page_cgroup() is held by move_account(). There will be no race.
> > 
> hmm, if we'll do that, I think we need to do that under pte_lock in
> mem_cgroup_move_charge_pte_range(). But, we can't do wait_on_page_writeback()
> under pte_lock, right? Or, we need re-organize current move-charge implementation.
> 
Nice catch. I think releasing pte_lock() is okay. (and it should be released)

IIUC, the task's css_set points to the new cgroup when "move" is called. Then,
it's not necessary to take pte_lock, I guess.
(And taking pte_lock for too long is not appreciated..)

I'll write a sample patch today.

Thanks,
-Kame








> Thanks,
> Daisuke Nishimura.
> 


^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH] memcg: reduce lock time at move charge (Was Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()
  2010-10-07  2:17             ` KAMEZAWA Hiroyuki
@ 2010-10-07  6:21               ` KAMEZAWA Hiroyuki
  2010-10-07  6:24                 ` [PATCH] memcg: lock-free clear page writeback " KAMEZAWA Hiroyuki
  2010-10-07  7:28                 ` [PATCH] memcg: reduce lock time at move charge " Daisuke Nishimura
  0 siblings, 2 replies; 96+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-07  6:21 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, Minchan Kim, Greg Thelen, Andrew Morton,
	linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh

On Thu, 7 Oct 2010 11:17:43 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
 
> > hmm, if we'll do that, I think we need to do that under pte_lock in
> > mem_cgroup_move_charge_pte_range(). But, we can't do wait_on_page_writeback()
> > under pte_lock, right? Or, we need re-organize current move-charge implementation.
> > 
> Nice catch. I think releaseing pte_lock() is okay. (and it should be released)
> 
> IIUC, task's css_set() points to new cgroup when "move" is called. Then,
> it's not necessary to take pte_lock, I guess.
> (And taking pte_lock too long is not appreciated..)
> 
> I'll write a sample patch today.
> 
Here.
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Now, at task migration among cgroups, memory cgroup scans the page table and
moves accounting if the flags are properly set.

The core code, mem_cgroup_move_charge_pte_range() does

 	pte_offset_map_lock();
	for all ptes in a page table:
		1. look into page table, find_and_get a page
		2. remove it from LRU.
		3. move charge.
		4. putback to LRU. put_page()
	pte_offset_map_unlock();

for pte entries on a 3rd(2nd) level page table.

This pte_offset_map_lock section seems a bit long. This patch modifies the routine as

	pte_offset_map_lock()
	for 32 pages:
		      find_and_get a page
		      record it
	pte_offset_map_unlock()
	for all recorded pages
		      isolate it from LRU.
		      move charge
		      putback to LRU
	for all recorded pages
		      put_page()

Note: newly-charged pages while we move account are charged to the new group.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   92 ++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 60 insertions(+), 32 deletions(-)

Index: mmotm-0928/mm/memcontrol.c
===================================================================
--- mmotm-0928.orig/mm/memcontrol.c
+++ mmotm-0928/mm/memcontrol.c
@@ -4475,17 +4475,22 @@ one_by_one:
  *
  * Called with pte lock held.
  */
-union mc_target {
-	struct page	*page;
-	swp_entry_t	ent;
-};
 
 enum mc_target_type {
-	MC_TARGET_NONE,	/* not used */
+	MC_TARGET_NONE, /* used as failure code(0) */
 	MC_TARGET_PAGE,
 	MC_TARGET_SWAP,
 };
 
+struct mc_target {
+	enum mc_target_type type;
+	union {
+		struct page	*page;
+		swp_entry_t	ent;
+	} val;
+};
+
+
 static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
 						unsigned long addr, pte_t ptent)
 {
@@ -4561,7 +4566,7 @@ static struct page *mc_handle_file_pte(s
 }
 
 static int is_target_pte_for_mc(struct vm_area_struct *vma,
-		unsigned long addr, pte_t ptent, union mc_target *target)
+		unsigned long addr, pte_t ptent, struct mc_target *target)
 {
 	struct page *page = NULL;
 	struct page_cgroup *pc;
@@ -4587,7 +4592,7 @@ static int is_target_pte_for_mc(struct v
 		if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
 			ret = MC_TARGET_PAGE;
 			if (target)
-				target->page = page;
+				target->val.page = page;
 		}
 		if (!ret || !target)
 			put_page(page);
@@ -4597,8 +4602,10 @@ static int is_target_pte_for_mc(struct v
 			css_id(&mc.from->css) == lookup_swap_cgroup(ent)) {
 		ret = MC_TARGET_SWAP;
 		if (target)
-			target->ent = ent;
+			target->val.ent = ent;
 	}
+	if (target)
+		target->type = ret;
 	return ret;
 }
 
@@ -4751,6 +4758,9 @@ static void mem_cgroup_cancel_attach(str
 	mem_cgroup_clear_mc();
 }
 
+
+#define MC_MOVE_ONCE		(32)
+
 static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 				unsigned long addr, unsigned long end,
 				struct mm_walk *walk)
@@ -4759,26 +4769,47 @@ static int mem_cgroup_move_charge_pte_ra
 	struct vm_area_struct *vma = walk->private;
 	pte_t *pte;
 	spinlock_t *ptl;
+	struct mc_target *target;
+	int index, num;
+
+	target = kzalloc(sizeof(struct mc_target) *MC_MOVE_ONCE, GFP_KERNEL);
+	if (!target)
+		return -ENOMEM;
 
 retry:
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
-	for (; addr != end; addr += PAGE_SIZE) {
+	for (num = 0; num < MC_MOVE_ONCE && addr != end; addr += PAGE_SIZE) {
 		pte_t ptent = *(pte++);
-		union mc_target target;
-		int type;
+		ret = is_target_pte_for_mc(vma, addr, ptent, &target[num]);
+		if (!ret)
+			continue;
+		target[num++].type = ret;
+	}
+	pte_unmap_unlock(pte - 1, ptl);
+	cond_resched();
+
+	ret = 0;
+	index = 0;
+	do {
+		struct mc_target *mt;
 		struct page *page;
 		struct page_cgroup *pc;
 		swp_entry_t ent;
 
-		if (!mc.precharge)
-			break;
+		if (!mc.precharge) {
+			ret = mem_cgroup_do_precharge(1);
+			if (ret)
+				goto out;
+			continue;
+		}
+
+		mt = &target[index++];
 
-		type = is_target_pte_for_mc(vma, addr, ptent, &target);
-		switch (type) {
+		switch (mt->type) {
 		case MC_TARGET_PAGE:
-			page = target.page;
+			page = mt->val.page;
 			if (isolate_lru_page(page))
-				goto put;
+				break;
 			pc = lookup_page_cgroup(page);
 			if (!mem_cgroup_move_account(pc,
 						mc.from, mc.to, false)) {
@@ -4787,11 +4818,9 @@ retry:
 				mc.moved_charge++;
 			}
 			putback_lru_page(page);
-put:			/* is_target_pte_for_mc() gets the page */
-			put_page(page);
 			break;
 		case MC_TARGET_SWAP:
-			ent = target.ent;
+			ent = mt->val.ent;
 			if (!mem_cgroup_move_swap_account(ent,
 						mc.from, mc.to, false)) {
 				mc.precharge--;
@@ -4802,21 +4831,20 @@ put:			/* is_target_pte_for_mc() gets th
 		default:
 			break;
 		}
+	} while (index < num);
+out:
+	for (index = 0; index < num; index++) {
+		if (target[index].type == MC_TARGET_PAGE)
+			put_page(target[index].val.page);
+		target[index].type = MC_TARGET_NONE;
 	}
-	pte_unmap_unlock(pte - 1, ptl);
+
+	if (ret)
+		return ret;
 	cond_resched();
 
-	if (addr != end) {
-		/*
-		 * We have consumed all precharges we got in can_attach().
-		 * We try charge one by one, but don't do any additional
-		 * charges to mc.to if we have failed in charge once in attach()
-		 * phase.
-		 */
-		ret = mem_cgroup_do_precharge(1);
-		if (!ret)
-			goto retry;
-	}
+	if (addr != end)
+		goto retry;
 
 	return ret;
 }


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 08/10] memcg: add cgroupfs interface to memcg dirty limits
  2010-10-04  6:58 ` [PATCH 08/10] memcg: add cgroupfs interface to memcg dirty limits Greg Thelen
  2010-10-05  7:13   ` KAMEZAWA Hiroyuki
  2010-10-06 13:30   ` Balbir Singh
@ 2010-10-07  6:23   ` Ciju Rajan K
  2010-10-07 17:46     ` Greg Thelen
  2 siblings, 1 reply; 96+ messages in thread
From: Ciju Rajan K @ 2010-10-07  6:23 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura, Ciju Rajan K

Greg Thelen wrote:
> Add cgroupfs interface to memcg dirty page limits:
>   Direct write-out is controlled with:
>   - memory.dirty_ratio
>   - memory.dirty_bytes
>
>   Background write-out is controlled with:
>   - memory.dirty_background_ratio
>   - memory.dirty_background_bytes
>
> Signed-off-by: Andrea Righi <arighi@develer.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>
> ---
>  mm/memcontrol.c |   89 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 89 insertions(+), 0 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6ec2625..2d45a0a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -100,6 +100,13 @@ enum mem_cgroup_stat_index {
>  	MEM_CGROUP_STAT_NSTATS,
>  };
>
> +enum {
> +	MEM_CGROUP_DIRTY_RATIO,
> +	MEM_CGROUP_DIRTY_BYTES,
> +	MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> +	MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
> +};
> +
>  struct mem_cgroup_stat_cpu {
>  	s64 count[MEM_CGROUP_STAT_NSTATS];
>  };
> @@ -4292,6 +4299,64 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
>  	return 0;
>  }
>
> +static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
> +{
> +	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> +	bool root;
> +
> +	root = mem_cgroup_is_root(mem);
> +
> +	switch (cft->private) {
> +	case MEM_CGROUP_DIRTY_RATIO:
> +		return root ? vm_dirty_ratio : mem->dirty_param.dirty_ratio;
> +	case MEM_CGROUP_DIRTY_BYTES:
> +		return root ? vm_dirty_bytes : mem->dirty_param.dirty_bytes;
> +	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
> +		return root ? dirty_background_ratio :
> +			mem->dirty_param.dirty_background_ratio;
> +	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
> +		return root ? dirty_background_bytes :
> +			mem->dirty_param.dirty_background_bytes;
> +	default:
> +		BUG();
> +	}
> +}
> +
> +static int
> +mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +	int type = cft->private;
> +
> +	if (cgrp->parent == NULL)
> +		return -EINVAL;
> +	if ((type == MEM_CGROUP_DIRTY_RATIO ||
> +	     type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)
> +		return -EINVAL;
> +	switch (type) {
> +	case MEM_CGROUP_DIRTY_RATIO:
> +		memcg->dirty_param.dirty_ratio = val;
> +		memcg->dirty_param.dirty_bytes = 0;
> +		break;
> +	case MEM_CGROUP_DIRTY_BYTES:
> +		memcg->dirty_param.dirty_bytes = val;
> +		memcg->dirty_param.dirty_ratio  = 0;
> +		break;
> +	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
> +		memcg->dirty_param.dirty_background_ratio = val;
> +		memcg->dirty_param.dirty_background_bytes = 0;
> +		break;
> +	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
> +		memcg->dirty_param.dirty_background_bytes = val;
> +		memcg->dirty_param.dirty_background_ratio = 0;
> +		break;
> +	default:
> +		BUG();
> +		break;
> +	}
> +	return 0;
> +}
> +
>  static struct cftype mem_cgroup_files[] = {
>  	{
>  		.name = "usage_in_bytes",
> @@ -4355,6 +4420,30 @@ static struct cftype mem_cgroup_files[] = {
>  		.unregister_event = mem_cgroup_oom_unregister_event,
>  		.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
>  	},
> +	{
> +		.name = "dirty_ratio",
> +		.read_u64 = mem_cgroup_dirty_read,
> +		.write_u64 = mem_cgroup_dirty_write,
> +		.private = MEM_CGROUP_DIRTY_RATIO,
> +	},
> +	{
> +		.name = "dirty_bytes",
> +		.read_u64 = mem_cgroup_dirty_read,
> +		.write_u64 = mem_cgroup_dirty_write,
> +		.private = MEM_CGROUP_DIRTY_BYTES,
> +	},
> +	{
>   
Would it be a good idea to rename "dirty_bytes" to "dirty_limit_in_bytes",
so that it matches the other memcg tunable naming conventions?
We already have memory.memsw.limit_in_bytes, memory.limit_in_bytes,
memory.soft_limit_in_bytes, etc.
> +		.name = "dirty_background_ratio",
> +		.read_u64 = mem_cgroup_dirty_read,
> +		.write_u64 = mem_cgroup_dirty_write,
> +		.private = MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
> +	},
> +	{
> +		.name = "dirty_background_bytes",
> +		.read_u64 = mem_cgroup_dirty_read,
> +		.write_u64 = mem_cgroup_dirty_write,
> +		.private = MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
>   
Similarly "dirty_background_bytes" to dirty_background_limit_in_bytes ?
> +	},
>  };
>
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
>   


^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH] memcg: lock-free clear page writeback  (Was Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()
  2010-10-07  6:21               ` [PATCH] memcg: reduce lock time at move charge (Was " KAMEZAWA Hiroyuki
@ 2010-10-07  6:24                 ` KAMEZAWA Hiroyuki
  2010-10-07  9:05                   ` KAMEZAWA Hiroyuki
  2010-10-07 23:35                   ` Minchan Kim
  2010-10-07  7:28                 ` [PATCH] memcg: reduce lock time at move charge " Daisuke Nishimura
  1 sibling, 2 replies; 96+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-07  6:24 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, Minchan Kim, Greg Thelen, Andrew Morton,
	linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh

Greg, I think clear_page_writeback() will not require _any_ locks with this patch.
But set_page_writeback() still requires them...
(Maybe adding a special function for clear_page_writeback() is better than
 adding more complexity to the switch() in update_page_stat().)
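
A rough sketch of that "special function" idea is below.  The function name,
the MEM_CGROUP_STAT_WRITEBACK index, and the exact per-cpu counter access are
placeholders of the sketch, not names taken from the posted series:

void mem_cgroup_clear_writeback(struct page *page)
{
	struct page_cgroup *pc = lookup_page_cgroup(page);

	if (unlikely(!pc) || !PageCgroupUsed(pc))
		return;
	/*
	 * move_account() now waits for writeback completion under
	 * lock_page(), so pc->mem_cgroup cannot change while this page is
	 * still under writeback: no lock_page_cgroup(), no IRQ disabling.
	 */
	this_cpu_dec(pc->mem_cgroup->stat->count[MEM_CGROUP_STAT_WRITEBACK]);
}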

==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Now, at page information accounting, we do lock_page_cgroup() if pc->mem_cgroup
points to a cgroup that someone is moving charges from.

For dirty-page accounting, one of the troubles is the writeback bit.
In general, writeback can be cleared from IRQ context.  To update the writeback
bit with lock_page_cgroup() in a safe way, we would have to disable IRQs
....or do something else.

This patch waits for completion of writeback under lock_page() and then does
lock_page_cgroup() in a safe way.  (We never get end_io via IRQ context racing
with the move.)

By this, writeback-accounting will never see race with account_move() and
it can trust pc->mem_cgroup always _without_ any lock.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

Index: mmotm-0928/mm/memcontrol.c
===================================================================
--- mmotm-0928.orig/mm/memcontrol.c
+++ mmotm-0928/mm/memcontrol.c
@@ -2183,17 +2183,35 @@ static void __mem_cgroup_move_account(st
 /*
  * check whether the @pc is valid for moving account and call
  * __mem_cgroup_move_account()
+ * Don't call this under pte_lock etc...we'll do lock_page() and wait for
+ * the end of I/O.
  */
 static int mem_cgroup_move_account(struct page_cgroup *pc,
 		struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
 {
 	int ret = -EINVAL;
+
+	/*
+ 	 * We move several flags and accounting information here. So we need to
+ 	 * avoid races with the update_stat routines. For most routines,
+ 	 * lock_page_cgroup() is enough for avoiding the race. But we need to take
+ 	 * care of IRQ context. If flag updates come from IRQ context, this
+ 	 * "move account" would be racy (and cause deadlock in lock_page_cgroup())
+ 	 *
+ 	 * Now, the only race we have is Writeback flag. We wait for it cleared
+ 	 * before starting our jobs.
+ 	 */
+
+	lock_page(pc->page);
+	wait_on_page_writeback(pc->page);
+
 	lock_page_cgroup(pc);
 	if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
 		__mem_cgroup_move_account(pc, from, to, uncharge);
 		ret = 0;
 	}
 	unlock_page_cgroup(pc);
+	unlock_page(pc->page);
 	/*
 	 * check events
 	 */


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] memcg: reduce lock time at move charge (Was Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()
  2010-10-07  6:21               ` [PATCH] memcg: reduce lock time at move charge (Was " KAMEZAWA Hiroyuki
  2010-10-07  6:24                 ` [PATCH] memcg: lock-free clear page writeback " KAMEZAWA Hiroyuki
@ 2010-10-07  7:28                 ` Daisuke Nishimura
  2010-10-07  7:42                   ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 96+ messages in thread
From: Daisuke Nishimura @ 2010-10-07  7:28 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Minchan Kim, Greg Thelen, Andrew Morton, linux-kernel, linux-mm,
	containers, Andrea Righi, Balbir Singh, Daisuke Nishimura

On Thu, 7 Oct 2010 15:21:11 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Thu, 7 Oct 2010 11:17:43 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>  
> > > hmm, if we'll do that, I think we need to do that under pte_lock in
> > > mem_cgroup_move_charge_pte_range(). But, we can't do wait_on_page_writeback()
> > > under pte_lock, right? Or, we need re-organize current move-charge implementation.
> > > 
> > Nice catch. I think releaseing pte_lock() is okay. (and it should be released)
> > 
> > IIUC, task's css_set() points to new cgroup when "move" is called. Then,
> > it's not necessary to take pte_lock, I guess.
> > (And taking pte_lock too long is not appreciated..)
> > 
> > I'll write a sample patch today.
> > 
> Here.
Great!

> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Now, at task migration among cgroup, memory cgroup scans page table and moving
> account if flags are properly set.
> 
> The core code, mem_cgroup_move_charge_pte_range() does
> 
>  	pte_offset_map_lock();
> 	for all ptes in a page table:
> 		1. look into page table, find_and_get a page
> 		2. remove it from LRU.
> 		3. move charge.
> 		4. putback to LRU. put_page()
> 	pte_offset_map_unlock();
> 
> for pte entries on a 3rd(2nd) level page table.
> 
> This pte_offset_map_lock seems a bit long. This patch modifies a rountine as
> 
> 	pte_offset_map_lock()
> 	for 32 pages:
> 		      find_and_get a page
> 		      record it
> 	pte_offset_map_unlock()
> 	for all recorded pages
> 		      isolate it from LRU.
> 		      move charge
> 		      putback to LRU
> 	for all recorded pages
> 		      put_page()
> 
> Note: newly-charged pages while we move account are charged to the new group.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  mm/memcontrol.c |   92 ++++++++++++++++++++++++++++++++++++--------------------
>  1 file changed, 60 insertions(+), 32 deletions(-)
> 
> Index: mmotm-0928/mm/memcontrol.c
> ===================================================================
> --- mmotm-0928.orig/mm/memcontrol.c
> +++ mmotm-0928/mm/memcontrol.c
> @@ -4475,17 +4475,22 @@ one_by_one:
>   *
>   * Called with pte lock held.
>   */
> -union mc_target {
> -	struct page	*page;
> -	swp_entry_t	ent;
> -};
>  
>  enum mc_target_type {
> -	MC_TARGET_NONE,	/* not used */
> +	MC_TARGET_NONE, /* used as failure code(0) */
>  	MC_TARGET_PAGE,
>  	MC_TARGET_SWAP,
>  };
>  
> +struct mc_target {
> +	enum mc_target_type type;
> +	union {
> +		struct page	*page;
> +		swp_entry_t	ent;
> +	} val;
> +};
> +
> +
>  static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
>  						unsigned long addr, pte_t ptent)
>  {
> @@ -4561,7 +4566,7 @@ static struct page *mc_handle_file_pte(s
>  }
>  
>  static int is_target_pte_for_mc(struct vm_area_struct *vma,
> -		unsigned long addr, pte_t ptent, union mc_target *target)
> +		unsigned long addr, pte_t ptent, struct mc_target *target)
>  {
>  	struct page *page = NULL;
>  	struct page_cgroup *pc;
> @@ -4587,7 +4592,7 @@ static int is_target_pte_for_mc(struct v
>  		if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
>  			ret = MC_TARGET_PAGE;
>  			if (target)
> -				target->page = page;
> +				target->val.page = page;
>  		}
>  		if (!ret || !target)
>  			put_page(page);
> @@ -4597,8 +4602,10 @@ static int is_target_pte_for_mc(struct v
>  			css_id(&mc.from->css) == lookup_swap_cgroup(ent)) {
>  		ret = MC_TARGET_SWAP;
>  		if (target)
> -			target->ent = ent;
> +			target->val.ent = ent;
>  	}
> +	if (target)
> +		target->type = ret;
>  	return ret;
>  }
>  
> @@ -4751,6 +4758,9 @@ static void mem_cgroup_cancel_attach(str
>  	mem_cgroup_clear_mc();
>  }
>  
> +
> +#define MC_MOVE_ONCE		(32)
> +
>  static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
>  				unsigned long addr, unsigned long end,
>  				struct mm_walk *walk)
> @@ -4759,26 +4769,47 @@ static int mem_cgroup_move_charge_pte_ra
>  	struct vm_area_struct *vma = walk->private;
>  	pte_t *pte;
>  	spinlock_t *ptl;
> +	struct mc_target *target;
> +	int index, num;
> +
> +	target = kzalloc(sizeof(struct mc_target) *MC_MOVE_ONCE, GFP_KERNEL);
hmm? I can't see it freed anywhere.

Considering you reset target[]->type to MC_TARGET_NONE, did you intend to
reuse target[] while walking the page table?
If so, how about adding a new member (struct mc_target *target) to
move_charge_struct, and allocating/freeing it in mem_cgroup_move_charge()?
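
A rough sketch of that suggestion (the field placement and the early return on
allocation failure are illustrative only, not part of any posted patch):

	/* new member in the existing move_charge_struct 'mc' */
	struct mc_target *target;	/* MC_MOVE_ONCE scratch entries */

static void mem_cgroup_move_charge(struct mm_struct *mm)
{
	mc.target = kzalloc(sizeof(struct mc_target) * MC_MOVE_ONCE,
			    GFP_KERNEL);
	if (!mc.target)
		return;
	lru_add_drain_all();
	/* ... existing walk_page_range() loop over mm's vmas ... */
	kfree(mc.target);
	mc.target = NULL;
}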

Thanks,
Daisuke Nishimura.

> +	if (!target)
> +		return -ENOMEM;
>  
>  retry:
>  	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> -	for (; addr != end; addr += PAGE_SIZE) {
> +	for (num = 0; num < MC_MOVE_ONCE && addr != end; addr += PAGE_SIZE) {
>  		pte_t ptent = *(pte++);
> -		union mc_target target;
> -		int type;
> +		ret = is_target_pte_for_mc(vma, addr, ptent, &target[num]);
> +		if (!ret)
> +			continue;
> +		target[num++].type = ret;
> +	}
> +	pte_unmap_unlock(pte - 1, ptl);
> +	cond_resched();
> +
> +	ret = 0;
> +	index = 0;
> +	do {
> +		struct mc_target *mt;
>  		struct page *page;
>  		struct page_cgroup *pc;
>  		swp_entry_t ent;
>  
> -		if (!mc.precharge)
> -			break;
> +		if (!mc.precharge) {
> +			ret = mem_cgroup_do_precharge(1);
> +			if (ret)
> +				goto out;
> +			continue;
> +		}
> +
> +		mt = &target[index++];
>  
> -		type = is_target_pte_for_mc(vma, addr, ptent, &target);
> -		switch (type) {
> +		switch (mt->type) {
>  		case MC_TARGET_PAGE:
> -			page = target.page;
> +			page = mt->val.page;
>  			if (isolate_lru_page(page))
> -				goto put;
> +				break;
>  			pc = lookup_page_cgroup(page);
>  			if (!mem_cgroup_move_account(pc,
>  						mc.from, mc.to, false)) {
> @@ -4787,11 +4818,9 @@ retry:
>  				mc.moved_charge++;
>  			}
>  			putback_lru_page(page);
> -put:			/* is_target_pte_for_mc() gets the page */
> -			put_page(page);
>  			break;
>  		case MC_TARGET_SWAP:
> -			ent = target.ent;
> +			ent = mt->val.ent;
>  			if (!mem_cgroup_move_swap_account(ent,
>  						mc.from, mc.to, false)) {
>  				mc.precharge--;
> @@ -4802,21 +4831,20 @@ put:			/* is_target_pte_for_mc() gets th
>  		default:
>  			break;
>  		}
> +	} while (index < num);
> +out:
> +	for (index = 0; index < num; index++) {
> +		if (target[index].type == MC_TARGET_PAGE)
> +			put_page(target[index].val.page);
> +		target[index].type = MC_TARGET_NONE;
>  	}
> -	pte_unmap_unlock(pte - 1, ptl);
> +
> +	if (ret)
> +		return ret;
>  	cond_resched();
>  
> -	if (addr != end) {
> -		/*
> -		 * We have consumed all precharges we got in can_attach().
> -		 * We try charge one by one, but don't do any additional
> -		 * charges to mc.to if we have failed in charge once in attach()
> -		 * phase.
> -		 */
> -		ret = mem_cgroup_do_precharge(1);
> -		if (!ret)
> -			goto retry;
> -	}
> +	if (addr != end)
> +		goto retry;
>  
>  	return ret;
>  }
> 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] memcg: reduce lock time at move charge (Was Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()
  2010-10-07  7:28                 ` [PATCH] memcg: reduce lock time at move charge " Daisuke Nishimura
@ 2010-10-07  7:42                   ` KAMEZAWA Hiroyuki
  2010-10-07  8:04                     ` [PATCH v2] " KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 96+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-07  7:42 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: Minchan Kim, Greg Thelen, Andrew Morton, linux-kernel, linux-mm,
	containers, Andrea Righi, Balbir Singh

On Thu, 7 Oct 2010 16:28:11 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Thu, 7 Oct 2010 15:21:11 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Thu, 7 Oct 2010 11:17:43 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >  
> > > > hmm, if we'll do that, I think we need to do that under pte_lock in
> > > > mem_cgroup_move_charge_pte_range(). But, we can't do wait_on_page_writeback()
> > > > under pte_lock, right? Or, we need re-organize current move-charge implementation.
> > > > 
> > > Nice catch. I think releaseing pte_lock() is okay. (and it should be released)
> > > 
> > > IIUC, task's css_set() points to new cgroup when "move" is called. Then,
> > > it's not necessary to take pte_lock, I guess.
> > > (And taking pte_lock too long is not appreciated..)
> > > 
> > > I'll write a sample patch today.
> > > 
> > Here.
> Great!
> 
> > ==
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > Now, at task migration among cgroup, memory cgroup scans page table and moving
> > account if flags are properly set.
> > 
> > The core code, mem_cgroup_move_charge_pte_range() does
> > 
> >  	pte_offset_map_lock();
> > 	for all ptes in a page table:
> > 		1. look into page table, find_and_get a page
> > 		2. remove it from LRU.
> > 		3. move charge.
> > 		4. putback to LRU. put_page()
> > 	pte_offset_map_unlock();
> > 
> > for pte entries on a 3rd(2nd) level page table.
> > 
> > This pte_offset_map_lock seems a bit long. This patch modifies a rountine as
> > 
> > 	pte_offset_map_lock()
> > 	for 32 pages:
> > 		      find_and_get a page
> > 		      record it
> > 	pte_offset_map_unlock()
> > 	for all recorded pages
> > 		      isolate it from LRU.
> > 		      move charge
> > 		      putback to LRU
> > 	for all recorded pages
> > 		      put_page()
> > 
> > Note: newly-charged pages while we move account are charged to the new group.
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  mm/memcontrol.c |   92 ++++++++++++++++++++++++++++++++++++--------------------
> >  1 file changed, 60 insertions(+), 32 deletions(-)
> > 
> > Index: mmotm-0928/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-0928.orig/mm/memcontrol.c
> > +++ mmotm-0928/mm/memcontrol.c
> > @@ -4475,17 +4475,22 @@ one_by_one:
> >   *
> >   * Called with pte lock held.
> >   */
> > -union mc_target {
> > -	struct page	*page;
> > -	swp_entry_t	ent;
> > -};
> >  
> >  enum mc_target_type {
> > -	MC_TARGET_NONE,	/* not used */
> > +	MC_TARGET_NONE, /* used as failure code(0) */
> >  	MC_TARGET_PAGE,
> >  	MC_TARGET_SWAP,
> >  };
> >  
> > +struct mc_target {
> > +	enum mc_target_type type;
> > +	union {
> > +		struct page	*page;
> > +		swp_entry_t	ent;
> > +	} val;
> > +};
> > +
> > +
> >  static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
> >  						unsigned long addr, pte_t ptent)
> >  {
> > @@ -4561,7 +4566,7 @@ static struct page *mc_handle_file_pte(s
> >  }
> >  
> >  static int is_target_pte_for_mc(struct vm_area_struct *vma,
> > -		unsigned long addr, pte_t ptent, union mc_target *target)
> > +		unsigned long addr, pte_t ptent, struct mc_target *target)
> >  {
> >  	struct page *page = NULL;
> >  	struct page_cgroup *pc;
> > @@ -4587,7 +4592,7 @@ static int is_target_pte_for_mc(struct v
> >  		if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
> >  			ret = MC_TARGET_PAGE;
> >  			if (target)
> > -				target->page = page;
> > +				target->val.page = page;
> >  		}
> >  		if (!ret || !target)
> >  			put_page(page);
> > @@ -4597,8 +4602,10 @@ static int is_target_pte_for_mc(struct v
> >  			css_id(&mc.from->css) == lookup_swap_cgroup(ent)) {
> >  		ret = MC_TARGET_SWAP;
> >  		if (target)
> > -			target->ent = ent;
> > +			target->val.ent = ent;
> >  	}
> > +	if (target)
> > +		target->type = ret;
> >  	return ret;
> >  }
> >  
> > @@ -4751,6 +4758,9 @@ static void mem_cgroup_cancel_attach(str
> >  	mem_cgroup_clear_mc();
> >  }
> >  
> > +
> > +#define MC_MOVE_ONCE		(32)
> > +
> >  static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
> >  				unsigned long addr, unsigned long end,
> >  				struct mm_walk *walk)
> > @@ -4759,26 +4769,47 @@ static int mem_cgroup_move_charge_pte_ra
> >  	struct vm_area_struct *vma = walk->private;
> >  	pte_t *pte;
> >  	spinlock_t *ptl;
> > +	struct mc_target *target;
> > +	int index, num;
> > +
> > +	target = kzalloc(sizeof(struct mc_target) *MC_MOVE_ONCE, GFP_KERNEL);
> hmm? I can't see it freed anywhere.
> 
leaked..


> Considering you reset target[]->type to MC_TARGET_NONE, did you intend to
> reuse target[] while walking the page table?
yes.

> If so, how about adding a new member (struct mc_target *target) to move_charge_struct,
> and allocating/freeing it in mem_cgroup_move_charge()?
> 
Hmm, sounds nice.

I'll do.
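
(For reference, the suggestion amounts to something like the sketch below: a preallocated
array embedded in "mc" rather than a per-walk kzalloc(). This is only an illustration;
the v2 patch in the follow-up message is the authoritative version.)
==
 /* "mc" and its members are protected by cgroup_mutex */
 static struct move_charge_struct {
 	spinlock_t	  lock; /* for from, to, moving_task */
 	...
 	unsigned long moved_swap;
+	struct mc_target target[MC_MOVE_ONCE];	/* preallocated, reused per 32-pte chunk */
 	struct task_struct *moving_task;	/* a task moving charges */
 	wait_queue_head_t waitq;		/* a waitq for other context */
 } mc;
==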

Thanks,
-Kame

> Thanks,
> Daisuke Nishimura.
> 
> > +	if (!target)
> > +		return -ENOMEM;
> >  
> >  retry:
> >  	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > -	for (; addr != end; addr += PAGE_SIZE) {
> > +	for (num = 0; num < MC_MOVE_ONCE && addr != end; addr += PAGE_SIZE) {
> >  		pte_t ptent = *(pte++);
> > -		union mc_target target;
> > -		int type;
> > +		ret = is_target_pte_for_mc(vma, addr, ptent, &target[num]);
> > +		if (!ret)
> > +			continue;
> > +		target[num++].type = ret;
> > +	}
> > +	pte_unmap_unlock(pte - 1, ptl);
> > +	cond_resched();
> > +
> > +	ret = 0;
> > +	index = 0;
> > +	do {
> > +		struct mc_target *mt;
> >  		struct page *page;
> >  		struct page_cgroup *pc;
> >  		swp_entry_t ent;
> >  
> > -		if (!mc.precharge)
> > -			break;
> > +		if (!mc.precharge) {
> > +			ret = mem_cgroup_do_precharge(1);
> > +			if (ret)
> > +				goto out;
> > +			continue;
> > +		}
> > +
> > +		mt = &target[index++];
> >  
> > -		type = is_target_pte_for_mc(vma, addr, ptent, &target);
> > -		switch (type) {
> > +		switch (mt->type) {
> >  		case MC_TARGET_PAGE:
> > -			page = target.page;
> > +			page = mt->val.page;
> >  			if (isolate_lru_page(page))
> > -				goto put;
> > +				break;
> >  			pc = lookup_page_cgroup(page);
> >  			if (!mem_cgroup_move_account(pc,
> >  						mc.from, mc.to, false)) {
> > @@ -4787,11 +4818,9 @@ retry:
> >  				mc.moved_charge++;
> >  			}
> >  			putback_lru_page(page);
> > -put:			/* is_target_pte_for_mc() gets the page */
> > -			put_page(page);
> >  			break;
> >  		case MC_TARGET_SWAP:
> > -			ent = target.ent;
> > +			ent = mt->val.ent;
> >  			if (!mem_cgroup_move_swap_account(ent,
> >  						mc.from, mc.to, false)) {
> >  				mc.precharge--;
> > @@ -4802,21 +4831,20 @@ put:			/* is_target_pte_for_mc() gets th
> >  		default:
> >  			break;
> >  		}
> > +	} while (index < num);
> > +out:
> > +	for (index = 0; index < num; index++) {
> > +		if (target[index].type == MC_TARGET_PAGE)
> > +			put_page(target[index].val.page);
> > +		target[index].type = MC_TARGET_NONE;
> >  	}
> > -	pte_unmap_unlock(pte - 1, ptl);
> > +
> > +	if (ret)
> > +		return ret;
> >  	cond_resched();
> >  
> > -	if (addr != end) {
> > -		/*
> > -		 * We have consumed all precharges we got in can_attach().
> > -		 * We try charge one by one, but don't do any additional
> > -		 * charges to mc.to if we have failed in charge once in attach()
> > -		 * phase.
> > -		 */
> > -		ret = mem_cgroup_do_precharge(1);
> > -		if (!ret)
> > -			goto retry;
> > -	}
> > +	if (addr != end)
> > +		goto retry;
> >  
> >  	return ret;
> >  }
> > 
> 


^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH v2] memcg: reduce lock time at move charge (Was Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()
  2010-10-07  7:42                   ` KAMEZAWA Hiroyuki
@ 2010-10-07  8:04                     ` KAMEZAWA Hiroyuki
  2010-10-07 23:14                       ` Andrew Morton
  0 siblings, 1 reply; 96+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-07  8:04 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, Minchan Kim, Greg Thelen, Andrew Morton,
	linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh

On Thu, 7 Oct 2010 16:42:04 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > If so, how about adding a new member (struct mc_target *target) to move_charge_struct,
> > and allocating/freeing it in mem_cgroup_move_charge()?
> > 
> Hmm, sounds nice.
> 

Here.
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Now, at task migration among cgroups, memory cgroup scans the page table and moves
accounting if flags are properly set.

The core code, mem_cgroup_move_charge_pte_range() does

 	pte_offset_map_lock();
	for all ptes in a page table:
		1. look into page table, find_and_get a page
		2. remove it from LRU.
		3. move charge.
		4. putback to LRU. put_page()
	pte_offset_map_unlock();

for the pte entries of one lowest-level page table.

This pte_offset_map_lock hold time seems a bit long. This patch modifies the routine as

	for 32 pages: pte_offset_map_lock()
		      find_and_get a page
		      record it
		      pte_offset_map_unlock()
	for all recorded pages
		      isolate it from LRU.
		      move charge
		      putback to LRU
	for all recorded pages
		      put_page()

Changelog: v1->v2
 - removed kzalloc() of mc_target; preallocate it in "mc"

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   95 ++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 59 insertions(+), 36 deletions(-)

Index: mmotm-0928/mm/memcontrol.c
===================================================================
--- mmotm-0928.orig/mm/memcontrol.c
+++ mmotm-0928/mm/memcontrol.c
@@ -276,6 +276,21 @@ enum move_type {
 	NR_MOVE_TYPE,
 };
 
+enum mc_target_type {
+	MC_TARGET_NONE, /* used as failure code(0) */
+	MC_TARGET_PAGE,
+	MC_TARGET_SWAP,
+};
+
+struct mc_target {
+	enum mc_target_type type;
+	union {
+		struct page *page;
+		swp_entry_t	ent;
+	} val;
+};
+#define MC_MOVE_ONCE	(32)
+
 /* "mc" and its members are protected by cgroup_mutex */
 static struct move_charge_struct {
 	spinlock_t	  lock; /* for from, to, moving_task */
@@ -284,6 +299,7 @@ static struct move_charge_struct {
 	unsigned long precharge;
 	unsigned long moved_charge;
 	unsigned long moved_swap;
+	struct mc_target target[MC_MOVE_ONCE];
 	struct task_struct *moving_task;	/* a task moving charges */
 	wait_queue_head_t waitq;		/* a waitq for other context */
 } mc = {
@@ -291,6 +307,7 @@ static struct move_charge_struct {
 	.waitq = __WAIT_QUEUE_HEAD_INITIALIZER(mc.waitq),
 };
 
+
 static bool move_anon(void)
 {
 	return test_bit(MOVE_CHARGE_TYPE_ANON,
@@ -4475,16 +4492,7 @@ one_by_one:
  *
  * Called with pte lock held.
  */
-union mc_target {
-	struct page	*page;
-	swp_entry_t	ent;
-};
 
-enum mc_target_type {
-	MC_TARGET_NONE,	/* not used */
-	MC_TARGET_PAGE,
-	MC_TARGET_SWAP,
-};
 
 static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
 						unsigned long addr, pte_t ptent)
@@ -4561,7 +4569,7 @@ static struct page *mc_handle_file_pte(s
 }
 
 static int is_target_pte_for_mc(struct vm_area_struct *vma,
-		unsigned long addr, pte_t ptent, union mc_target *target)
+		unsigned long addr, pte_t ptent, struct mc_target *target)
 {
 	struct page *page = NULL;
 	struct page_cgroup *pc;
@@ -4587,7 +4595,7 @@ static int is_target_pte_for_mc(struct v
 		if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
 			ret = MC_TARGET_PAGE;
 			if (target)
-				target->page = page;
+				target->val.page = page;
 		}
 		if (!ret || !target)
 			put_page(page);
@@ -4597,8 +4605,10 @@ static int is_target_pte_for_mc(struct v
 			css_id(&mc.from->css) == lookup_swap_cgroup(ent)) {
 		ret = MC_TARGET_SWAP;
 		if (target)
-			target->ent = ent;
+			target->val.ent = ent;
 	}
+	if (target)
+		target->type = ret;
 	return ret;
 }
 
@@ -4759,26 +4769,42 @@ static int mem_cgroup_move_charge_pte_ra
 	struct vm_area_struct *vma = walk->private;
 	pte_t *pte;
 	spinlock_t *ptl;
+	int index, num;
 
 retry:
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
-	for (; addr != end; addr += PAGE_SIZE) {
+	for (num = 0; num < MC_MOVE_ONCE && addr != end; addr += PAGE_SIZE) {
 		pte_t ptent = *(pte++);
-		union mc_target target;
-		int type;
+		ret = is_target_pte_for_mc(vma, addr, ptent, &mc.target[num]);
+		if (!ret)
+			continue;
+		mc.target[num++].type = ret;
+	}
+	pte_unmap_unlock(pte - 1, ptl);
+	cond_resched();
+
+	ret = 0;
+	index = 0;
+	do {
+		struct mc_target *mt;
 		struct page *page;
 		struct page_cgroup *pc;
 		swp_entry_t ent;
 
-		if (!mc.precharge)
-			break;
+		if (!mc.precharge) {
+			ret = mem_cgroup_do_precharge(1);
+			if (ret)
+				goto out;
+			continue;
+		}
+
+		mt = &mc.target[index++];
 
-		type = is_target_pte_for_mc(vma, addr, ptent, &target);
-		switch (type) {
+		switch (mt->type) {
 		case MC_TARGET_PAGE:
-			page = target.page;
+			page = mt->val.page;
 			if (isolate_lru_page(page))
-				goto put;
+				break;
 			pc = lookup_page_cgroup(page);
 			if (!mem_cgroup_move_account(pc,
 						mc.from, mc.to, false)) {
@@ -4787,11 +4813,9 @@ retry:
 				mc.moved_charge++;
 			}
 			putback_lru_page(page);
-put:			/* is_target_pte_for_mc() gets the page */
-			put_page(page);
 			break;
 		case MC_TARGET_SWAP:
-			ent = target.ent;
+			ent = mt->val.ent;
 			if (!mem_cgroup_move_swap_account(ent,
 						mc.from, mc.to, false)) {
 				mc.precharge--;
@@ -4802,21 +4826,20 @@ put:			/* is_target_pte_for_mc() gets th
 		default:
 			break;
 		}
+	} while (index < num);
+out:
+	for (index = 0; index < num; index++) {
+		if (mc.target[index].type == MC_TARGET_PAGE)
+			put_page(mc.target[index].val.page);
+		mc.target[index].type = MC_TARGET_NONE;
 	}
-	pte_unmap_unlock(pte - 1, ptl);
+
+	if (ret)
+		return ret;
 	cond_resched();
 
-	if (addr != end) {
-		/*
-		 * We have consumed all precharges we got in can_attach().
-		 * We try charge one by one, but don't do any additional
-		 * charges to mc.to if we have failed in charge once in attach()
-		 * phase.
-		 */
-		ret = mem_cgroup_do_precharge(1);
-		if (!ret)
-			goto retry;
-	}
+	if (addr != end)
+		goto retry;
 
 	return ret;
 }


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] memcg: lock-free clear page writeback  (Was Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()
  2010-10-07  6:24                 ` [PATCH] memcg: lock-free clear page writeback " KAMEZAWA Hiroyuki
@ 2010-10-07  9:05                   ` KAMEZAWA Hiroyuki
  2010-10-07 23:35                   ` Minchan Kim
  1 sibling, 0 replies; 96+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-07  9:05 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, Minchan Kim, Greg Thelen, Andrew Morton,
	linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh

On Thu, 7 Oct 2010 15:24:22 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> Greg, I think clear_page_writeback() will not require _any_ locks with this patch.
> But set_page_writeback() requires it...
> (Maybe adding a special function for clear_page_writeback() is better, rather than
>  adding some complexity to the switch() in update_page_stat().)
> 

I'm testing code like this.
==
       /* pc->mem_cgroup is unstable ? */
        if (unlikely(mem_cgroup_stealed(mem))) {
                /* take a lock against to access pc->mem_cgroup */
                if (!in_interrupt()) {
                        lock_page_cgroup(pc);
                        need_unlock = true;
                        mem = pc->mem_cgroup;
                        if (!mem || !PageCgroupUsed(pc))
                                goto out;
                } else if (idx == MEMCG_NR_FILE_WRITEBACK && (val < 0)) {
                        /* This is allowed */
                } else
                        BUG();
        }
==
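For context, here is a rough sketch of where such a fragment would sit in the page-stat
update path. Everything outside the quoted branch above is reconstructed from this
discussion; the function shape, labels and the final counter update are assumptions,
not code taken from a posted patch.
==
void mem_cgroup_update_page_stat(struct page *page,
			enum mem_cgroup_write_page_stat_item idx, int val)
{
	struct page_cgroup *pc = lookup_page_cgroup(page);
	struct mem_cgroup *mem;
	bool need_unlock = false;

	if (unlikely(!pc))
		return;

	rcu_read_lock();
	mem = pc->mem_cgroup;
	if (unlikely(!mem || !PageCgroupUsed(pc)))
		goto out;

	/* pc->mem_cgroup is unstable (charges being moved away from it)? */
	if (unlikely(mem_cgroup_stealed(mem))) {
		if (!in_interrupt()) {
			/* take the lock so pc->mem_cgroup stays stable */
			lock_page_cgroup(pc);
			need_unlock = true;
			mem = pc->mem_cgroup;
			if (!mem || !PageCgroupUsed(pc))
				goto out;
		} else if (idx == MEMCG_NR_FILE_WRITEBACK && (val < 0)) {
			/*
			 * Clearing Writeback is the only update allowed from
			 * IRQ context: move_account() waits for writeback to
			 * finish, so pc->mem_cgroup cannot change here.
			 */
		} else
			BUG();
	}

	/* switch (idx): adjust the matching MEM_CGROUP_STAT_* percpu counter by val */
out:
	if (need_unlock)
		unlock_page_cgroup(pc);
	rcu_read_unlock();
}
==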
Thanks,
-Kame


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 08/10] memcg: add cgroupfs interface to memcg dirty limits
  2010-10-07  6:23   ` Ciju Rajan K
@ 2010-10-07 17:46     ` Greg Thelen
  0 siblings, 0 replies; 96+ messages in thread
From: Greg Thelen @ 2010-10-07 17:46 UTC (permalink / raw)
  To: Ciju Rajan K
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, KAMEZAWA Hiroyuki, Daisuke Nishimura

Ciju Rajan K <ciju@linux.vnet.ibm.com> writes:

> Greg Thelen wrote:
>> Add cgroupfs interface to memcg dirty page limits:
>>   Direct write-out is controlled with:
>>   - memory.dirty_ratio
>>   - memory.dirty_bytes
>>
>>   Background write-out is controlled with:
>>   - memory.dirty_background_ratio
>>   - memory.dirty_background_bytes
>>
>> Signed-off-by: Andrea Righi <arighi@develer.com>
>> Signed-off-by: Greg Thelen <gthelen@google.com>
>> ---
>>  mm/memcontrol.c |   89 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  1 files changed, 89 insertions(+), 0 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 6ec2625..2d45a0a 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -100,6 +100,13 @@ enum mem_cgroup_stat_index {
>>  	MEM_CGROUP_STAT_NSTATS,
>>  };
>>
>> +enum {
>> +	MEM_CGROUP_DIRTY_RATIO,
>> +	MEM_CGROUP_DIRTY_BYTES,
>> +	MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
>> +	MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
>> +};
>> +
>>  struct mem_cgroup_stat_cpu {
>>  	s64 count[MEM_CGROUP_STAT_NSTATS];
>>  };
>> @@ -4292,6 +4299,64 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
>>  	return 0;
>>  }
>>
>> +static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
>> +{
>> +	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
>> +	bool root;
>> +
>> +	root = mem_cgroup_is_root(mem);
>> +
>> +	switch (cft->private) {
>> +	case MEM_CGROUP_DIRTY_RATIO:
>> +		return root ? vm_dirty_ratio : mem->dirty_param.dirty_ratio;
>> +	case MEM_CGROUP_DIRTY_BYTES:
>> +		return root ? vm_dirty_bytes : mem->dirty_param.dirty_bytes;
>> +	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
>> +		return root ? dirty_background_ratio :
>> +			mem->dirty_param.dirty_background_ratio;
>> +	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
>> +		return root ? dirty_background_bytes :
>> +			mem->dirty_param.dirty_background_bytes;
>> +	default:
>> +		BUG();
>> +	}
>> +}
>> +
>> +static int
>> +mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
>> +{
>> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
>> +	int type = cft->private;
>> +
>> +	if (cgrp->parent == NULL)
>> +		return -EINVAL;
>> +	if ((type == MEM_CGROUP_DIRTY_RATIO ||
>> +	     type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)
>> +		return -EINVAL;
>> +	switch (type) {
>> +	case MEM_CGROUP_DIRTY_RATIO:
>> +		memcg->dirty_param.dirty_ratio = val;
>> +		memcg->dirty_param.dirty_bytes = 0;
>> +		break;
>> +	case MEM_CGROUP_DIRTY_BYTES:
>> +		memcg->dirty_param.dirty_bytes = val;
>> +		memcg->dirty_param.dirty_ratio  = 0;
>> +		break;
>> +	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
>> +		memcg->dirty_param.dirty_background_ratio = val;
>> +		memcg->dirty_param.dirty_background_bytes = 0;
>> +		break;
>> +	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
>> +		memcg->dirty_param.dirty_background_bytes = val;
>> +		memcg->dirty_param.dirty_background_ratio = 0;
>> +		break;
>> +	default:
>> +		BUG();
>> +		break;
>> +	}
>> +	return 0;
>> +}
>> +
>>  static struct cftype mem_cgroup_files[] = {
>>  	{
>>  		.name = "usage_in_bytes",
>> @@ -4355,6 +4420,30 @@ static struct cftype mem_cgroup_files[] = {
>>  		.unregister_event = mem_cgroup_oom_unregister_event,
>>  		.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
>>  	},
>> +	{
>> +		.name = "dirty_ratio",
>> +		.read_u64 = mem_cgroup_dirty_read,
>> +		.write_u64 = mem_cgroup_dirty_write,
>> +		.private = MEM_CGROUP_DIRTY_RATIO,
>> +	},
>> +	{
>> +		.name = "dirty_bytes",
>> +		.read_u64 = mem_cgroup_dirty_read,
>> +		.write_u64 = mem_cgroup_dirty_write,
>> +		.private = MEM_CGROUP_DIRTY_BYTES,
>> +	},
>> +	{
>>   
> Is it a good idea to rename "dirty_bytes" to "dirty_limit_in_bytes",
> so that it matches the naming convention of the other memcg tunables?
> We already have memory.memsw.limit_in_bytes, memory.limit_in_bytes,
> memory.soft_limit_in_bytes, etc.

I see your point in trying to be more internally consistent with the other
memcg counters.

It's a trade-off: either use names consistent with /proc/sys/vm, or use
names similar to the other memory.* control files.  I prefer your suggestion
and will rename as you suggested, unless I hear strong objections.
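
A sketch of what one renamed control file entry would look like if the rename is
adopted (the name below is Ciju's suggestion, not a final decision):

	{
		.name = "dirty_limit_in_bytes",
		.read_u64 = mem_cgroup_dirty_read,
		.write_u64 = mem_cgroup_dirty_write,
		.private = MEM_CGROUP_DIRTY_BYTES,
	},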

>> +		.name = "dirty_background_ratio",
>> +		.read_u64 = mem_cgroup_dirty_read,
>> +		.write_u64 = mem_cgroup_dirty_write,
>> +		.private = MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
>> +	},
>> +	{
>> +		.name = "dirty_background_bytes",
>> +		.read_u64 = mem_cgroup_dirty_read,
>> +		.write_u64 = mem_cgroup_dirty_write,
>> +		.private = MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
>>   
> Similarly "dirty_background_bytes" to dirty_background_limit_in_bytes ?
>> +	},
>>  };
>>
>>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
>>   

PS: I am collecting performance data on the patch series (including Kame's
lockless writeback stats).  I should have some useful data today.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2] memcg: reduce lock time at move charge (Was Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()
  2010-10-07  8:04                     ` [PATCH v2] " KAMEZAWA Hiroyuki
@ 2010-10-07 23:14                       ` Andrew Morton
  2010-10-08  1:12                         ` Daisuke Nishimura
  2010-10-08  4:37                         ` KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 96+ messages in thread
From: Andrew Morton @ 2010-10-07 23:14 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, Minchan Kim, Greg Thelen, linux-kernel,
	linux-mm, containers, Andrea Righi, Balbir Singh

On Thu, 7 Oct 2010 17:04:05 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> Now, at task migration among cgroups, memory cgroup scans the page table and moves
> accounting if flags are properly set.
> 
> The core code, mem_cgroup_move_charge_pte_range() does
> 
>  	pte_offset_map_lock();
> 	for all ptes in a page table:
> 		1. look into page table, find_and_get a page
> 		2. remove it from LRU.
> 		3. move charge.
> 		4. putback to LRU. put_page()
> 	pte_offset_map_unlock();
> 
> for the pte entries of one lowest-level page table.
> 
> This pte_offset_map_lock hold time seems a bit long. This patch modifies the routine as
> 
> 	for 32 pages: pte_offset_map_lock()
> 		      find_and_get a page
> 		      record it
> 		      pte_offset_map_unlock()
> 	for all recorded pages
> 		      isolate it from LRU.
> 		      move charge
> 		      putback to LRU
> 	for all recorded pages
> 		      put_page()

The patch makes the code larger, more complex and slower!

I do think we're owed a more complete description of its benefits than
"seems a bit long".  Have problems been observed?  Any measurements
taken?



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] memcg: lock-free clear page writeback (Was Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()
  2010-10-07  6:24                 ` [PATCH] memcg: lock-free clear page writeback " KAMEZAWA Hiroyuki
  2010-10-07  9:05                   ` KAMEZAWA Hiroyuki
@ 2010-10-07 23:35                   ` Minchan Kim
  2010-10-08  4:41                     ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 96+ messages in thread
From: Minchan Kim @ 2010-10-07 23:35 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, Greg Thelen, Andrew Morton, linux-kernel,
	linux-mm, containers, Andrea Righi, Balbir Singh

Hi Kame,

On Thu, Oct 7, 2010 at 3:24 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> Greg, I think clear_page_writeback() will not require _any_ locks with this patch.
> But set_page_writeback() requires it...
> (Maybe adding a special function for clear_page_writeback() is better, rather than
>  adding some complexity to the switch() in update_page_stat().)
>
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> Now, at page information accounting, we do lock_page_cgroup() if pc->mem_cgroup
> points to a cgroup where someone is moving charges from.
>
> At supporting dirty-page accounting, one of the troubles is the writeback bit.
> In general, writeback can be cleared via IRQ context. To update the writeback bit
> with lock_page_cgroup() in a safe way, we'd have to disable IRQs
> ....or do something.
>
> This patch waits for completion of writeback under lock_page() and does
> lock_page_cgroup() in a safe way. (We never get end_io via IRQ context then.)
>
> By this, writeback accounting will never race with account_move() and
> it can trust pc->mem_cgroup always _without_ any lock.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  mm/memcontrol.c |   18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
>
> Index: mmotm-0928/mm/memcontrol.c
> ===================================================================
> --- mmotm-0928.orig/mm/memcontrol.c
> +++ mmotm-0928/mm/memcontrol.c
> @@ -2183,17 +2183,35 @@ static void __mem_cgroup_move_account(st
>  /*
>  * check whether the @pc is valid for moving account and call
>  * __mem_cgroup_move_account()
> + * Don't call this under pte_lock etc...we'll do lock_page() and wait for
> + * the end of I/O.
>  */
>  static int mem_cgroup_move_account(struct page_cgroup *pc,
>                struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
>  {
>        int ret = -EINVAL;
> +
> +       /*
> > +        * We move several flags and accounting information here. So we need to
> > +        * avoid races with the update_stat routines. For most routines,
> > +        * lock_page_cgroup() is enough to avoid the race. But we need to take
> > +        * care of IRQ context. If flag updates come from IRQ context, this
> > +        * "move account" will be racy (and cause deadlock in lock_page_cgroup()).
> > +        *
> > +        * Now, the only race we have is the Writeback flag. We wait for it to be
> > +        * cleared before starting our job.
> +        */
> +
> +       lock_page(pc->page);
> +       wait_on_page_writeback(pc->page);
> +
>        lock_page_cgroup(pc);
>        if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
>                __mem_cgroup_move_account(pc, from, to, uncharge);
>                ret = 0;
>        }
>        unlock_page_cgroup(pc);
> +       unlock_page(pc->page);
>        /*
>         * check events
>         */
>
>

Looks good to me.
But let me ask a question.
Why does only move_account need this logic?
Is the deadlock candidate only this place?
How about mem_cgroup_prepare_migration?

unmap_and_move
lock_page
mem_cgroup_prepare_migration
lock_page_cgroup
...
softirq happen
lock_page_cgroup


If the race happens only between move_account and writeback, please describe
it in a comment.
It would help with reviewing the code in the future.
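
Something along these lines, for example (the wording is only a sketch of what such a
comment could say, based on this thread; it is not from a posted patch):

	/*
	 * pc->mem_cgroup can change under us via move_account().  Stat updates
	 * that may run in IRQ context (today, only clearing the Writeback flag)
	 * must not take lock_page_cgroup(), so move_account() first waits for
	 * writeback to finish (lock_page() + wait_on_page_writeback()) before
	 * touching pc.  Any new IRQ-context updater must either be made safe
	 * the same way or use an IRQ-safe lock.
	 */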

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2] memcg: reduce lock time at move charge (Was Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()
  2010-10-07 23:14                       ` Andrew Morton
@ 2010-10-08  1:12                         ` Daisuke Nishimura
  2010-10-08  4:37                         ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 96+ messages in thread
From: Daisuke Nishimura @ 2010-10-08  1:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KAMEZAWA Hiroyuki, Minchan Kim, Greg Thelen, linux-kernel,
	linux-mm, containers, Andrea Righi, Balbir Singh,
	Daisuke Nishimura

On Thu, 7 Oct 2010 16:14:54 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Thu, 7 Oct 2010 17:04:05 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > Now, at task migration among cgroups, memory cgroup scans the page table and moves
> > accounting if flags are properly set.
> > 
> > The core code, mem_cgroup_move_charge_pte_range() does
> > 
> >  	pte_offset_map_lock();
> > 	for all ptes in a page table:
> > 		1. look into page table, find_and_get a page
> > 		2. remove it from LRU.
> > 		3. move charge.
> > 		4. putback to LRU. put_page()
> > 	pte_offset_map_unlock();
> > 
> > for the pte entries of one lowest-level page table.
> > 
> > This pte_offset_map_lock hold time seems a bit long. This patch modifies the routine as
> > 
> > 	for 32 pages: pte_offset_map_lock()
> > 		      find_and_get a page
> > 		      record it
> > 		      pte_offset_map_unlock()
> > 	for all recorded pages
> > 		      isolate it from LRU.
> > 		      move charge
> > 		      putback to LRU
> > 	for all recorded pages
> > 		      put_page()
> 
> The patch makes the code larger, more complex and slower!
> 
Before this patch:
   text    data     bss     dec     hex filename
  27163   11782    4100   43045    a825 mm/memcontrol.o

After this patch:
   text    data     bss     dec     hex filename
  27307   12294    4100   43701    aab5 mm/memcontrol.o

hmm, allocating mc.target[] statically might be bad, but I'm now wondering
whether I could allocate mc itself dynamically (I'll try).

> I do think we're owed a more complete description of its benefits than
> "seems a bit long".  Have problems been observed?  Any measurements
> taken?
> 
IIUC, this patch is necessary for "[PATCH] memcg: lock-free clear page writeback"
later, but I agree we should describe it.

Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2] memcg: reduce lock time at move charge (Was Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()
  2010-10-07 23:14                       ` Andrew Morton
  2010-10-08  1:12                         ` Daisuke Nishimura
@ 2010-10-08  4:37                         ` KAMEZAWA Hiroyuki
  2010-10-08  4:55                           ` Andrew Morton
  1 sibling, 1 reply; 96+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-08  4:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Daisuke Nishimura, Minchan Kim, Greg Thelen, linux-kernel,
	linux-mm, containers, Andrea Righi, Balbir Singh

On Thu, 7 Oct 2010 16:14:54 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Thu, 7 Oct 2010 17:04:05 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > Now, at task migration among cgroups, memory cgroup scans the page table and moves
> > accounting if flags are properly set.
> > 
> > The core code, mem_cgroup_move_charge_pte_range() does
> > 
> >  	pte_offset_map_lock();
> > 	for all ptes in a page table:
> > 		1. look into page table, find_and_get a page
> > 		2. remove it from LRU.
> > 		3. move charge.
> > 		4. putback to LRU. put_page()
> > 	pte_offset_map_unlock();
> > 
> > for the pte entries of one lowest-level page table.
> > 
> > This pte_offset_map_lock hold time seems a bit long. This patch modifies the routine as
> > 
> > 	for 32 pages: pte_offset_map_lock()
> > 		      find_and_get a page
> > 		      record it
> > 		      pte_offset_map_unlock()
> > 	for all recorded pages
> > 		      isolate it from LRU.
> > 		      move charge
> > 		      putback to LRU
> > 	for all recorded pages
> > 		      put_page()
> 
> The patch makes the code larger, more complex and slower!
> 

Slower ?

> I do think we're owed a more complete description of its benefits than
> "seems a bit long".  Have problems been observed?  Any measurements
> taken?
> 

I'll rewrite the whole patch against today's mmotom.

Thanks,
-Kame

> 
> 


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH] memcg: lock-free clear page writeback (Was Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()
  2010-10-07 23:35                   ` Minchan Kim
@ 2010-10-08  4:41                     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 96+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-08  4:41 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Daisuke Nishimura, Greg Thelen, Andrew Morton, linux-kernel,
	linux-mm, containers, Andrea Righi, Balbir Singh

On Fri, 8 Oct 2010 08:35:30 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:

> Hi Kame,
> 
> On Thu, Oct 7, 2010 at 3:24 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > Greg, I think clear_page_writeback() will not require _any_ locks with this patch.
> > But set_page_writeback() requires it...
> > (Maybe adding a special function for clear_page_writeback() is better, rather than
> >  adding some complexity to the switch() in update_page_stat().)
> >
> > ==
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >
> > Now, at page information accounting, we do lock_page_cgroup() if pc->mem_cgroup
> > points to a cgroup where someone is moving charges from.
> >
> > At supporting dirty-page accounting, one of the troubles is the writeback bit.
> > In general, writeback can be cleared via IRQ context. To update the writeback bit
> > with lock_page_cgroup() in a safe way, we'd have to disable IRQs
> > ....or do something.
> >
> > This patch waits for completion of writeback under lock_page() and does
> > lock_page_cgroup() in a safe way. (We never get end_io via IRQ context then.)
> >
> > By this, writeback accounting will never race with account_move() and
> > it can trust pc->mem_cgroup always _without_ any lock.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  mm/memcontrol.c |   18 ++++++++++++++++++
> >  1 file changed, 18 insertions(+)
> >
> > Index: mmotm-0928/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-0928.orig/mm/memcontrol.c
> > +++ mmotm-0928/mm/memcontrol.c
> > @@ -2183,17 +2183,35 @@ static void __mem_cgroup_move_account(st
> >  /*
> >  * check whether the @pc is valid for moving account and call
> >  * __mem_cgroup_move_account()
> > + * Don't call this under pte_lock etc...we'll do lock_page() and wait for
> > + * the end of I/O.
> >  */
> >  static int mem_cgroup_move_account(struct page_cgroup *pc,
> >                struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
> >  {
> >        int ret = -EINVAL;
> > +
> > +       /*
> > > +        * We move several flags and accounting information here. So we need to
> > > +        * avoid races with the update_stat routines. For most routines,
> > > +        * lock_page_cgroup() is enough to avoid the race. But we need to take
> > > +        * care of IRQ context. If flag updates come from IRQ context, this
> > > +        * "move account" will be racy (and cause deadlock in lock_page_cgroup()).
> > > +        *
> > > +        * Now, the only race we have is the Writeback flag. We wait for it to be
> > > +        * cleared before starting our job.
> > +        */
> > +
> > +       lock_page(pc->page);
> > +       wait_on_page_writeback(pc->page);
> > +
> >        lock_page_cgroup(pc);
> >        if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
> >                __mem_cgroup_move_account(pc, from, to, uncharge);
> >                ret = 0;
> >        }
> >        unlock_page_cgroup(pc);
> > +       unlock_page(pc->page);
> >        /*
> >         * check events
> >         */
> >
> >
> 
> Looks good to me.
> But let me ask a question.
> Why does only move_account need this logic?

Because charge/uncharge (adding/removing a page to/from the radix tree or swap cache)
never happens while a page is under PG_writeback.

> Is the deadlock candidate only this place?
yes.

> How about mem_cgroup_prepare_migration?
> 
> unmap_and_move
> lock_page
> mem_cgroup_prepare_migration
> lock_page_cgroup
> ...
> softirq happen
> lock_page_cgroup
> 
> 
Nice catch. I'll move prepare_migration after wait_on_page_writeback().

> If the race happens only between move_account and writeback, please describe
> it in a comment.
> It would help with reviewing the code in the future.
> 

Sure, updates are necessary.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2] memcg: reduce lock time at move charge (Was Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()
  2010-10-08  4:37                         ` KAMEZAWA Hiroyuki
@ 2010-10-08  4:55                           ` Andrew Morton
  2010-10-08  5:12                             ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 96+ messages in thread
From: Andrew Morton @ 2010-10-08  4:55 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, Minchan Kim, Greg Thelen, linux-kernel,
	linux-mm, containers, Andrea Righi, Balbir Singh

On Fri, 8 Oct 2010 13:37:12 +0900 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Thu, 7 Oct 2010 16:14:54 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > On Thu, 7 Oct 2010 17:04:05 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > 
> > > Now, at task migration among cgroups, memory cgroup scans the page table and moves
> > > accounting if flags are properly set.
> > > 
> > > The core code, mem_cgroup_move_charge_pte_range() does
> > > 
> > >  	pte_offset_map_lock();
> > > 	for all ptes in a page table:
> > > 		1. look into page table, find_and_get a page
> > > 		2. remove it from LRU.
> > > 		3. move charge.
> > > 		4. putback to LRU. put_page()
> > > 	pte_offset_map_unlock();
> > > 
> > > for the pte entries of one lowest-level page table.
> > > 
> > > This pte_offset_map_lock hold time seems a bit long. This patch modifies the routine as
> > > 
> > > 	for 32 pages: pte_offset_map_lock()
> > > 		      find_and_get a page
> > > 		      record it
> > > 		      pte_offset_map_unlock()
> > > 	for all recorded pages
> > > 		      isolate it from LRU.
> > > 		      move charge
> > > 		      putback to LRU
> > > 	for all recorded pages
> > > 		      put_page()
> > 
> > The patch makes the code larger, more complex and slower!
> > 
> 
> Slower ?

Sure.  It walks the same data three times, potentially causing
thrashing in the L1 cache.  It takes and releases locks at a higher
frequency.  It increases the text size.


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2] memcg: reduce lock time at move charge (Was Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()
  2010-10-08  4:55                           ` Andrew Morton
@ 2010-10-08  5:12                             ` KAMEZAWA Hiroyuki
  2010-10-08 10:41                               ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 96+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-08  5:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Daisuke Nishimura, Minchan Kim, Greg Thelen, linux-kernel,
	linux-mm, containers, Andrea Righi, Balbir Singh

On Thu, 7 Oct 2010 21:55:56 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Fri, 8 Oct 2010 13:37:12 +0900 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Thu, 7 Oct 2010 16:14:54 -0700
> > Andrew Morton <akpm@linux-foundation.org> wrote:
> > 
> > > On Thu, 7 Oct 2010 17:04:05 +0900
> > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > 
> > > > Now, at task migration among cgroups, memory cgroup scans the page table and moves
> > > > accounting if flags are properly set.
> > > > 
> > > > The core code, mem_cgroup_move_charge_pte_range() does
> > > > 
> > > >  	pte_offset_map_lock();
> > > > 	for all ptes in a page table:
> > > > 		1. look into page table, find_and_get a page
> > > > 		2. remove it from LRU.
> > > > 		3. move charge.
> > > > 		4. putback to LRU. put_page()
> > > > 	pte_offset_map_unlock();
> > > > 
> > > > for the pte entries of one lowest-level page table.
> > > > 
> > > > This pte_offset_map_lock hold time seems a bit long. This patch modifies the routine as
> > > > 
> > > > 	for 32 pages: pte_offset_map_lock()
> > > > 		      find_and_get a page
> > > > 		      record it
> > > > 		      pte_offset_map_unlock()
> > > > 	for all recorded pages
> > > > 		      isolate it from LRU.
> > > > 		      move charge
> > > > 		      putback to LRU
> > > > 	for all recorded pages
> > > > 		      put_page()
> > > 
> > > The patch makes the code larger, more complex and slower!
> > > 
> > 
> > Slower ?
> 
> Sure.  It walks the same data three times, potentially causing
> thrashing in the L1 cache.

Hmm, I'll make this 2 passes, at least.

> It takes and releases locks at a higher frequency.  It increases the text size.
> 

But I don't think page_table_lock is a lock which someone should hold long
enough to do
	1. find_get_page
	2. spin_lock(zone->lock)
	3. remove it from LRU
	4. lock_page_cgroup()
	5. move charge
	6. putback to LRU
for a whole page table's worth of ptes (4096/8 = 512 on x86-64).

I will try to make the routine smarter.

But I want to get rid of page_table_lock -> lock_page_cgroup().

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2] memcg: reduce lock time at move charge (Was Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()
  2010-10-08  5:12                             ` KAMEZAWA Hiroyuki
@ 2010-10-08 10:41                               ` KAMEZAWA Hiroyuki
  2010-10-12  3:39                                 ` Balbir Singh
  2010-10-12  3:56                                 ` Daisuke Nishimura
  0 siblings, 2 replies; 96+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-08 10:41 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Daisuke Nishimura, Minchan Kim, Greg Thelen,
	linux-kernel, linux-mm, containers, Andrea Righi, Balbir Singh

On Fri, 8 Oct 2010 14:12:01 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > Sure.  It walks the same data three times, potentially causing
> > thrashing in the L1 cache.
> 
> Hmm, make this 2 times, at least.
> 
How about this ?
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Presently, at task migration among cgroups, memory cgroup scans page tables and
moves accounting if flags are properly set.


The core code, mem_cgroup_move_charge_pte_range() does

 	pte_offset_map_lock();
	for all ptes in a page table:
		1. look into page table, find_and_get a page
		2. remove it from LRU.
		3. move charge.
		4. putback to LRU. put_page()
	pte_offset_map_unlock();

for the pte entries of one lowest-level page table.

As a planned update, we'll support dirty-page accounting. Because move_charge()
is highly racy, we need to add more checks to move_charge().
For example, lock_page() -> wait_on_page_writeback() -> unlock_page()
is a candidate for such a new check.


This patch modifies the routine as

	for 32 pages: pte_offset_map_lock()
		      find_and_get a page
		      record it
		      pte_offset_map_unlock()
	for all recorded pages
		      isolate it from LRU.
		      move charge
		      putback to LRU
		      put_page()
Code size change is:
(Before)
[kamezawa@bluextal mmotm-1008]$ size mm/memcontrol.o
   text    data     bss     dec     hex filename
  28247    7685    4100   40032    9c60 mm/memcontrol.o
(After)
[kamezawa@bluextal mmotm-1008]$ size mm/memcontrol.o
   text    data     bss     dec     hex filename
  28591    7685    4100   40376    9db8 mm/memcontrol.o

Easy benchmark score.

Moving a task with 2GB of anonymous memory between cgroup/A and cgroup/B.
 <===== marks a function called under pte_lock.
Before patch:

real    0m42.346s
user    0m0.002s
sys     0m39.668s

    13.88%  swap_task.sh  [kernel.kallsyms]  [k] put_page	     <=====
    10.37%  swap_task.sh  [kernel.kallsyms]  [k] isolate_lru_page    <===== 
    10.25%  swap_task.sh  [kernel.kallsyms]  [k] is_target_pte_for_mc  <=====
     7.85%  swap_task.sh  [kernel.kallsyms]  [k] mem_cgroup_move_account <=====
     7.63%  swap_task.sh  [kernel.kallsyms]  [k] lookup_page_cgroup      <=====
     6.96%  swap_task.sh  [kernel.kallsyms]  [k] ____pagevec_lru_add
     6.43%  swap_task.sh  [kernel.kallsyms]  [k] mem_cgroup_del_lru_list
     6.31%  swap_task.sh  [kernel.kallsyms]  [k] putback_lru_page
     5.28%  swap_task.sh  [kernel.kallsyms]  [k] mem_cgroup_add_lru_list
     3.58%  swap_task.sh  [kernel.kallsyms]  [k] __lru_cache_add
     3.57%  swap_task.sh  [kernel.kallsyms]  [k] _raw_spin_lock_irq
     3.06%  swap_task.sh  [kernel.kallsyms]  [k] release_pages
     2.35%  swap_task.sh  [kernel.kallsyms]  [k] mem_cgroup_get_reclaim_stat_from_page
     2.31%  swap_task.sh  [kernel.kallsyms]  [k] mem_cgroup_move_charge_pte_range
     1.80%  swap_task.sh  [kernel.kallsyms]  [k] memcg_check_events
     1.59%  swap_task.sh  [kernel.kallsyms]  [k] page_evictable
     1.55%  swap_task.sh  [kernel.kallsyms]  [k] vm_normal_page
     1.53%  swap_task.sh  [kernel.kallsyms]  [k] mem_cgroup_charge_statistics

After patch:

real    0m43.440s
user    0m0.000s
sys     0m40.704s
    13.68%  swap_task.sh  [kernel.kallsyms]  [k] is_target_pte_for_mc <====
    13.29%  swap_task.sh  [kernel.kallsyms]  [k] put_page
    10.34%  swap_task.sh  [kernel.kallsyms]  [k] isolate_lru_page
     7.48%  swap_task.sh  [kernel.kallsyms]  [k] lookup_page_cgroup
     7.42%  swap_task.sh  [kernel.kallsyms]  [k] mem_cgroup_move_account
     6.98%  swap_task.sh  [kernel.kallsyms]  [k] ____pagevec_lru_add
     6.15%  swap_task.sh  [kernel.kallsyms]  [k] putback_lru_page
     5.46%  swap_task.sh  [kernel.kallsyms]  [k] mem_cgroup_add_lru_list
     5.00%  swap_task.sh  [kernel.kallsyms]  [k] mem_cgroup_del_lru_list
     3.38%  swap_task.sh  [kernel.kallsyms]  [k] _raw_spin_lock_irq
     3.31%  swap_task.sh  [kernel.kallsyms]  [k] __lru_cache_add
     3.02%  swap_task.sh  [kernel.kallsyms]  [k] release_pages
     2.24%  swap_task.sh  [kernel.kallsyms]  [k] mem_cgroup_get_reclaim_stat_from_page
     2.04%  swap_task.sh  [kernel.kallsyms]  [k] mem_cgroup_move_charge_pte_range
     1.84%  swap_task.sh  [kernel.kallsyms]  [k] memcg_check_events

I think this meets our trade-off between speed and moving the work out of the lock
to allow lockless updates of page_cgroup information (to be done).

Changelog: v2->v3
 - rebased onto mmotm 1008
 - reduced the number of loops.
 - clean-ups: reduced unnecessary switch, break, continue, goto.
 - added kzalloc again.

Changelog: v1->v2
 - removed kzalloc() of mc_target. preallocate it on "mc"

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |  129 ++++++++++++++++++++++++++++++++------------------------
 1 file changed, 76 insertions(+), 53 deletions(-)

Index: mmotm-1008/mm/memcontrol.c
===================================================================
--- mmotm-1008.orig/mm/memcontrol.c
+++ mmotm-1008/mm/memcontrol.c
@@ -276,6 +276,21 @@ enum move_type {
 	NR_MOVE_TYPE,
 };
 
+enum mc_target_type {
+	MC_TARGET_NONE, /* used as failure code(0) */
+	MC_TARGET_PAGE,
+	MC_TARGET_SWAP,
+};
+
+struct mc_target {
+	enum mc_target_type type;
+	union {
+		struct page *page;
+		swp_entry_t ent;
+	} val;
+};
+#define MC_MOVE_ONCE	(16)
+
 /* "mc" and its members are protected by cgroup_mutex */
 static struct move_charge_struct {
 	spinlock_t	  lock; /* for from, to, moving_task */
@@ -4479,16 +4494,7 @@ one_by_one:
  *
  * Called with pte lock held.
  */
-union mc_target {
-	struct page	*page;
-	swp_entry_t	ent;
-};
 
-enum mc_target_type {
-	MC_TARGET_NONE,	/* not used */
-	MC_TARGET_PAGE,
-	MC_TARGET_SWAP,
-};
 
 static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
 						unsigned long addr, pte_t ptent)
@@ -4565,7 +4571,7 @@ static struct page *mc_handle_file_pte(s
 }
 
 static int is_target_pte_for_mc(struct vm_area_struct *vma,
-		unsigned long addr, pte_t ptent, union mc_target *target)
+		unsigned long addr, pte_t ptent, struct mc_target *target)
 {
 	struct page *page = NULL;
 	struct page_cgroup *pc;
@@ -4590,8 +4596,10 @@ static int is_target_pte_for_mc(struct v
 		 */
 		if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
 			ret = MC_TARGET_PAGE;
-			if (target)
-				target->page = page;
+			if (target) {
+				target->val.page = page;
+				target->type = ret;
+			}
 		}
 		if (!ret || !target)
 			put_page(page);
@@ -4600,8 +4608,10 @@ static int is_target_pte_for_mc(struct v
 	if (ent.val && !ret &&
 			css_id(&mc.from->css) == lookup_swap_cgroup(ent)) {
 		ret = MC_TARGET_SWAP;
-		if (target)
-			target->ent = ent;
+		if (target) {
+			target->val.ent = ent;
+			target->type = ret;
+		}
 	}
 	return ret;
 }
@@ -4761,68 +4771,81 @@ static int mem_cgroup_move_charge_pte_ra
 {
 	int ret = 0;
 	struct vm_area_struct *vma = walk->private;
+	struct mc_target *info, *mt;
+	struct page_cgroup *pc;
 	pte_t *pte;
 	spinlock_t *ptl;
+	int num;
+
+	info = kzalloc(sizeof(struct mc_target) *MC_MOVE_ONCE, GFP_KERNEL);
+	if (!info)
+		return -ENOMEM;
 
 retry:
+	/*
+	 * We want to move accounts without taking pte_offset_map_lock() because
+	 * "move" may need to wait for some event to complete (in the future).
+	 * In the 1st half, scan the page table and grab pages.  In the 2nd half,
+	 * isolate them from the LRU and overwrite the page_cgroup information.
+	 */
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
-	for (; addr != end; addr += PAGE_SIZE) {
+	for (num = 0; num < MC_MOVE_ONCE && addr != end; addr += PAGE_SIZE) {
 		pte_t ptent = *(pte++);
-		union mc_target target;
-		int type;
-		struct page *page;
-		struct page_cgroup *pc;
-		swp_entry_t ent;
+		ret = is_target_pte_for_mc(vma, addr, ptent, info + num);
+		if (!ret)
+			continue;
+		num++;
+	}
+	pte_unmap_unlock(pte - 1, ptl);
+	cond_resched();
 
-		if (!mc.precharge)
-			break;
+	mt = info;
 
-		type = is_target_pte_for_mc(vma, addr, ptent, &target);
-		switch (type) {
+	while (mc.precharge < num) {
+		ret = mem_cgroup_do_precharge(1);
+		if (ret)
+			goto err_out;
+	}
+
+	for (ret = 0; mt < info + num; mt++) {
+		switch (mt->type) {
 		case MC_TARGET_PAGE:
-			page = target.page;
-			if (isolate_lru_page(page))
-				goto put;
-			pc = lookup_page_cgroup(page);
-			if (!mem_cgroup_move_account(pc,
+			if (!isolate_lru_page(mt->val.page)) {
+				pc = lookup_page_cgroup(mt->val.page);
+				if (!mem_cgroup_move_account(pc,
 						mc.from, mc.to, false)) {
-				mc.precharge--;
-				/* we uncharge from mc.from later. */
-				mc.moved_charge++;
+					mc.precharge--;
+					/* we uncharge from mc.from later. */
+					mc.moved_charge++;
+				}
+				putback_lru_page(mt->val.page);
 			}
-			putback_lru_page(page);
-put:			/* is_target_pte_for_mc() gets the page */
-			put_page(page);
+			put_page(mt->val.page);
 			break;
 		case MC_TARGET_SWAP:
-			ent = target.ent;
-			if (!mem_cgroup_move_swap_account(ent,
+			if (!mem_cgroup_move_swap_account(mt->val.ent,
 						mc.from, mc.to, false)) {
-				mc.precharge--;
 				/* we fixup refcnts and charges later. */
+				mc.precharge--;
 				mc.moved_swap++;
 			}
-			break;
 		default:
 			break;
 		}
 	}
-	pte_unmap_unlock(pte - 1, ptl);
-	cond_resched();
-
-	if (addr != end) {
-		/*
-		 * We have consumed all precharges we got in can_attach().
-		 * We try charge one by one, but don't do any additional
-		 * charges to mc.to if we have failed in charge once in attach()
-		 * phase.
-		 */
-		ret = mem_cgroup_do_precharge(1);
-		if (!ret)
-			goto retry;
-	}
 
+	if (addr != end)
+		goto retry;
+out:
+	kfree(info);
 	return ret;
+err_out:
+	for (; mt < info + num; mt++)
+		if (mt->type == MC_TARGET_PAGE) {
+			putback_lru_page(mt->val.page);
+			put_page(mt->val.page);
+		}
+	goto out;
 }
 
 static void mem_cgroup_move_charge(struct mm_struct *mm)


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 07/10] memcg: add dirty limits to mem_cgroup
  2010-10-07  0:48           ` KAMEZAWA Hiroyuki
@ 2010-10-12  0:24             ` Greg Thelen
  2010-10-12  0:55               ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 96+ messages in thread
From: Greg Thelen @ 2010-10-12  0:24 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrea Righi, Andrew Morton, linux-kernel, linux-mm, containers,
	Balbir Singh, Daisuke Nishimura

KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:

> On Wed, 06 Oct 2010 17:27:13 -0700
> Greg Thelen <gthelen@google.com> wrote:
>
>> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:
>> 
>> > On Tue, 05 Oct 2010 12:00:17 -0700
>> > Greg Thelen <gthelen@google.com> wrote:
>> >
>> >> Andrea Righi <arighi@develer.com> writes:
>> >> 
>> >> > On Sun, Oct 03, 2010 at 11:58:02PM -0700, Greg Thelen wrote:
>> >> >> Extend mem_cgroup to contain dirty page limits.  Also add routines
>> >> >> allowing the kernel to query the dirty usage of a memcg.
>> >> >> 
>> >> >> These interfaces not used by the kernel yet.  A subsequent commit
>> >> >> will add kernel calls to utilize these new routines.
>> >> >
>> >> > A small note below.
>> >> >
>> >> >> 
>> >> >> Signed-off-by: Greg Thelen <gthelen@google.com>
>> >> >> Signed-off-by: Andrea Righi <arighi@develer.com>
>> >> >> ---
>> >> >>  include/linux/memcontrol.h |   44 +++++++++++
>> >> >>  mm/memcontrol.c            |  180 +++++++++++++++++++++++++++++++++++++++++++-
>> >> >>  2 files changed, 223 insertions(+), 1 deletions(-)
>> >> >> 
>> >> >> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> >> >> index 6303da1..dc8952d 100644
>> >> >> --- a/include/linux/memcontrol.h
>> >> >> +++ b/include/linux/memcontrol.h
>> >> >> @@ -19,6 +19,7 @@
>> >> >>  
>> >> >>  #ifndef _LINUX_MEMCONTROL_H
>> >> >>  #define _LINUX_MEMCONTROL_H
>> >> >> +#include <linux/writeback.h>
>> >> >>  #include <linux/cgroup.h>
>> >> >>  struct mem_cgroup;
>> >> >>  struct page_cgroup;
>> >> >> @@ -33,6 +34,30 @@ enum mem_cgroup_write_page_stat_item {
>> >> >>  	MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
>> >> >>  };
>> >> >>  
>> >> >> +/* Cgroup memory statistics items exported to the kernel */
>> >> >> +enum mem_cgroup_read_page_stat_item {
>> >> >> +	MEMCG_NR_DIRTYABLE_PAGES,
>> >> >> +	MEMCG_NR_RECLAIM_PAGES,
>> >> >> +	MEMCG_NR_WRITEBACK,
>> >> >> +	MEMCG_NR_DIRTY_WRITEBACK_PAGES,
>> >> >> +};
>> >> >> +
>> >> >> +/* Dirty memory parameters */
>> >> >> +struct vm_dirty_param {
>> >> >> +	int dirty_ratio;
>> >> >> +	int dirty_background_ratio;
>> >> >> +	unsigned long dirty_bytes;
>> >> >> +	unsigned long dirty_background_bytes;
>> >> >> +};
>> >> >> +
>> >> >> +static inline void get_global_vm_dirty_param(struct vm_dirty_param *param)
>> >> >> +{
>> >> >> +	param->dirty_ratio = vm_dirty_ratio;
>> >> >> +	param->dirty_bytes = vm_dirty_bytes;
>> >> >> +	param->dirty_background_ratio = dirty_background_ratio;
>> >> >> +	param->dirty_background_bytes = dirty_background_bytes;
>> >> >> +}
>> >> >> +
>> >> >>  extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
>> >> >>  					struct list_head *dst,
>> >> >>  					unsigned long *scanned, int order,
>> >> >> @@ -145,6 +170,10 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
>> >> >>  	mem_cgroup_update_page_stat(page, idx, -1);
>> >> >>  }
>> >> >>  
>> >> >> +bool mem_cgroup_has_dirty_limit(void);
>> >> >> +void get_vm_dirty_param(struct vm_dirty_param *param);
>> >> >> +s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item);
>> >> >> +
>> >> >>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>> >> >>  						gfp_t gfp_mask);
>> >> >>  u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
>> >> >> @@ -326,6 +355,21 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
>> >> >>  {
>> >> >>  }
>> >> >>  
>> >> >> +static inline bool mem_cgroup_has_dirty_limit(void)
>> >> >> +{
>> >> >> +	return false;
>> >> >> +}
>> >> >> +
>> >> >> +static inline void get_vm_dirty_param(struct vm_dirty_param *param)
>> >> >> +{
>> >> >> +	get_global_vm_dirty_param(param);
>> >> >> +}
>> >> >> +
>> >> >> +static inline s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item)
>> >> >> +{
>> >> >> +	return -ENOSYS;
>> >> >> +}
>> >> >> +
>> >> >>  static inline
>> >> >>  unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>> >> >>  					    gfp_t gfp_mask)
>> >> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> >> >> index f40839f..6ec2625 100644
>> >> >> --- a/mm/memcontrol.c
>> >> >> +++ b/mm/memcontrol.c
>> >> >> @@ -233,6 +233,10 @@ struct mem_cgroup {
>> >> >>  	atomic_t	refcnt;
>> >> >>  
>> >> >>  	unsigned int	swappiness;
>> >> >> +
>> >> >> +	/* control memory cgroup dirty pages */
>> >> >> +	struct vm_dirty_param dirty_param;
>> >> >> +
>> >> >>  	/* OOM-Killer disable */
>> >> >>  	int		oom_kill_disable;
>> >> >>  
>> >> >> @@ -1132,6 +1136,172 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
>> >> >>  	return swappiness;
>> >> >>  }
>> >> >>  
>> >> >> +/*
>> >> >> + * Returns a snapshot of the current dirty limits which is not synchronized with
>> >> >> + * the routines that change the dirty limits.  If this routine races with an
>> >> >> + * update to the dirty bytes/ratio value, then the caller must handle the case
>> >> >> + * where both dirty_[background_]_ratio and _bytes are set.
>> >> >> + */
>> >> >> +static void __mem_cgroup_get_dirty_param(struct vm_dirty_param *param,
>> >> >> +					 struct mem_cgroup *mem)
>> >> >> +{
>> >> >> +	if (mem && !mem_cgroup_is_root(mem)) {
>> >> >> +		param->dirty_ratio = mem->dirty_param.dirty_ratio;
>> >> >> +		param->dirty_bytes = mem->dirty_param.dirty_bytes;
>> >> >> +		param->dirty_background_ratio =
>> >> >> +			mem->dirty_param.dirty_background_ratio;
>> >> >> +		param->dirty_background_bytes =
>> >> >> +			mem->dirty_param.dirty_background_bytes;
>> >> >> +	} else {
>> >> >> +		get_global_vm_dirty_param(param);
>> >> >> +	}
>> >> >> +}
>> >> >> +
>> >> >> +/*
>> >> >> + * Get dirty memory parameters of the current memcg or global values (if memory
>> >> >> + * cgroups are disabled or querying the root cgroup).
>> >> >> + */
>> >> >> +void get_vm_dirty_param(struct vm_dirty_param *param)
>> >> >> +{
>> >> >> +	struct mem_cgroup *memcg;
>> >> >> +
>> >> >> +	if (mem_cgroup_disabled()) {
>> >> >> +		get_global_vm_dirty_param(param);
>> >> >> +		return;
>> >> >> +	}
>> >> >> +
>> >> >> +	/*
>> >> >> +	 * It's possible that "current" may be moved to other cgroup while we
>> >> >> +	 * access cgroup. But precise check is meaningless because the task can
>> >> >> +	 * be moved after our access and writeback tends to take long time.  At
>> >> >> +	 * least, "memcg" will not be freed under rcu_read_lock().
>> >> >> +	 */
>> >> >> +	rcu_read_lock();
>> >> >> +	memcg = mem_cgroup_from_task(current);
>> >> >> +	__mem_cgroup_get_dirty_param(param, memcg);
>> >> >> +	rcu_read_unlock();
>> >> >> +}
>> >> >> +
>> >> >> +/*
>> >> >> + * Check if current memcg has local dirty limits.  Return true if the current
>> >> >> + * memory cgroup has local dirty memory settings.
>> >> >> + */
>> >> >> +bool mem_cgroup_has_dirty_limit(void)
>> >> >> +{
>> >> >> +	struct mem_cgroup *mem;
>> >> >> +
>> >> >> +	if (mem_cgroup_disabled())
>> >> >> +		return false;
>> >> >> +
>> >> >> +	mem = mem_cgroup_from_task(current);
>> >> >> +	return mem && !mem_cgroup_is_root(mem);
>> >> >> +}
>> >> >
>> >> > We only check the pointer without dereferencing it, so this is probably
>> >> > ok, but maybe this is safer:
>> >> >
>> >> > bool mem_cgroup_has_dirty_limit(void)
>> >> > {
>> >> > 	struct mem_cgroup *mem;
>> >> > 	bool ret;
>> >> >
>> >> > 	if (mem_cgroup_disabled())
>> >> > 		return false;
>> >> >
>> >> > 	rcu_read_lock();
>> >> > 	mem = mem_cgroup_from_task(current);
>> >> > 	ret = mem && !mem_cgroup_is_root(mem);
>> >> > 	rcu_read_unlock();
>> >> >
>> >> > 	return ret;
>> >> > }
>> >> >
>> >> > rcu_read_lock() should be held in mem_cgroup_from_task(), otherwise
>> >> > lockdep could detect this as an error.
>> >> >
>> >> > Thanks,
>> >> > -Andrea
>> >> 
>> >> Good suggestion.  I agree that lockdep might catch this.  There are some
>> >> unrelated debug_locks failures (even without my patches) that I worked
>> >> around to get lockdep to complain about this one.  I applied your
>> >> suggested fix and lockdep was happy.  I will incorporate this fix into
>> >> the next revision of the patch series.
>> >> 
>> >
>> > Hmm, considering other parts, shouldn't we define mem_cgroup_from_task
>> > as macro ?
>> >
>> > Thanks,
>> > -Kame
>> 
>> Is your motivation to increase performance with the same functionality?
>> If so, then would a 'static inline' be performance equivalent to a
>> preprocessor macro yet be safer to use?
>> 
> Ah, if lockdep finds this as a bug, I think other parts will hit this
> too, like this.
>> static struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
>> {
>>         struct mem_cgroup *mem = NULL;
>> 
>>         if (!mm)
>>                 return NULL;
>>         /*
>>          * Because we have no locks, mm->owner's may be being moved to other
>>          * cgroup. We use css_tryget() here even if this looks
>>          * pessimistic (rather than adding locks here).
>>          */
>>         rcu_read_lock();
>>         do {
>>                 mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
>>                 if (unlikely(!mem))
>>                         break;
>>         } while (!css_tryget(&mem->css));
>>         rcu_read_unlock();
>>         return mem;
>> }

mem_cgroup_from_task() calls task_subsys_state() calls
task_subsys_state_check().  task_subsys_state_check() will be happy if
rcu_read_lock is held.

I don't think that this will fail lockdep, because rcu_read_lock_held()
is true when calling mem_cgroup_from_task() within
try_get_mem_cgroup_from_mm()..

> mem_cgroup_from_task() is designed to be used like this.
> If defined as a macro, I think it will not be caught.

I do not understand how making mem_cgroup_from_task() a macro will
change its behavior wrt. to lockdep assertion checking.  I assume that
as a macro mem_cgroup_from_task() would still call task_subsys_state(),
which requires either:
a) rcu read lock held
b) task->alloc_lock held
c) cgroup lock held


>> Maybe it makes more sense to find a way to perform this check in
>> mem_cgroup_has_dirty_limit() without needing to grab the rcu lock.  I
>> think this lock grab is unneeded.  I am still collecting performance
>> data, but suspect that this may be making the code slower than it needs
>> to be.
>> 
>
> Hmm. css_set[] itself is freed by RCU..what idea to remove rcu_read_lock() do
> you have ? Adding some flags ?

It seems like a shame to need a lock to determine if current is in the
root cgroup.  Especially given that as soon as
mem_cgroup_has_dirty_limit() returns, the task could be moved
in-to/out-of the root cgroup, thereby invalidating the answer.  So the
answer is just a sample that may be wrong.  But I think you are correct.
We will need the rcu read lock in mem_cgroup_has_dirty_limit().

> Ah...I noticed that you should do
>
>  mem = mem_cgroup_from_task(current->mm->owner);
>
> to check has_dirty_limit...

What are the cases where current->mm->owner->cgroups !=
current->cgroups?

I was hoping to avoid having to add even more logic into
mem_cgroup_has_dirty_limit() to handle the case where current->mm is
NULL.

Presumably the newly proposed vm_dirty_param(),
mem_cgroup_has_dirty_limit(), and mem_cgroup_page_stat() routines all
need to use the same logic.  I assume they should all be consistently
using current->mm->owner or current.
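
To make that concrete, here is a sketch of a shared helper (the name
mem_cgroup_for_dirty_param() is made up for illustration; this is not
code from the posted series) that would let all three routines make the
same choice:

static struct mem_cgroup *mem_cgroup_for_dirty_param(void)
{
	/*
	 * Open question in this thread: use "current" directly, or
	 * current->mm->owner (which needs a NULL current->mm check)?
	 * Shown here with "current".  Caller must hold rcu_read_lock().
	 */
	return mem_cgroup_from_task(current);
}

bool mem_cgroup_has_dirty_limit(void)
{
	struct mem_cgroup *mem;
	bool ret;

	if (mem_cgroup_disabled())
		return false;

	rcu_read_lock();
	mem = mem_cgroup_for_dirty_param();
	ret = mem && !mem_cgroup_is_root(mem);
	rcu_read_unlock();

	return ret;
}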

--
Greg

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 07/10] memcg: add dirty limits to mem_cgroup
  2010-10-12  0:24             ` Greg Thelen
@ 2010-10-12  0:55               ` KAMEZAWA Hiroyuki
  2010-10-12  7:32                 ` Greg Thelen
  0 siblings, 1 reply; 96+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-12  0:55 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrea Righi, Andrew Morton, linux-kernel, linux-mm, containers,
	Balbir Singh, Daisuke Nishimura

On Mon, 11 Oct 2010 17:24:21 -0700
Greg Thelen <gthelen@google.com> wrote:

> >> Is your motivation to increase performance with the same functionality?
> >> If so, then would a 'static inline' be performance equivalent to a
> >> preprocessor macro yet be safer to use?
> >> 
> > Ah, if lockdep finds this as a bug, I think other parts will hit this,
> > too, like this:
> >> static struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
> >> {
> >>         struct mem_cgroup *mem = NULL;
> >> 
> >>         if (!mm)
> >>                 return NULL;
> >>         /*
> >>          * Because we have no locks, mm->owner's may be being moved to other
> >>          * cgroup. We use css_tryget() here even if this looks
> >>          * pessimistic (rather than adding locks here).
> >>          */
> >>         rcu_read_lock();
> >>         do {
> >>                 mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
> >>                 if (unlikely(!mem))
> >>                         break;
> >>         } while (!css_tryget(&mem->css));
> >>         rcu_read_unlock();
> >>         return mem;
> >> }
> 
> mem_cgroup_from_task() calls task_subsys_state() calls
> task_subsys_state_check().  task_subsys_state_check() will be happy if
> rcu_read_lock is held.
> 
yes.

> I don't think that this will fail lockdep, because rcu_read_lock_held()
> is true when calling mem_cgroup_from_task() within
> try_get_mem_cgroup_from_mm()..
> 
agreed.

> > mem_cgroup_from_task() is designed to be used like this.
> > If defined as a macro, I think it will not be caught.
> 
> I do not understand how making mem_cgroup_from_task() a macro will
> change its behavior wrt. to lockdep assertion checking.  I assume that
> as a macro mem_cgroup_from_task() would still call task_subsys_state(),
> which requires either:
> a) rcu read lock held
> b) task->alloc_lock held
> c) cgroup lock held
> 

Hmm. Maybe I was wrong.

> 
> >> Maybe it makes more sense to find a way to perform this check in
> >> mem_cgroup_has_dirty_limit() without needing to grab the rcu lock.  I
> >> think this lock grab is unneeded.  I am still collecting performance
> >> data, but suspect that this may be making the code slower than it needs
> >> to be.
> >> 
> >
> > Hmm. css_set[] itself is freed by RCU..what idea to remove rcu_read_lock() do
> > you have ? Adding some flags ?
> 
> It seems like a shame to need a lock to determine if current is in the
> root cgroup.  Especially given that as soon as
> mem_cgroup_has_dirty_limit() returns, the task could be moved
> in-to/out-of the root cgroup, thereby invalidating the answer.  So the
> answer is just a sample that may be wrong. 

Yes. But it's not a bug but a specification.

> But I think you are correct.
> We will need the rcu read lock in mem_cgroup_has_dirty_limit().
> 

yes.


> > Ah...I noticed that you should do
> >
> >  mem = mem_cgroup_from_task(current->mm->owner);
> >
> > to check has_dirty_limit...
> 
> What are the cases where current->mm->owner->cgroups !=
> current->cgroups?
> 
In that case, assume group A and B.

   thread(1) -> belongs to cgroup A  (thread(1) is mm->owner)
   thread(2) -> belongs to cgroup B
and
   a page    -> charged to cgroup A

Then, thread(2) makes the page dirty, which is under cgroup A.

In this case, if the page's dirty_pages accounting is added to cgroup B, cgroup B's
statistics may show "dirty_pages > all_lru_pages". This is a bug.


> I was hoping to avoid having to add even more logic into
> mem_cgroup_has_dirty_limit() to handle the case where current->mm is
> NULL.
> 

Please check current->mm. We can't limit the work of kernel threads this way; let's
consider it later if necessary.

> Presumably the newly proposed vm_dirty_param(),
> mem_cgroup_has_dirty_limit(), and mem_cgroup_page_stat() routines all
> need to use the same logic.  I assume they should all be consistently
> using current->mm->owner or current.
> 

please.

Thanks,
-Kame




^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2] memcg: reduce lock time at move charge (Was Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()
  2010-10-08 10:41                               ` KAMEZAWA Hiroyuki
@ 2010-10-12  3:39                                 ` Balbir Singh
  2010-10-12  3:42                                   ` KAMEZAWA Hiroyuki
  2010-10-12  3:56                                 ` Daisuke Nishimura
  1 sibling, 1 reply; 96+ messages in thread
From: Balbir Singh @ 2010-10-12  3:39 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Daisuke Nishimura, Minchan Kim, Greg Thelen,
	linux-kernel, linux-mm, containers, Andrea Righi

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-10-08 19:41:31]:

> On Fri, 8 Oct 2010 14:12:01 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > Sure.  It walks the same data three times, potentially causing
> > > thrashing in the L1 cache.
> > 
> > Hmm, make this 2 times, at least.
> > 
> How about this ?
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Presently, at task migration among cgroups, memory cgroup scans page tables and
> moves accounting if flags are properly set.
> 
> 
> The core code, mem_cgroup_move_charge_pte_range() does
> 
>  	pte_offset_map_lock();
> 	for all ptes in a page table:
> 		1. look into page table, find_and_get a page
> 		2. remove it from LRU.
> 		3. move charge.
> 		4. putback to LRU. put_page()
> 	pte_offset_map_unlock();
> 
> for pte entries on a 3rd level? page table.
> 
> As a planned update, we'll support dirty-page accounting. Because move_charge()
> is highly racy, we need to add more checks in move_charge().
> For example, lock_page() -> wait_on_page_writeback() -> unlock_page()
> is a candidate for a new check.
>


Is this a change to help dirty limits, or is it a generic bug fix?
 
-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2] memcg: reduce lock time at move charge (Was Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()
  2010-10-12  3:39                                 ` Balbir Singh
@ 2010-10-12  3:42                                   ` KAMEZAWA Hiroyuki
  2010-10-12  3:54                                     ` Balbir Singh
  0 siblings, 1 reply; 96+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-12  3:42 UTC (permalink / raw)
  To: balbir
  Cc: Andrew Morton, Daisuke Nishimura, Minchan Kim, Greg Thelen,
	linux-kernel, linux-mm, containers, Andrea Righi

On Tue, 12 Oct 2010 09:09:15 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-10-08 19:41:31]:
> 
> > On Fri, 8 Oct 2010 14:12:01 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > Sure.  It walks the same data three times, potentially causing
> > > > thrashing in the L1 cache.
> > > 
> > > Hmm, make this 2 times, at least.
> > > 
> > How about this ?
> > ==
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > Presently, at task migration among cgroups, memory cgroup scans page tables and
> > moves accounting if flags are properly set.
> > 
> > 
> > The core code, mem_cgroup_move_charge_pte_range() does
> > 
> >  	pte_offset_map_lock();
> > 	for all ptes in a page table:
> > 		1. look into page table, find_and_get a page
> > 		2. remove it from LRU.
> > 		3. move charge.
> > 		4. putback to LRU. put_page()
> > 	pte_offset_map_unlock();
> > 
> > for pte entries on a 3rd level? page table.
> > 
> > As a planned update, we'll support dirty-page accounting. Because move_charge()
> > is highly racy, we need to add more checks in move_charge().
> > For example, lock_page() -> wait_on_page_writeback() -> unlock_page()
> > is a candidate for a new check.
> >
> 
> 
> Is this a change to help dirty limits, or is it a generic bug fix?
>  
Not a bug fix. This is for adding lock_page() to move_charge(). It helps us
to remove the "irq disable" in update_stat().


Thanks,
-Kame



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2] memcg: reduce lock time at move charge (Was Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()
  2010-10-12  3:42                                   ` KAMEZAWA Hiroyuki
@ 2010-10-12  3:54                                     ` Balbir Singh
  0 siblings, 0 replies; 96+ messages in thread
From: Balbir Singh @ 2010-10-12  3:54 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrea Righi, Daisuke Nishimura, linux-kernel, linux-mm, Greg,
	containers, Andrew Morton

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-10-12 12:42:53]:

> On Tue, 12 Oct 2010 09:09:15 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2010-10-08 19:41:31]:
> > 
> > > On Fri, 8 Oct 2010 14:12:01 +0900
> > > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > > > Sure.  It walks the same data three times, potentially causing
> > > > > thrashing in the L1 cache.
> > > > 
> > > > Hmm, make this 2 times, at least.
> > > > 
> > > How about this ?
> > > ==
> > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > 
> > > Presently, at task migration among cgroups, memory cgroup scans page tables and
> > > moves accounting if flags are properly set.
> > > 
> > > 
> > > The core code, mem_cgroup_move_charge_pte_range() does
> > > 
> > >  	pte_offset_map_lock();
> > > 	for all ptes in a page table:
> > > 		1. look into page table, find_and_get a page
> > > 		2. remove it from LRU.
> > > 		3. move charge.
> > > 		4. putback to LRU. put_page()
> > > 	pte_offset_map_unlock();
> > > 
> > > for pte entries on a 3rd level? page table.
> > > 
> > > As a planned update, we'll support dirty-page accounting. Because move_charge()
> > > is highly racy, we need to add more checks in move_charge().
> > > For example, lock_page() -> wait_on_page_writeback() -> unlock_page()
> > > is a candidate for a new check.
> > >
> > 
> > 
> > Is this a change to help dirty limits, or is it a generic bug fix?
> >  
> Not a bug fix. This is for adding lock_page() to move_charge(). It helps us
> to remove the "irq disable" in update_stat().
>

Excellent! Thanks 

-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2] memcg: reduce lock time at move charge (Was Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()
  2010-10-08 10:41                               ` KAMEZAWA Hiroyuki
  2010-10-12  3:39                                 ` Balbir Singh
@ 2010-10-12  3:56                                 ` Daisuke Nishimura
  2010-10-12  5:01                                   ` KAMEZAWA Hiroyuki
  2010-10-12  5:48                                   ` [PATCH v4] memcg: reduce lock time at move charge KAMEZAWA Hiroyuki
  1 sibling, 2 replies; 96+ messages in thread
From: Daisuke Nishimura @ 2010-10-12  3:56 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Minchan Kim, Greg Thelen, linux-kernel, linux-mm,
	containers, Andrea Righi, Balbir Singh, Daisuke Nishimura

> +err_out:
> +	for (; mt < info + num; mt++)
> +		if (mt->type == MC_TARGET_PAGE) {
> +			putback_lru_page(mt->val.page);
Is this putback_lru_page() necessary ?
is_target_pte_for_mc() doesn't isolate the page.

Thanks,
Daisuke Nishimura.


> +			put_page(mt->val.page);
> +		}
> +	goto out;
>  }
>  
>  static void mem_cgroup_move_charge(struct mm_struct *mm)
> 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v2] memcg: reduce lock time at move charge (Was Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()
  2010-10-12  3:56                                 ` Daisuke Nishimura
@ 2010-10-12  5:01                                   ` KAMEZAWA Hiroyuki
  2010-10-12  5:48                                   ` [PATCH v4] memcg: reduce lock time at move charge KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 96+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-12  5:01 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: Andrew Morton, Minchan Kim, Greg Thelen, linux-kernel, linux-mm,
	containers, Andrea Righi, Balbir Singh

On Tue, 12 Oct 2010 12:56:13 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> > +err_out:
> > +	for (; mt < info + num; mt++)
> > +		if (mt->type == MC_TARGET_PAGE) {
> > +			putback_lru_page(mt->val.page);
> Is this putback_lru_page() necessary ?
> is_target_pte_for_mc() doesn't isolate the page.
> 
Unnecessary, will post v2.

I'm sorry for my low-quality patches :(

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()
  2010-10-04  6:57 ` [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup() Greg Thelen
  2010-10-05  6:54   ` KAMEZAWA Hiroyuki
  2010-10-05 16:03   ` Minchan Kim
@ 2010-10-12  5:39   ` Balbir Singh
  2 siblings, 0 replies; 96+ messages in thread
From: Balbir Singh @ 2010-10-12  5:39 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	KAMEZAWA Hiroyuki, Daisuke Nishimura

* Greg Thelen <gthelen@google.com> [2010-10-03 23:57:59]:

> If pages are being migrated from a memcg, then updates to that
> memcg's page statistics are protected by grabbing a bit spin lock
> using lock_page_cgroup().  In an upcoming commit memcg dirty page
> accounting will be updating memcg page accounting (specifically:
> num writeback pages) from softirq.  Avoid a deadlocking nested
> spin lock attempt by disabling interrupts on the local processor
> when grabbing the page_cgroup bit_spin_lock in lock_page_cgroup().
> This avoids the following deadlock:
> statistic
>       CPU 0             CPU 1
>                     inc_file_mapped
>                     rcu_read_lock
>   start move
>   synchronize_rcu
>                     lock_page_cgroup
>                       softirq
>                       test_clear_page_writeback
>                       mem_cgroup_dec_page_stat(NR_WRITEBACK)
>                       rcu_read_lock
>                       lock_page_cgroup   /* deadlock */
>                       unlock_page_cgroup
>                       rcu_read_unlock
>                     unlock_page_cgroup
>                     rcu_read_unlock
> 
> By disabling interrupts in lock_page_cgroup, nested calls
> are avoided.  The softirq would be delayed until after inc_file_mapped
> enables interrupts when calling unlock_page_cgroup().
> 
> The normal, fast path, of memcg page stat updates typically
> does not need to call lock_page_cgroup(), so this change does
> not affect the performance of the common case page accounting.
> 
> Signed-off-by: Andrea Righi <arighi@develer.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>
-- 


It will take more convincing; all the important functions (charge/uncharge)
use lock_page_cgroup(). I'd like to see the page fault scalability
test results. I am not against this patch, I just want to see the
scalability numbers.

	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 96+ messages in thread

* [PATCH v4] memcg: reduce lock time at move charge
  2010-10-12  3:56                                 ` Daisuke Nishimura
  2010-10-12  5:01                                   ` KAMEZAWA Hiroyuki
@ 2010-10-12  5:48                                   ` KAMEZAWA Hiroyuki
  2010-10-12  6:23                                     ` Daisuke Nishimura
  1 sibling, 1 reply; 96+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-12  5:48 UTC (permalink / raw)
  To: Daisuke Nishimura
  Cc: Andrew Morton, Minchan Kim, Greg Thelen, linux-kernel, linux-mm,
	containers, Andrea Righi, Balbir Singh

On Tue, 12 Oct 2010 12:56:13 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> > +err_out:
> > +	for (; mt < info + num; mt++)
> > +		if (mt->type == MC_TARGET_PAGE) {
> > +			putback_lru_page(mt->val.page);
> Is this putback_lru_page() necessary ?
> is_target_pte_for_mc() doesn't isolate the page.
> 
Ok, v4 here. Tested the failure path and the success path.

==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Presently, at task migration among cgroups, memory cgroup scans page tables and
moves accounting if flags are properly set.


The core code, mem_cgroup_move_charge_pte_range() does

 	pte_offset_map_lock();
	for all ptes in a page table:
		1. look into page table, find_and_get a page
		2. remove it from LRU.
		3. move charge.
		4. putback to LRU. put_page()
	pte_offset_map_unlock();

for pte entries on a 3rd level? page table.

As a planned update, we'll support dirty-page accounting. Because move_charge()
is highly racy, we need to add more checks in move_charge().
For example, lock_page() -> wait_on_page_writeback() -> unlock_page()
is a candidate for a new check.


This patch modifies the routine as follows:

	for 32 pages: pte_offset_map_lock()
		      find_and_get a page
		      record it
		      pte_offset_map_unlock()
	for all recorded pages
		      isolate it from LRU.
		      move charge
		      putback to LRU
		      put_page()
Code size change is:
(Before)
[kamezawa@bluextal mmotm-1008]$ size mm/memcontrol.o
   text    data     bss     dec     hex filename
  28247    7685    4100   40032    9c60 mm/memcontrol.o
(After)
[kamezawa@bluextal mmotm-1008]$ size mm/memcontrol.o
   text    data     bss     dec     hex filename
  28591    7685    4100   40376    9db8 mm/memcontrol.o

A simple benchmark score:

Moving a task with 2GB of anonymous memory between cgroup/A and cgroup/B.
 <===== marks a function called under pte_lock.
Before Patch.

real    0m42.346s
user    0m0.002s
sys     0m39.668s

    13.88%  swap_task.sh  [kernel.kallsyms]  [k] put_page	     <=====
    10.37%  swap_task.sh  [kernel.kallsyms]  [k] isolate_lru_page    <===== 
    10.25%  swap_task.sh  [kernel.kallsyms]  [k] is_target_pte_for_mc  <=====
     7.85%  swap_task.sh  [kernel.kallsyms]  [k] mem_cgroup_move_account <=====
     7.63%  swap_task.sh  [kernel.kallsyms]  [k] lookup_page_cgroup      <=====
     6.96%  swap_task.sh  [kernel.kallsyms]  [k] ____pagevec_lru_add
     6.43%  swap_task.sh  [kernel.kallsyms]  [k] mem_cgroup_del_lru_list
     6.31%  swap_task.sh  [kernel.kallsyms]  [k] putback_lru_page
     5.28%  swap_task.sh  [kernel.kallsyms]  [k] mem_cgroup_add_lru_list
     3.58%  swap_task.sh  [kernel.kallsyms]  [k] __lru_cache_add
     3.57%  swap_task.sh  [kernel.kallsyms]  [k] _raw_spin_lock_irq
     3.06%  swap_task.sh  [kernel.kallsyms]  [k] release_pages
     2.35%  swap_task.sh  [kernel.kallsyms]  [k] mem_cgroup_get_reclaim_stat_from_page
     2.31%  swap_task.sh  [kernel.kallsyms]  [k] mem_cgroup_move_charge_pte_range
     1.80%  swap_task.sh  [kernel.kallsyms]  [k] memcg_check_events
     1.59%  swap_task.sh  [kernel.kallsyms]  [k] page_evictable
     1.55%  swap_task.sh  [kernel.kallsyms]  [k] vm_normal_page
     1.53%  swap_task.sh  [kernel.kallsyms]  [k] mem_cgroup_charge_statistics

After patch:

real    0m43.440s
user    0m0.000s
sys     0m40.704s
    13.68%  swap_task.sh  [kernel.kallsyms]  [k] is_target_pte_for_mc <====
    13.29%  swap_task.sh  [kernel.kallsyms]  [k] put_page
    10.34%  swap_task.sh  [kernel.kallsyms]  [k] isolate_lru_page
     7.48%  swap_task.sh  [kernel.kallsyms]  [k] lookup_page_cgroup
     7.42%  swap_task.sh  [kernel.kallsyms]  [k] mem_cgroup_move_account
     6.98%  swap_task.sh  [kernel.kallsyms]  [k] ____pagevec_lru_add
     6.15%  swap_task.sh  [kernel.kallsyms]  [k] putback_lru_page
     5.46%  swap_task.sh  [kernel.kallsyms]  [k] mem_cgroup_add_lru_list
     5.00%  swap_task.sh  [kernel.kallsyms]  [k] mem_cgroup_del_lru_list
     3.38%  swap_task.sh  [kernel.kallsyms]  [k] _raw_spin_lock_irq
     3.31%  swap_task.sh  [kernel.kallsyms]  [k] __lru_cache_add
     3.02%  swap_task.sh  [kernel.kallsyms]  [k] release_pages
     2.24%  swap_task.sh  [kernel.kallsyms]  [k] mem_cgroup_get_reclaim_stat_from_page
     2.04%  swap_task.sh  [kernel.kallsyms]  [k] mem_cgroup_move_charge_pte_range
     1.84%  swap_task.sh  [kernel.kallsyms]  [k] memcg_check_events

I think this is not very bad. 

Changelog: v3->v4
 - fixed bug in error path (at resource shortage, putback is not necessary)

Changelog: v2->v3
 - rebased onto mmotm 1008
 - reduced the number of loops.
 - clean ups. reduced unnecessary switch, break, continue, goto.
 - added kzalloc again.

Changelog: v1->v2
 - removed kzalloc() of mc_target. preallocate it on "mc"

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |  127 ++++++++++++++++++++++++++++++++------------------------
 1 file changed, 74 insertions(+), 53 deletions(-)

Index: mmotm-1008/mm/memcontrol.c
===================================================================
--- mmotm-1008.orig/mm/memcontrol.c
+++ mmotm-1008/mm/memcontrol.c
@@ -276,6 +276,21 @@ enum move_type {
 	NR_MOVE_TYPE,
 };
 
+enum mc_target_type {
+	MC_TARGET_NONE, /* used as failure code(0) */
+	MC_TARGET_PAGE,
+	MC_TARGET_SWAP,
+};
+
+struct mc_target {
+	enum mc_target_type type;
+	union {
+		struct page *page;
+		swp_entry_t ent;
+	} val;
+};
+#define MC_MOVE_ONCE	(16)
+
 /* "mc" and its members are protected by cgroup_mutex */
 static struct move_charge_struct {
 	spinlock_t	  lock; /* for from, to, moving_task */
@@ -4479,16 +4494,7 @@ one_by_one:
  *
  * Called with pte lock held.
  */
-union mc_target {
-	struct page	*page;
-	swp_entry_t	ent;
-};
 
-enum mc_target_type {
-	MC_TARGET_NONE,	/* not used */
-	MC_TARGET_PAGE,
-	MC_TARGET_SWAP,
-};
 
 static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
 						unsigned long addr, pte_t ptent)
@@ -4565,7 +4571,7 @@ static struct page *mc_handle_file_pte(s
 }
 
 static int is_target_pte_for_mc(struct vm_area_struct *vma,
-		unsigned long addr, pte_t ptent, union mc_target *target)
+		unsigned long addr, pte_t ptent, struct mc_target *target)
 {
 	struct page *page = NULL;
 	struct page_cgroup *pc;
@@ -4590,8 +4596,10 @@ static int is_target_pte_for_mc(struct v
 		 */
 		if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
 			ret = MC_TARGET_PAGE;
-			if (target)
-				target->page = page;
+			if (target) {
+				target->val.page = page;
+				target->type = ret;
+			}
 		}
 		if (!ret || !target)
 			put_page(page);
@@ -4600,8 +4608,10 @@ static int is_target_pte_for_mc(struct v
 	if (ent.val && !ret &&
 			css_id(&mc.from->css) == lookup_swap_cgroup(ent)) {
 		ret = MC_TARGET_SWAP;
-		if (target)
-			target->ent = ent;
+		if (target) {
+			target->val.ent = ent;
+			target->type = ret;
+		}
 	}
 	return ret;
 }
@@ -4761,68 +4771,79 @@ static int mem_cgroup_move_charge_pte_ra
 {
 	int ret = 0;
 	struct vm_area_struct *vma = walk->private;
+	struct mc_target *info, *mt;
+	struct page_cgroup *pc;
 	pte_t *pte;
 	spinlock_t *ptl;
+	int num;
+
+	info = kzalloc(sizeof(struct mc_target) *MC_MOVE_ONCE, GFP_KERNEL);
+	if (!info)
+		return -ENOMEM;
 
 retry:
+	/*
+	 * We want to move accounting without taking pte_offset_map_lock(),
+	 * because "move" may need to wait for some event completion (in future).
+	 * In the 1st half, scan the page table and grab pages.  In the 2nd half,
+	 * remove them from the LRU and overwrite page_cgroup's information.
+	 */
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
-	for (; addr != end; addr += PAGE_SIZE) {
+	for (num = 0; num < MC_MOVE_ONCE && addr != end; addr += PAGE_SIZE) {
 		pte_t ptent = *(pte++);
-		union mc_target target;
-		int type;
-		struct page *page;
-		struct page_cgroup *pc;
-		swp_entry_t ent;
+		ret = is_target_pte_for_mc(vma, addr, ptent, info + num);
+		if (!ret)
+			continue;
+		num++;
+	}
+	pte_unmap_unlock(pte - 1, ptl);
+	cond_resched();
 
-		if (!mc.precharge)
-			break;
+	mt = info;
 
-		type = is_target_pte_for_mc(vma, addr, ptent, &target);
-		switch (type) {
+	while (mc.precharge < num) {
+		ret = mem_cgroup_do_precharge(1);
+		if (ret)
+			goto err_out;
+	}
+
+	for (ret = 0; mt < info + num; mt++) {
+		switch (mt->type) {
 		case MC_TARGET_PAGE:
-			page = target.page;
-			if (isolate_lru_page(page))
-				goto put;
-			pc = lookup_page_cgroup(page);
-			if (!mem_cgroup_move_account(pc,
+			if (!isolate_lru_page(mt->val.page)) {
+				pc = lookup_page_cgroup(mt->val.page);
+				if (!mem_cgroup_move_account(pc,
 						mc.from, mc.to, false)) {
-				mc.precharge--;
-				/* we uncharge from mc.from later. */
-				mc.moved_charge++;
+					mc.precharge--;
+					/* we uncharge from mc.from later. */
+					mc.moved_charge++;
+				}
+				putback_lru_page(mt->val.page);
 			}
-			putback_lru_page(page);
-put:			/* is_target_pte_for_mc() gets the page */
-			put_page(page);
+			put_page(mt->val.page);
 			break;
 		case MC_TARGET_SWAP:
-			ent = target.ent;
-			if (!mem_cgroup_move_swap_account(ent,
+			if (!mem_cgroup_move_swap_account(mt->val.ent,
 						mc.from, mc.to, false)) {
-				mc.precharge--;
 				/* we fixup refcnts and charges later. */
+				mc.precharge--;
 				mc.moved_swap++;
 			}
-			break;
 		default:
 			break;
 		}
 	}
-	pte_unmap_unlock(pte - 1, ptl);
-	cond_resched();
-
-	if (addr != end) {
-		/*
-		 * We have consumed all precharges we got in can_attach().
-		 * We try charge one by one, but don't do any additional
-		 * charges to mc.to if we have failed in charge once in attach()
-		 * phase.
-		 */
-		ret = mem_cgroup_do_precharge(1);
-		if (!ret)
-			goto retry;
-	}
 
+	if (addr != end)
+		goto retry;
+out:
+	kfree(info);
 	return ret;
+err_out:
+	for (; mt < info + num; mt++)
+		if (mt->type == MC_TARGET_PAGE)
+			put_page(mt->val.page);
+	goto out;
 }
 
 static void mem_cgroup_move_charge(struct mm_struct *mm)


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v4] memcg: reduce lock time at move charge
  2010-10-12  5:48                                   ` [PATCH v4] memcg: reduce lock time at move charge KAMEZAWA Hiroyuki
@ 2010-10-12  6:23                                     ` Daisuke Nishimura
  0 siblings, 0 replies; 96+ messages in thread
From: Daisuke Nishimura @ 2010-10-12  6:23 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Minchan Kim, Greg Thelen, linux-kernel, linux-mm,
	containers, Andrea Righi, Balbir Singh, Daisuke Nishimura

On Tue, 12 Oct 2010 14:48:01 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Tue, 12 Oct 2010 12:56:13 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> 
> > > +err_out:
> > > +	for (; mt < info + num; mt++)
> > > +		if (mt->type == MC_TARGET_PAGE) {
> > > +			putback_lru_page(mt->val.page);
> > Is this putback_lru_page() necessary ?
> > is_target_pte_for_mc() doesn't isolate the page.
> > 
> Ok, v4 here. tested failure path and success path.
> 
Looks good to me.

	Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 07/10] memcg: add dirty limits to mem_cgroup
  2010-10-12  0:55               ` KAMEZAWA Hiroyuki
@ 2010-10-12  7:32                 ` Greg Thelen
  2010-10-12  8:38                   ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 96+ messages in thread
From: Greg Thelen @ 2010-10-12  7:32 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrea Righi, Andrew Morton, linux-kernel, linux-mm, containers,
	Balbir Singh, Daisuke Nishimura

KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:

> On Mon, 11 Oct 2010 17:24:21 -0700
> Greg Thelen <gthelen@google.com> wrote:
>
>> >> Is your motivation to increase performance with the same functionality?
>> >> If so, then would a 'static inline' be performance equivalent to a
>> >> preprocessor macro yet be safer to use?
>> >> 
>> > Ah, if lockdep finds this as a bug, I think other parts will hit this,
>> > too, like this:
>> >> static struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
>> >> {
>> >>         struct mem_cgroup *mem = NULL;
>> >> 
>> >>         if (!mm)
>> >>                 return NULL;
>> >>         /*
>> >>          * Because we have no locks, mm->owner's may be being moved to other
>> >>          * cgroup. We use css_tryget() here even if this looks
>> >>          * pessimistic (rather than adding locks here).
>> >>          */
>> >>         rcu_read_lock();
>> >>         do {
>> >>                 mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
>> >>                 if (unlikely(!mem))
>> >>                         break;
>> >>         } while (!css_tryget(&mem->css));
>> >>         rcu_read_unlock();
>> >>         return mem;
>> >> }
>> 
>> mem_cgroup_from_task() calls task_subsys_state() calls
>> task_subsys_state_check().  task_subsys_state_check() will be happy if
>> rcu_read_lock is held.
>> 
> yes.
>
>> I don't think that this will fail lockdep, because rcu_read_lock_held()
>> is true when calling mem_cgroup_from_task() within
>> try_get_mem_cgroup_from_mm()..
>> 
> agreed.
>
>> > mem_cgroup_from_task() is designed to be used like this.
>> > If defined as a macro, I think it will not be caught.
>> 
>> I do not understand how making mem_cgroup_from_task() a macro will
>> change its behavior wrt. to lockdep assertion checking.  I assume that
>> as a macro mem_cgroup_from_task() would still call task_subsys_state(),
>> which requires either:
>> a) rcu read lock held
>> b) task->alloc_lock held
>> c) cgroup lock held
>> 
>
> Hmm. Maybe I was wrong.
>
>> 
>> >> Maybe it makes more sense to find a way to perform this check in
>> >> mem_cgroup_has_dirty_limit() without needing to grab the rcu lock.  I
>> >> think this lock grab is unneeded.  I am still collecting performance
>> >> data, but suspect that this may be making the code slower than it needs
>> >> to be.
>> >> 
>> >
>> > Hmm. css_set[] itself is freed by RCU..what idea to remove rcu_read_lock() do
>> > you have ? Adding some flags ?
>> 
>> It seems like a shame to need a lock to determine if current is in the
>> root cgroup.  Especially given that as soon as
>> mem_cgroup_has_dirty_limit() returns, the task could be moved
>> in-to/out-of the root cgroup, thereby invalidating the answer.  So the
>> answer is just a sample that may be wrong. 
>
> Yes. But it's not a bug but a specification.
>
>> But I think you are correct.
>> We will need the rcu read lock in mem_cgroup_has_dirty_limit().
>> 
>
> yes.
>
>
>> > Ah...I noticed that you should do
>> >
>> >  mem = mem_cgroup_from_task(current->mm->owner);
>> >
>> > to check has_dirty_limit...
>> 
>> What are the cases where current->mm->owner->cgroups !=
>> current->cgroups?
>> 
> In that case, assume group A and B.
>
>    thread(1) -> belongs to cgroup A  (thread(1) is mm->owner)
>    thread(2) -> belongs to cgroup B
> and
>    a page    -> charged to cgroup A
>
> Then, thread(2) makes the page dirty, which is under cgroup A.
>
> In this case, if the page's dirty_pages accounting is added to cgroup B,
> cgroup B's statistics may show "dirty_pages > all_lru_pages". This is
> a bug.

I agree that in this case the dirty_pages accounting should be added to
cgroup A because that is where the page was charged.  This will happen
because pc->mem_cgroup was set to A when the page was charged.  The
mark-page-dirty code will check pc->mem_cgroup to determine which cgroup
to add the dirty page to.

I think that the current vs current->mm->owner decision is in areas of
the code that is used to query the dirty limits.  These routines do not
use this data to determine which cgroup to charge for dirty pages.  The
usage of either mem_cgroup_from_task(current->mm->owner) or
mem_cgroup_from_task(current) in mem_cgroup_has_dirty_limit() does not
determine which cgroup is added for dirty_pages.
mem_cgroup_has_dirty_limit() is only used to determine if the process
has a dirty limit.  As discussed, this is a momentary answer that may be
wrong by the time decisions are made because the task may be migrated
in-to/out-of root cgroup while mem_cgroup_has_dirty_limit() runs.  If
the process has a dirty limit, then the process's memcg is used to
compute dirty limits.  Using your example, I assume that thread(1) and
thread(2) will get dirty limits from cgroup(A) and cgroup(B)
respectively.
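
To spell out the distinction with a quick sketch (the helper names are
made up here for illustration; this is not code from the series):

/* Which memcg is charged for a dirty page?  Follow the page. */
static struct mem_cgroup *memcg_accounting_dirty_page(struct page *page)
{
	struct page_cgroup *pc = lookup_page_cgroup(page);

	/* races with move_account ignored here; see the patch 04 thread */
	return pc ? pc->mem_cgroup : NULL;	/* cgroup A in your example */
}

/* Whose dirty limit throttles the writer?  Follow the task. */
static struct mem_cgroup *memcg_limiting_writer(void)
{
	/* caller must hold rcu_read_lock() */
	return mem_cgroup_from_task(current);	/* cgroup B for thread(2) */
}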

Are you thinking that when accounting for a dirty page (by incrementing
pc->mem_cgroup->stat->count[MEM_CGROUP_STAT_FILE_DIRTY]) we should
check the pc->mem_cgroup dirty limit?

>> I was hoping to avoid having to add even more logic into
>> mem_cgroup_has_dirty_limit() to handle the case where current->mm is
>> NULL.
>> 
>
> Please check current->mm. We can't limit the work of kernel threads this way; let's
> consider it later if necessary.
>
>> Presumably the newly proposed vm_dirty_param(),
>> mem_cgroup_has_dirty_limit(), and mem_cgroup_page_stat() routines all
>> need to use the same logic.  I assume they should all be consistently
>> using current->mm->owner or current.
>> 
>
> please.
>
> Thanks,
> -Kame

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 07/10] memcg: add dirty limits to mem_cgroup
  2010-10-12  7:32                 ` Greg Thelen
@ 2010-10-12  8:38                   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 96+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-12  8:38 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrea Righi, Andrew Morton, linux-kernel, linux-mm, containers,
	Balbir Singh, Daisuke Nishimura

On Tue, 12 Oct 2010 00:32:33 -0700
Greg Thelen <gthelen@google.com> wrote:

> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:

> >> What are the cases where current->mm->owner->cgroups !=
> >> current->cgroups?
> >> 
> > In that case, assume group A and B.
> >
> >    thread(1) -> belongs to cgroup A  (thread(1) is mm->owner)
> >    thread(2) -> belongs to cgroup B
> > and
> >    a page    -> charged to cgroup A
> >
> > Then, thread(2) makes the page dirty, which is under cgroup A.
> >
> > In this case, if the page's dirty_pages accounting is added to cgroup B,
> > cgroup B's statistics may show "dirty_pages > all_lru_pages". This is
> > a bug.
> 
> I agree that in this case the dirty_pages accounting should be added to
> cgroup A because that is where the page was charged.  This will happen
> because pc->mem_cgroup was set to A when the page was charged.  The
> mark-page-dirty code will check pc->mem_cgroup to determine which cgroup
> to add the dirty page to.
> 
> I think that the current vs current->mm->owner decision is in areas of
> the code that is used to query the dirty limits.  These routines do not
> use this data to determine which cgroup to charge for dirty pages.  The
> usage of either mem_cgroup_from_task(current->mm->owner) or
> mem_cgroup_from_task(current) in mem_cgroup_has_dirty_limit() does not
> determine which cgroup is added for dirty_pages.
> mem_cgroup_has_dirty_limit() is only used to determine if the process
> has a dirty limit.  As discussed, this is a momentary answer that may be
> wrong by the time decisions are made because the task may be migrated
> in-to/out-of root cgroup while mem_cgroup_has_dirty_limit() runs.  If
> the process has a dirty limit, then the process's memcg is used to
> compute dirty limits.  Using your example, I assume that thread(1) and
> thread(2) will get dirty limits from cgroup(A) and cgroup(B)
> respectively.
> 

Ok, thank you for the clarification. Throttling a thread based on its own
cgroup, not on mm->owner, makes sense. Could you add a brief comment in
the code?
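
For example, something along these lines would be enough (the wording is
only a suggestion, not from a posted patch):

	/*
	 * Dirty limits are evaluated against the memcg of "current", not
	 * of current->mm->owner: a dirtier is throttled by its own
	 * cgroup's limit.  Which memcg a dirty page is accounted to is
	 * decided separately, from pc->mem_cgroup at mark-dirty time.
	 */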

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 00/10] memcg: per cgroup dirty page accounting
  2010-10-04  6:57 [PATCH 00/10] memcg: per cgroup dirty page accounting Greg Thelen
                   ` (13 preceding siblings ...)
  2010-10-06  3:23 ` Balbir Singh
@ 2010-10-18  5:56 ` KAMEZAWA Hiroyuki
  2010-10-18 18:09   ` Greg Thelen
  14 siblings, 1 reply; 96+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-10-18  5:56 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura

On Sun,  3 Oct 2010 23:57:55 -0700
Greg Thelen <gthelen@google.com> wrote:

> Greg Thelen (10):
>   memcg: add page_cgroup flags for dirty page tracking
>   memcg: document cgroup dirty memory interfaces
>   memcg: create extensible page stat update routines
>   memcg: disable local interrupts in lock_page_cgroup()
>   memcg: add dirty page accounting infrastructure
>   memcg: add kernel calls for memcg dirty page stats
>   memcg: add dirty limits to mem_cgroup
>   memcg: add cgroupfs interface to memcg dirty limits
>   writeback: make determine_dirtyable_memory() static.
>   memcg: check memcg dirty limits in page writeback

Greg, this is a patch on your set.

 mmotm-1014 
 - memcg-reduce-lock-hold-time-during-charge-moving.patch
   (I asked Andrew to drop this)
 + your 1,2,3,5,6,7,8,9,10 (dropped patch "4")

I'd be glad if you merge this into your set as a replacement for "4".
I'll prepare a performance improvement patch and post it if these dirty_limit
patches go to -mm.

Thank you for your work.

==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Now, in supporting dirty limits, there is a deadlock problem in accounting.

 1. If pages are being migrated from a memcg, then updates to that
memcg's page statistics are protected by grabbing a bit spin lock
using lock_page_cgroup().  The recent dirty page accounting changes
update memcg page accounting (specifically: the number of writeback pages)
from IRQ context (softirq).  Avoid a deadlocking nested spin lock attempt
by IRQ on the local processor when grabbing the page_cgroup lock.

 2. lock for update_stat is used only for avoiding race with move_account().
So, IRQ awareness of lock_page_cgroup() itself is not a problem. The problem
is in update_stat() and move_account().

Then, this reworks locking scheme of update_stat() and move_account() by
adding new lock bit PCG_MOVE_LOCK, which is always taken under IRQ disable.

Trade-off
  * using lock_page_cgroup() + disable IRQ has some impacts on performance
    and I think it's bad to disable IRQ when it's not necessary.
  * adding a new lock makes move_account() slow. Score is here.

Performance impact: moving an 8G anon process.

Before:
	real    0m0.792s
	user    0m0.000s
	sys     0m0.780s

After:
	real    0m0.854s
	user    0m0.000s
	sys     0m0.842s

This score is bad but planned patches for optimization can reduce
this impact.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/page_cgroup.h |   31 ++++++++++++++++++++++++++++---
 mm/memcontrol.c             |    9 +++++++--
 2 files changed, 35 insertions(+), 5 deletions(-)

Index: dirty_limit_new/include/linux/page_cgroup.h
===================================================================
--- dirty_limit_new.orig/include/linux/page_cgroup.h
+++ dirty_limit_new/include/linux/page_cgroup.h
@@ -35,15 +35,18 @@ struct page_cgroup *lookup_page_cgroup(s
 
 enum {
 	/* flags for mem_cgroup */
-	PCG_LOCK,  /* page cgroup is locked */
+	PCG_LOCK,  /* Lock for pc->mem_cgroup and following bits. */
 	PCG_CACHE, /* charged as cache */
 	PCG_USED, /* this object is in use. */
-	PCG_ACCT_LRU, /* page has been accounted for */
+	PCG_MIGRATION, /* under page migration */
+	/* flags for mem_cgroup and file and I/O status */
+	PCG_MOVE_LOCK, /* For race between move_account v.s. following bits */
 	PCG_FILE_MAPPED, /* page is accounted as "mapped" */
 	PCG_FILE_DIRTY, /* page is dirty */
 	PCG_FILE_WRITEBACK, /* page is under writeback */
 	PCG_FILE_UNSTABLE_NFS, /* page is NFS unstable */
-	PCG_MIGRATION, /* under page migration */
+	/* No lock in page_cgroup */
+	PCG_ACCT_LRU, /* page has been accounted for (under lru_lock) */
 };
 
 #define TESTPCGFLAG(uname, lname)			\
@@ -119,6 +122,10 @@ static inline enum zone_type page_cgroup
 
 static inline void lock_page_cgroup(struct page_cgroup *pc)
 {
+	/*
+	 * Don't take this lock in IRQ context.
+	 * This lock is for pc->mem_cgroup, USED, CACHE, MIGRATION
+	 */
 	bit_spin_lock(PCG_LOCK, &pc->flags);
 }
 
@@ -127,6 +134,24 @@ static inline void unlock_page_cgroup(st
 	bit_spin_unlock(PCG_LOCK, &pc->flags);
 }
 
+static inline void move_lock_page_cgroup(struct page_cgroup *pc,
+	unsigned long *flags)
+{
+	/*
+	 * We know updates to pc->flags of page cache's stats are from both of
+	 * usual context or IRQ context. Disable IRQ to avoid deadlock.
+	 */
+	local_irq_save(*flags);
+	bit_spin_lock(PCG_MOVE_LOCK, &pc->flags);
+}
+
+static inline void move_unlock_page_cgroup(struct page_cgroup *pc,
+	unsigned long *flags)
+{
+	bit_spin_unlock(PCG_MOVE_LOCK, &pc->flags);
+	local_irq_restore(*flags);
+}
+
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct page_cgroup;
 
Index: dirty_limit_new/mm/memcontrol.c
===================================================================
--- dirty_limit_new.orig/mm/memcontrol.c
+++ dirty_limit_new/mm/memcontrol.c
@@ -1784,6 +1784,7 @@ void mem_cgroup_update_page_stat(struct 
 	struct mem_cgroup *mem;
 	struct page_cgroup *pc = lookup_page_cgroup(page);
 	bool need_unlock = false;
+	unsigned long uninitialized_var(flags);
 
 	if (unlikely(!pc))
 		return;
@@ -1795,7 +1796,7 @@ void mem_cgroup_update_page_stat(struct 
 	/* pc->mem_cgroup is unstable ? */
 	if (unlikely(mem_cgroup_stealed(mem))) {
 		/* take a lock against to access pc->mem_cgroup */
-		lock_page_cgroup(pc);
+		move_lock_page_cgroup(pc, &flags);
 		need_unlock = true;
 		mem = pc->mem_cgroup;
 		if (!mem || !PageCgroupUsed(pc))
@@ -1856,7 +1857,7 @@ void mem_cgroup_update_page_stat(struct 
 
 out:
 	if (unlikely(need_unlock))
-		unlock_page_cgroup(pc);
+		move_unlock_page_cgroup(pc, &flags);
 	rcu_read_unlock();
 	return;
 }
@@ -2426,9 +2427,13 @@ static int mem_cgroup_move_account(struc
 		struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
 {
 	int ret = -EINVAL;
+	unsigned long flags;
+
 	lock_page_cgroup(pc);
 	if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
+		move_lock_page_cgroup(pc, &flags);
 		__mem_cgroup_move_account(pc, from, to, uncharge);
+		move_unlock_page_cgroup(pc, &flags);
 		ret = 0;
 	}
 	unlock_page_cgroup(pc);


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH 00/10] memcg: per cgroup dirty page accounting
  2010-10-18  5:56 ` KAMEZAWA Hiroyuki
@ 2010-10-18 18:09   ` Greg Thelen
  0 siblings, 0 replies; 96+ messages in thread
From: Greg Thelen @ 2010-10-18 18:09 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, linux-kernel, linux-mm, containers, Andrea Righi,
	Balbir Singh, Daisuke Nishimura

KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> writes:

> On Sun,  3 Oct 2010 23:57:55 -0700
> Greg Thelen <gthelen@google.com> wrote:
>
>> Greg Thelen (10):
>>   memcg: add page_cgroup flags for dirty page tracking
>>   memcg: document cgroup dirty memory interfaces
>>   memcg: create extensible page stat update routines
>>   memcg: disable local interrupts in lock_page_cgroup()
>>   memcg: add dirty page accounting infrastructure
>>   memcg: add kernel calls for memcg dirty page stats
>>   memcg: add dirty limits to mem_cgroup
>>   memcg: add cgroupfs interface to memcg dirty limits
>>   writeback: make determine_dirtyable_memory() static.
>>   memcg: check memcg dirty limits in page writeback
>
> Greg, this is a patch on your set.
>
>  mmotm-1014 
>  - memcg-reduce-lock-hold-time-during-charge-moving.patch
>    (I asked Andrew to drop this)
>  + your 1,2,3,5,6,7,8,9,10 (dropped patch "4")
>
> I'd be glad if you merge this into your set as a replacement for "4".
> I'll prepare a performance improvement patch and post it if these dirty_limit
> patches go to -mm.

Thanks for the patch.  I will merge your patch (below) as a replacement
of memcg dirty limits patch #4 and repost the entire series.

> Thank you for your work.
>
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> Now, in supporting dirty limits, there is a deadlock problem in accounting.
>
>  1. If pages are being migrated from a memcg, then updates to that
> memcg's page statistics are protected by grabbing a bit spin lock
> using lock_page_cgroup().  The recent dirty page accounting changes
> update memcg page accounting (specifically: the number of writeback pages)
> from IRQ context (softirq).  Avoid a deadlocking nested spin lock attempt
> by IRQ on the local processor when grabbing the page_cgroup lock.
>
>  2. lock for update_stat is used only for avoiding race with move_account().
> So, IRQ awareness of lock_page_cgroup() itself is not a problem. The problem
> is in update_stat() and move_account().
>
> Then, this reworks locking scheme of update_stat() and move_account() by
> adding new lock bit PCG_MOVE_LOCK, which is always taken under IRQ disable.
>
> Trade-off
>   * using lock_page_cgroup() + disable IRQ has some impacts on performance
>     and I think it's bad to disable IRQ when it's not necessary.
>   * adding a new lock makes move_account() slow. Score is here.
>
> Performance impact: moving an 8G anon process.
>
> Before:
> 	real    0m0.792s
> 	user    0m0.000s
> 	sys     0m0.780s
>
> After:
> 	real    0m0.854s
> 	user    0m0.000s
> 	sys     0m0.842s
>
> This score is bad but planned patches for optimization can reduce
> this impact.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/page_cgroup.h |   31 ++++++++++++++++++++++++++++---
>  mm/memcontrol.c             |    9 +++++++--
>  2 files changed, 35 insertions(+), 5 deletions(-)
>
> Index: dirty_limit_new/include/linux/page_cgroup.h
> ===================================================================
> --- dirty_limit_new.orig/include/linux/page_cgroup.h
> +++ dirty_limit_new/include/linux/page_cgroup.h
> @@ -35,15 +35,18 @@ struct page_cgroup *lookup_page_cgroup(s
>  
>  enum {
>  	/* flags for mem_cgroup */
> -	PCG_LOCK,  /* page cgroup is locked */
> +	PCG_LOCK,  /* Lock for pc->mem_cgroup and following bits. */
>  	PCG_CACHE, /* charged as cache */
>  	PCG_USED, /* this object is in use. */
> -	PCG_ACCT_LRU, /* page has been accounted for */
> +	PCG_MIGRATION, /* under page migration */
> +	/* flags for mem_cgroup and file and I/O status */
> +	PCG_MOVE_LOCK, /* For race between move_account v.s. following bits */
>  	PCG_FILE_MAPPED, /* page is accounted as "mapped" */
>  	PCG_FILE_DIRTY, /* page is dirty */
>  	PCG_FILE_WRITEBACK, /* page is under writeback */
>  	PCG_FILE_UNSTABLE_NFS, /* page is NFS unstable */
> -	PCG_MIGRATION, /* under page migration */
> +	/* No lock in page_cgroup */
> +	PCG_ACCT_LRU, /* page has been accounted for (under lru_lock) */
>  };
>  
>  #define TESTPCGFLAG(uname, lname)			\
> @@ -119,6 +122,10 @@ static inline enum zone_type page_cgroup
>  
>  static inline void lock_page_cgroup(struct page_cgroup *pc)
>  {
> +	/*
> +	 * Don't take this lock in IRQ context.
> +	 * This lock is for pc->mem_cgroup, USED, CACHE, MIGRATION
> +	 */
>  	bit_spin_lock(PCG_LOCK, &pc->flags);
>  }
>  
> @@ -127,6 +134,24 @@ static inline void unlock_page_cgroup(st
>  	bit_spin_unlock(PCG_LOCK, &pc->flags);
>  }
>  
> +static inline void move_lock_page_cgroup(struct page_cgroup *pc,
> +	unsigned long *flags)
> +{
> +	/*
> +	 * We know updates to pc->flags of page cache's stats are from both of
> +	 * usual context or IRQ context. Disable IRQ to avoid deadlock.
> +	 */
> +	local_irq_save(*flags);
> +	bit_spin_lock(PCG_MOVE_LOCK, &pc->flags);
> +}
> +
> +static inline void move_unlock_page_cgroup(struct page_cgroup *pc,
> +	unsigned long *flags)
> +{
> +	bit_spin_unlock(PCG_MOVE_LOCK, &pc->flags);
> +	local_irq_restore(*flags);
> +}
> +
>  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
>  struct page_cgroup;
>  
> Index: dirty_limit_new/mm/memcontrol.c
> ===================================================================
> --- dirty_limit_new.orig/mm/memcontrol.c
> +++ dirty_limit_new/mm/memcontrol.c
> @@ -1784,6 +1784,7 @@ void mem_cgroup_update_page_stat(struct 
>  	struct mem_cgroup *mem;
>  	struct page_cgroup *pc = lookup_page_cgroup(page);
>  	bool need_unlock = false;
> +	unsigned long uninitialized_var(flags);
>  
>  	if (unlikely(!pc))
>  		return;
> @@ -1795,7 +1796,7 @@ void mem_cgroup_update_page_stat(struct 
>  	/* pc->mem_cgroup is unstable ? */
>  	if (unlikely(mem_cgroup_stealed(mem))) {
>  		/* take a lock against to access pc->mem_cgroup */
> -		lock_page_cgroup(pc);
> +		move_lock_page_cgroup(pc, &flags);
>  		need_unlock = true;
>  		mem = pc->mem_cgroup;
>  		if (!mem || !PageCgroupUsed(pc))
> @@ -1856,7 +1857,7 @@ void mem_cgroup_update_page_stat(struct 
>  
>  out:
>  	if (unlikely(need_unlock))
> -		unlock_page_cgroup(pc);
> +		move_unlock_page_cgroup(pc, &flags);
>  	rcu_read_unlock();
>  	return;
>  }
> @@ -2426,9 +2427,13 @@ static int mem_cgroup_move_account(struc
>  		struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
>  {
>  	int ret = -EINVAL;
> +	unsigned long flags;
> +
>  	lock_page_cgroup(pc);
>  	if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
> +		move_lock_page_cgroup(pc, &flags);
>  		__mem_cgroup_move_account(pc, from, to, uncharge);
> +		move_unlock_page_cgroup(pc, &flags);
>  		ret = 0;
>  	}
>  	unlock_page_cgroup(pc);

^ permalink raw reply	[flat|nested] 96+ messages in thread

end of thread, other threads:[~2010-10-18 18:10 UTC | newest]

Thread overview: 96+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-10-04  6:57 [PATCH 00/10] memcg: per cgroup dirty page accounting Greg Thelen
2010-10-04  6:57 ` [PATCH 01/10] memcg: add page_cgroup flags for dirty page tracking Greg Thelen
2010-10-05  6:20   ` KAMEZAWA Hiroyuki
2010-10-06  0:37   ` Daisuke Nishimura
2010-10-06 11:07   ` Balbir Singh
2010-10-04  6:57 ` [PATCH 02/10] memcg: document cgroup dirty memory interfaces Greg Thelen
2010-10-05  6:48   ` KAMEZAWA Hiroyuki
2010-10-06  0:49   ` Daisuke Nishimura
2010-10-06 11:12   ` Balbir Singh
2010-10-04  6:57 ` [PATCH 03/10] memcg: create extensible page stat update routines Greg Thelen
2010-10-04 13:48   ` Ciju Rajan K
2010-10-04 15:43     ` Greg Thelen
2010-10-04 17:35       ` Ciju Rajan K
2010-10-05  6:51   ` KAMEZAWA Hiroyuki
2010-10-05  7:10     ` Greg Thelen
2010-10-05 15:42   ` Minchan Kim
2010-10-05 19:59     ` Greg Thelen
2010-10-05 23:57       ` Minchan Kim
2010-10-06  0:48         ` Greg Thelen
2010-10-06 16:19   ` Balbir Singh
2010-10-04  6:57 ` [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup() Greg Thelen
2010-10-05  6:54   ` KAMEZAWA Hiroyuki
2010-10-05  7:18     ` Greg Thelen
2010-10-05 16:03   ` Minchan Kim
2010-10-05 23:26     ` Greg Thelen
2010-10-06  0:15       ` Minchan Kim
2010-10-07  0:35         ` KAMEZAWA Hiroyuki
2010-10-07  1:54           ` Daisuke Nishimura
2010-10-07  2:17             ` KAMEZAWA Hiroyuki
2010-10-07  6:21               ` [PATCH] memcg: reduce lock time at move charge (Was " KAMEZAWA Hiroyuki
2010-10-07  6:24                 ` [PATCH] memcg: lock-free clear page writeback " KAMEZAWA Hiroyuki
2010-10-07  9:05                   ` KAMEZAWA Hiroyuki
2010-10-07 23:35                   ` Minchan Kim
2010-10-08  4:41                     ` KAMEZAWA Hiroyuki
2010-10-07  7:28                 ` [PATCH] memcg: reduce lock time at move charge " Daisuke Nishimura
2010-10-07  7:42                   ` KAMEZAWA Hiroyuki
2010-10-07  8:04                     ` [PATCH v2] " KAMEZAWA Hiroyuki
2010-10-07 23:14                       ` Andrew Morton
2010-10-08  1:12                         ` Daisuke Nishimura
2010-10-08  4:37                         ` KAMEZAWA Hiroyuki
2010-10-08  4:55                           ` Andrew Morton
2010-10-08  5:12                             ` KAMEZAWA Hiroyuki
2010-10-08 10:41                               ` KAMEZAWA Hiroyuki
2010-10-12  3:39                                 ` Balbir Singh
2010-10-12  3:42                                   ` KAMEZAWA Hiroyuki
2010-10-12  3:54                                     ` Balbir Singh
2010-10-12  3:56                                 ` Daisuke Nishimura
2010-10-12  5:01                                   ` KAMEZAWA Hiroyuki
2010-10-12  5:48                                   ` [PATCH v4] memcg: reduce lock time at move charge KAMEZAWA Hiroyuki
2010-10-12  6:23                                     ` Daisuke Nishimura
2010-10-12  5:39   ` [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup() Balbir Singh
2010-10-04  6:58 ` [PATCH 05/10] memcg: add dirty page accounting infrastructure Greg Thelen
2010-10-05  7:22   ` KAMEZAWA Hiroyuki
2010-10-05  7:35     ` Greg Thelen
2010-10-05 16:09   ` Minchan Kim
2010-10-05 20:06     ` Greg Thelen
2010-10-04  6:58 ` [PATCH 06/10] memcg: add kernel calls for memcg dirty page stats Greg Thelen
2010-10-05  6:55   ` KAMEZAWA Hiroyuki
2010-10-04  6:58 ` [PATCH 07/10] memcg: add dirty limits to mem_cgroup Greg Thelen
2010-10-05  7:07   ` KAMEZAWA Hiroyuki
2010-10-05  9:43   ` Andrea Righi
2010-10-05 19:00     ` Greg Thelen
2010-10-07  0:13       ` KAMEZAWA Hiroyuki
2010-10-07  0:27         ` Greg Thelen
2010-10-07  0:48           ` KAMEZAWA Hiroyuki
2010-10-12  0:24             ` Greg Thelen
2010-10-12  0:55               ` KAMEZAWA Hiroyuki
2010-10-12  7:32                 ` Greg Thelen
2010-10-12  8:38                   ` KAMEZAWA Hiroyuki
2010-10-04  6:58 ` [PATCH 08/10] memcg: add cgroupfs interface to memcg dirty limits Greg Thelen
2010-10-05  7:13   ` KAMEZAWA Hiroyuki
2010-10-05  7:33     ` Greg Thelen
2010-10-05  7:31       ` KAMEZAWA Hiroyuki
2010-10-05  9:18       ` Andrea Righi
2010-10-05 18:31         ` David Rientjes
2010-10-06 18:34         ` Greg Thelen
2010-10-06 20:54           ` Andrea Righi
2010-10-06 13:30   ` Balbir Singh
2010-10-06 13:32     ` Balbir Singh
2010-10-06 16:21       ` Greg Thelen
2010-10-06 16:24         ` Balbir Singh
2010-10-07  6:23   ` Ciju Rajan K
2010-10-07 17:46     ` Greg Thelen
2010-10-04  6:58 ` [PATCH 09/10] writeback: make determine_dirtyable_memory() static Greg Thelen
2010-10-05  7:15   ` KAMEZAWA Hiroyuki
2010-10-04  6:58 ` [PATCH 10/10] memcg: check memcg dirty limits in page writeback Greg Thelen
2010-10-05  7:29   ` KAMEZAWA Hiroyuki
2010-10-06  0:32   ` Minchan Kim
2010-10-05  4:20 ` [PATCH 00/10] memcg: per cgroup dirty page accounting Balbir Singh
2010-10-05  4:50 ` Balbir Singh
2010-10-05  5:50   ` Greg Thelen
2010-10-05  8:37     ` Ciju Rajan K
2010-10-05 22:15 ` Andrea Righi
2010-10-06  3:23 ` Balbir Singh
2010-10-18  5:56 ` KAMEZAWA Hiroyuki
2010-10-18 18:09   ` Greg Thelen
