* [PATCH 0/6 v4] memcg: page cgroup diet
@ 2012-02-14  3:04 KAMEZAWA Hiroyuki
  2012-02-14  3:06 ` [PATCH 1/6 v4] memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat) KAMEZAWA Hiroyuki
                   ` (5 more replies)
  0 siblings, 6 replies; 16+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-02-14  3:04 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, hannes, Michal Hocko, akpm, Ying Han, Hugh Dickins


Here is v4. I dropped the RFC tag and fixed a fatal bug in v3.

This patch series is a preparation for removing page_cgroup->flags. To remove
it, we need to reduce the number of flags on page_cgroup. In this set,
PCG_MOVE_LOCK and PCG_FILE_MAPPED are removed.

After this, we only have 3 bits.
==
enum {
        /* flags for mem_cgroup */
        PCG_LOCK,  /* Lock for pc->mem_cgroup and following bits. */
        PCG_USED, /* this object is in use. */
        PCG_MIGRATION, /* under page migration */
        __NR_PCG_FLAGS,
};
==

I'll try to merge these 3 bits into the lower 3 bits of the pointer to
memcg. Then we can make page_cgroup 8 (4 on 32-bit) bytes per page.
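
The idea would look roughly like this (a sketch only, not part of this
series; the 'val' field and the helper names are made up):
==
struct page_cgroup {
	unsigned long val;	/* mem_cgroup pointer | 3 flag bits */
};

static inline struct mem_cgroup *pc_memcg(struct page_cgroup *pc)
{
	/*
	 * struct mem_cgroup is at least 8-byte aligned, so the low 3
	 * bits of the pointer are free for PCG_LOCK, PCG_USED and
	 * PCG_MIGRATION.
	 */
	return (struct mem_cgroup *)(pc->val & ~0x7UL);
}

static inline int pc_test_flag(struct page_cgroup *pc, int flag)
{
	return test_bit(flag, &pc->val);	/* flag must be < 3 */
}
==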


Major changes since v3:
 - replaced move_lock_page_cgroup() with move_lock_mem_cgroup(), which
   takes a pointer to the memcg rather than to the page_cgroup.

This is based on linux-next (Feb 13) and has passed several tests.

Thanks,
-Kame



* [PATCH 1/6 v4] memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat)
  2012-02-14  3:04 [PATCH 0/6 v4] memcg: page cgroup diet KAMEZAWA Hiroyuki
@ 2012-02-14  3:06 ` KAMEZAWA Hiroyuki
  2012-02-14  7:21   ` Greg Thelen
  2012-02-14  3:07 ` [PATCH 2/6 v4] memcg: simplify move_account() check KAMEZAWA Hiroyuki
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-02-14  3:06 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, hannes, Michal Hocko, akpm, Ying Han,
	Hugh Dickins


This is just a cleanup.
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Date: Thu, 2 Feb 2012 12:05:41 +0900
Subject: [PATCH 1/6] memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat)

From the log, I guess EXPORT was for preparing dirty accounting.
But _now_, we don't need to export this. Remove this for now.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ab315ab..4c2b759 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1897,7 +1897,6 @@ out:
 		move_unlock_page_cgroup(pc, &flags);
 	rcu_read_unlock();
 }
-EXPORT_SYMBOL(mem_cgroup_update_page_stat);
 
 /*
  * size of first charge trial. "32" comes from vmscan.c's magic value.
-- 
1.7.4.1




* [PATCH 2/6 v4] memcg: simplify move_account() check
  2012-02-14  3:04 [PATCH 0/6 v4] memcg: page cgroup diet KAMEZAWA Hiroyuki
  2012-02-14  3:06 ` [PATCH 1/6 v4] memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat) KAMEZAWA Hiroyuki
@ 2012-02-14  3:07 ` KAMEZAWA Hiroyuki
  2012-02-14  7:21   ` Greg Thelen
  2012-02-14  3:13 ` [PATCH 3/6 v4] memcg: remove PCG_MOVE_LOCK flag from page_cgroup KAMEZAWA Hiroyuki
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-02-14  3:07 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, hannes, Michal Hocko, akpm, Ying Han,
	Hugh Dickins

From 9cdb3b63dc8d08cc2220c54c80438c13433a0d12 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Date: Thu, 2 Feb 2012 10:02:39 +0900
Subject: [PATCH 2/6] memcg: simplify move_account() check.

In memcg, to avoid taking a lock with IRQs disabled when accessing
page_cgroup, a flag + rcu_read_lock() scheme is used. It works as follows:

     CPU-A                     CPU-B
                             rcu_read_lock()
    set flag
                             if(flag is set)
                                   take heavy lock
                             do job.
    synchronize_rcu()        rcu_read_unlock()

In a recent discussion, it was argued that using a per-cpu value for this
flag just complicates the code, because 'set flag' is very rare.

This patch changes the 'flag' implementation from percpu to atomic_t,
which is much simpler.
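
With atomic_t, the whole pattern becomes roughly (a sketch of the scheme,
not the exact hunks below):
==
	/* mover side (rare) */
	atomic_inc(&memcg->moving_account);	/* "set flag" */
	synchronize_rcu();	/* all readers from here on see the flag */
	/* ... move accounting, taking the heavy lock ... */
	atomic_dec(&memcg->moving_account);

	/* updater side (hot path) */
	rcu_read_lock();
	if (atomic_read(&memcg->moving_account) > 0) {
		/* a move may be in progress: take the heavy lock */
	}
	/* ... do the job ... */
	rcu_read_unlock();
==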

Changelog: v4
 - fixed many typos.
 - fixed return value to be bool
 - add comments.
Changelog: v3
 - this is a new patch since v3.

 memcontrol.c |   65 ++++++++++++++++++++++-------------------------------------
 1 file changed, 25 insertions(+), 40 deletions(-)

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   70 +++++++++++++++++++++++-------------------------------
 1 files changed, 30 insertions(+), 40 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4c2b759..9f69679 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -89,7 +89,6 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
 	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
 	MEM_CGROUP_STAT_DATA, /* end of data requires synchronization */
-	MEM_CGROUP_ON_MOVE,	/* someone is moving account between groups */
 	MEM_CGROUP_STAT_NSTATS,
 };
 
@@ -278,6 +277,10 @@ struct mem_cgroup {
 	 */
 	unsigned long 	move_charge_at_immigrate;
 	/*
+	 * set > 0 if pages under this cgroup are moving to other cgroup.
+	 */
+	atomic_t	moving_account;
+	/*
 	 * percpu counter.
 	 */
 	struct mem_cgroup_stat_cpu *stat;
@@ -1253,36 +1256,37 @@ int mem_cgroup_swappiness(struct mem_cgroup *memcg)
 
 	return memcg->swappiness;
 }
+/*
+ * memcg->moving_account is used for checking possibility that some thread is
+ * calling move_account(). When a thread on CPU-A starts moving pages under
+ * a memcg, other threads should check memcg->moving_account under
+ * rcu_read_lock(), like this:
+ *
+ *         CPU-A                                    CPU-B
+ *                                              rcu_read_lock()
+ *         memcg->moving_account+1              if (memcg->moving_account)
+ *                                                   take heavy locks.
+ *         synchronize_rcu()                    update something.
+ *                                              rcu_read_unlock()
+ *         start move here.
+ */
 
 static void mem_cgroup_start_move(struct mem_cgroup *memcg)
 {
-	int cpu;
-
-	get_online_cpus();
-	spin_lock(&memcg->pcp_counter_lock);
-	for_each_online_cpu(cpu)
-		per_cpu(memcg->stat->count[MEM_CGROUP_ON_MOVE], cpu) += 1;
-	memcg->nocpu_base.count[MEM_CGROUP_ON_MOVE] += 1;
-	spin_unlock(&memcg->pcp_counter_lock);
-	put_online_cpus();
-
+	atomic_inc(&memcg->moving_account);
 	synchronize_rcu();
 }
 
 static void mem_cgroup_end_move(struct mem_cgroup *memcg)
 {
-	int cpu;
-
-	if (!memcg)
-		return;
-	get_online_cpus();
-	spin_lock(&memcg->pcp_counter_lock);
-	for_each_online_cpu(cpu)
-		per_cpu(memcg->stat->count[MEM_CGROUP_ON_MOVE], cpu) -= 1;
-	memcg->nocpu_base.count[MEM_CGROUP_ON_MOVE] -= 1;
-	spin_unlock(&memcg->pcp_counter_lock);
-	put_online_cpus();
+	/*
+	 * Now, mem_cgroup_clear_mc() may call this function with NULL.
+	 * We check NULL in callee rather than caller.
+	 */
+	if (memcg)
+		atomic_dec(&memcg->moving_account);
 }
+
 /*
  * 2 routines for checking "mem" is under move_account() or not.
  *
@@ -1298,7 +1302,7 @@ static void mem_cgroup_end_move(struct mem_cgroup *memcg)
 static bool mem_cgroup_stealed(struct mem_cgroup *memcg)
 {
 	VM_BUG_ON(!rcu_read_lock_held());
-	return this_cpu_read(memcg->stat->count[MEM_CGROUP_ON_MOVE]) > 0;
+	return atomic_read(&memcg->moving_account) > 0;
 }
 
 static bool mem_cgroup_under_move(struct mem_cgroup *memcg)
@@ -1849,8 +1853,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
  * by flags.
  *
  * Considering "move", this is an only case we see a race. To make the race
- * small, we check MEM_CGROUP_ON_MOVE percpu value and detect there are
- * possibility of race condition. If there is, we take a lock.
+ * small, we check memcg->moving_account and detect there are possibility of race
+ * If there is, we take a lock.
  */
 
 void mem_cgroup_update_page_stat(struct page *page,
@@ -2067,17 +2071,6 @@ static void mem_cgroup_drain_pcp_counter(struct mem_cgroup *memcg, int cpu)
 		per_cpu(memcg->stat->events[i], cpu) = 0;
 		memcg->nocpu_base.events[i] += x;
 	}
-	/* need to clear ON_MOVE value, works as a kind of lock. */
-	per_cpu(memcg->stat->count[MEM_CGROUP_ON_MOVE], cpu) = 0;
-	spin_unlock(&memcg->pcp_counter_lock);
-}
-
-static void synchronize_mem_cgroup_on_move(struct mem_cgroup *memcg, int cpu)
-{
-	int idx = MEM_CGROUP_ON_MOVE;
-
-	spin_lock(&memcg->pcp_counter_lock);
-	per_cpu(memcg->stat->count[idx], cpu) = memcg->nocpu_base.count[idx];
 	spin_unlock(&memcg->pcp_counter_lock);
 }
 
@@ -2089,11 +2082,8 @@ static int __cpuinit memcg_cpu_hotplug_callback(struct notifier_block *nb,
 	struct memcg_stock_pcp *stock;
 	struct mem_cgroup *iter;
 
-	if ((action == CPU_ONLINE)) {
-		for_each_mem_cgroup(iter)
-			synchronize_mem_cgroup_on_move(iter, cpu);
+	if ((action == CPU_ONLINE))
 		return NOTIFY_OK;
-	}
 
 	if ((action != CPU_DEAD) || action != CPU_DEAD_FROZEN)
 		return NOTIFY_OK;
-- 
1.7.4.1




* [PATCH 3/6 v4] memcg: remove PCG_MOVE_LOCK flag from page_cgroup.
  2012-02-14  3:04 [PATCH 0/6 v4] memcg: page cgroup diet KAMEZAWA Hiroyuki
  2012-02-14  3:06 ` [PATCH 1/6 v4] memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat) KAMEZAWA Hiroyuki
  2012-02-14  3:07 ` [PATCH 2/6 v4] memcg: simplify move_account() check KAMEZAWA Hiroyuki
@ 2012-02-14  3:13 ` KAMEZAWA Hiroyuki
  2012-02-14  7:21   ` Greg Thelen
  2012-02-14  3:14 ` [PATCH 4/6 v4] memcg: use new logic for page stat accounting KAMEZAWA Hiroyuki
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-02-14  3:13 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, hannes, Michal Hocko, akpm, Ying Han,
	Hugh Dickins

From ffd1b013fe294a80c12e3f30e85386135a6fb284 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Date: Thu, 2 Feb 2012 11:49:59 +0900
Subject: [PATCH 3/6] memcg: remove PCG_MOVE_LOCK flag from page_cgroup.

PCG_MOVE_LOCK is used as a bit spinlock to avoid a race between overwriting
pc->mem_cgroup and per-memcg page statistics accounting.
The lock helps to avoid the race, but the race is very rare because moving
tasks between cgroups is not a common operation.
So, it seems spending 1 bit per page on it is too costly.

This patch turns this lock into a per-memcg spinlock and removes
PCG_MOVE_LOCK.

If a finer-grained lock is required, we'll be able to add some hashing, but
I'd like to start from this.
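
For reference, the hashed version could look roughly like this (a sketch
only; nothing below is implemented by this patch, all names are made up):
==
#include <linux/hash.h>

#define MOVE_LOCK_BITS	6
static spinlock_t move_locks[1 << MOVE_LOCK_BITS];

static spinlock_t *move_lock_for(struct page *page)
{
	/* hash the page to one of 64 shared spinlocks */
	return &move_locks[hash_ptr(page, MOVE_LOCK_BITS)];
}
==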

Changelog:
  - fixed to pass memcg as an argument rather than page_cgroup,
    and renamed move_lock_page_cgroup() to move_lock_mem_cgroup().

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/page_cgroup.h |   19 -------------------
 mm/memcontrol.c             |   42 ++++++++++++++++++++++++++++++++----------
 2 files changed, 32 insertions(+), 29 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 1060292..7a3af74 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -7,7 +7,6 @@ enum {
 	PCG_USED, /* this object is in use. */
 	PCG_MIGRATION, /* under page migration */
 	/* flags for mem_cgroup and file and I/O status */
-	PCG_MOVE_LOCK, /* For race between move_account v.s. following bits */
 	PCG_FILE_MAPPED, /* page is accounted as "mapped" */
 	__NR_PCG_FLAGS,
 };
@@ -89,24 +88,6 @@ static inline void unlock_page_cgroup(struct page_cgroup *pc)
 	bit_spin_unlock(PCG_LOCK, &pc->flags);
 }
 
-static inline void move_lock_page_cgroup(struct page_cgroup *pc,
-	unsigned long *flags)
-{
-	/*
-	 * We know updates to pc->flags of page cache's stats are from both of
-	 * usual context or IRQ context. Disable IRQ to avoid deadlock.
-	 */
-	local_irq_save(*flags);
-	bit_spin_lock(PCG_MOVE_LOCK, &pc->flags);
-}
-
-static inline void move_unlock_page_cgroup(struct page_cgroup *pc,
-	unsigned long *flags)
-{
-	bit_spin_unlock(PCG_MOVE_LOCK, &pc->flags);
-	local_irq_restore(*flags);
-}
-
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct page_cgroup;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9f69679..ecf8856 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -280,6 +280,8 @@ struct mem_cgroup {
 	 * set > 0 if pages under this cgroup are moving to other cgroup.
 	 */
 	atomic_t	moving_account;
+	/* taken only while moving_account > 0 */
+	spinlock_t	move_lock;
 	/*
 	 * percpu counter.
 	 */
@@ -1343,6 +1345,24 @@ static bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg)
 	return false;
 }
 
+/*
+ * Take this lock when
+ * - a code tries to modify page's memcg while it's USED.
+ * - a code tries to modify page state accounting in a memcg.
+ * see mem_cgroup_stealed(), too.
+ */
+static void move_lock_mem_cgroup(struct mem_cgroup *memcg,
+				  unsigned long *flags)
+{
+	spin_lock_irqsave(&memcg->move_lock, *flags);
+}
+
+static void move_unlock_mem_cgroup(struct mem_cgroup *memcg,
+				unsigned long *flags)
+{
+	spin_unlock_irqrestore(&memcg->move_lock, *flags);
+}
+
 /**
  * mem_cgroup_print_oom_info: Called from OOM with tasklist_lock held in read mode.
  * @memcg: The memory cgroup that went over limit
@@ -1867,7 +1887,7 @@ void mem_cgroup_update_page_stat(struct page *page,
 
 	if (mem_cgroup_disabled())
 		return;
-
+again:
 	rcu_read_lock();
 	memcg = pc->mem_cgroup;
 	if (unlikely(!memcg || !PageCgroupUsed(pc)))
@@ -1875,11 +1895,13 @@ void mem_cgroup_update_page_stat(struct page *page,
 	/* pc->mem_cgroup is unstable ? */
 	if (unlikely(mem_cgroup_stealed(memcg))) {
 		/* take a lock against to access pc->mem_cgroup */
-		move_lock_page_cgroup(pc, &flags);
+		move_lock_mem_cgroup(memcg, &flags);
+		if (memcg != pc->mem_cgroup || !PageCgroupUsed(pc)) {
+			move_unlock_mem_cgroup(memcg, &flags);
+			rcu_read_unlock();
+			goto again;
+		}
 		need_unlock = true;
-		memcg = pc->mem_cgroup;
-		if (!memcg || !PageCgroupUsed(pc))
-			goto out;
 	}
 
 	switch (idx) {
@@ -1898,7 +1920,7 @@ void mem_cgroup_update_page_stat(struct page *page,
 
 out:
 	if (unlikely(need_unlock))
-		move_unlock_page_cgroup(pc, &flags);
+		move_unlock_mem_cgroup(memcg, &flags);
 	rcu_read_unlock();
 }
 
@@ -2440,8 +2462,7 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 
-#define PCGF_NOCOPY_AT_SPLIT ((1 << PCG_LOCK) | (1 << PCG_MOVE_LOCK) |\
-			(1 << PCG_MIGRATION))
+#define PCGF_NOCOPY_AT_SPLIT ((1 << PCG_LOCK) | (1 << PCG_MIGRATION))
 /*
  * Because tail pages are not marked as "used", set it. We're under
  * zone->lru_lock, 'splitting on pmd' and compound_lock.
@@ -2512,7 +2533,7 @@ static int mem_cgroup_move_account(struct page *page,
 	if (!PageCgroupUsed(pc) || pc->mem_cgroup != from)
 		goto unlock;
 
-	move_lock_page_cgroup(pc, &flags);
+	move_lock_mem_cgroup(from, &flags);
 
 	if (PageCgroupFileMapped(pc)) {
 		/* Update mapped_file data for mem_cgroup */
@@ -2536,7 +2557,7 @@ static int mem_cgroup_move_account(struct page *page,
 	 * guaranteed that "to" is never removed. So, we don't check rmdir
 	 * status here.
 	 */
-	move_unlock_page_cgroup(pc, &flags);
+	move_unlock_mem_cgroup(from, &flags);
 	ret = 0;
 unlock:
 	unlock_page_cgroup(pc);
@@ -4928,6 +4949,7 @@ mem_cgroup_create(struct cgroup *cont)
 	atomic_set(&memcg->refcnt, 1);
 	memcg->move_charge_at_immigrate = 0;
 	mutex_init(&memcg->thresholds_lock);
+	spin_lock_init(&memcg->move_lock);
 	return &memcg->css;
 free_out:
 	__mem_cgroup_free(memcg);
-- 
1.7.4.1




* [PATCH 4/6 v4] memcg: use new logic for page stat accounting
  2012-02-14  3:04 [PATCH 0/6 v4] memcg: page cgroup diet KAMEZAWA Hiroyuki
                   ` (2 preceding siblings ...)
  2012-02-14  3:13 ` [PATCH 3/6 v4] memcg: remove PCG_MOVE_LOCK flag from page_cgroup KAMEZAWA Hiroyuki
@ 2012-02-14  3:14 ` KAMEZAWA Hiroyuki
  2012-02-14  7:22   ` Greg Thelen
  2012-02-14  3:15 ` [PATCH 5/6 v4] memcg: remove PCG_FILE_MAPPED KAMEZAWA Hiroyuki
  2012-02-14  3:16 ` [PATCH 6/6 v4] memcg: fix performance of mem_cgroup_begin_update_page_stat() KAMEZAWA Hiroyuki
  5 siblings, 1 reply; 16+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-02-14  3:14 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, hannes, Michal Hocko, akpm, Ying Han,
	Hugh Dickins

From ad2905362ef58a44d96a325193ab384739418050 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Date: Thu, 2 Feb 2012 11:49:59 +0900
Subject: [PATCH 4/6] memcg: use new logic for page stat accounting.

Now, page-stat-per-memcg is recorded into per page_cgroup flag by
duplicating page's status into the flag. The reason is that memcg
has a feature to move a page from a group to another group and we
have race between "move" and "page stat accounting",

Under current logic, assume CPU-A and CPU-B. CPU-A does "move"
and CPU-B does "page stat accounting".

When CPU-A goes 1st,

            CPU-A                           CPU-B
                                    update "struct page" info.
    move_lock_mem_cgroup(memcg)
    see flags
    copy page stat to new group
    overwrite pc->mem_cgroup.
    move_unlock_mem_cgroup(memcg)
                                    move_lock_mem_cgroup(mem)
                                    set pc->flags
                                    update page stat accounting
                                    move_unlock_mem_cgroup(mem)

stat accounting is guarded by move_lock_mem_cgroup() and "move"
logic (CPU-A) doesn't see changes in "struct page" information.

But it's costly to have the same information both in 'struct page' and
'struct page_cgroup'. And, there is a potential problem.

For example, assume we have PG_dirty accounting in memcg.
PG_..is a flag for struct page.
PCG_ is a flag for struct page_cgroup.
(This is just an example. The same problem can be found in any
 kind of page stat accounting.)

	  CPU-A                               CPU-B
      TestSet PG_dirty
      (delay)                        TestClear PG_dirty_
                                     if (TestClear(PCG_dirty))
                                          memcg->nr_dirty--
      if (TestSet(PCG_dirty))
          memcg->nr_dirty++

Here, memcg->nr_dirty ends up changed by +1, which is wrong.
This race was reported by Greg Thelen <gthelen@google.com>.
Now, only FILE_MAPPED is supported and, fortunately, it's serialized by
the page table lock, so this is not a real bug, _now_.

Since this potential problem is caused by having duplicated information in
struct page and struct page_cgroup, we may be able to fix it by using only
the original 'struct page' information. But then we'd have a problem with
"move account":

Assume we use only PG_dirty.

         CPU-A                   CPU-B
    TestSet PG_dirty
    (delay)                    move_lock_mem_cgroup()
                               if (PageDirty(page))
                                      new_memcg->nr_dirty++
                               pc->mem_cgroup = new_memcg;
                               move_unlock_mem_cgroup()
    move_lock_mem_cgroup()
    memcg = pc->mem_cgroup
    new_memcg->nr_dirty++

The accounting information may be double-counted. This was the original
reason to have the PCG_xxx flags, but as seen above, PCG_xxx has its own
problem.

I think we need a bigger lock, as in:

     move_lock_mem_cgroup(page)
     TestSetPageDirty(page)
     update page stats (without any checks)
     move_unlock_mem_cgroup(page)

This fixes both problems, and we don't have to duplicate the page flag
in page_cgroup. Please note: move_lock_mem_cgroup() is taken only while
there is a possibility of "account move" in the system, so on most paths
the status update runs without taking the heavy lock.

This patch introduces mem_cgroup_begin_update_page_stat() and
mem_cgroup_end_update_page_stat(); both should be called when modifying
'struct page' information that memcg accounts, as in:

     mem_cgroup_begin_update_page_stat()
     modify page information
     mem_cgroup_update_page_stat()
     => never check any 'struct page' info, just update counters.
     mem_cgroup_end_update_page_stat().
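
For example, a hypothetical dirty-page accounting path would become the
following (a sketch only; memcg has no dirty accounting yet, and both
account_page_dirtied_memcg() and MEMCG_NR_DIRTY are made-up names):
==
void account_page_dirtied_memcg(struct page *page)
{
	bool locked;
	unsigned long flags;

	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
	if (!TestSetPageDirty(page))	/* modify 'struct page' info */
		mem_cgroup_update_page_stat(page, MEMCG_NR_DIRTY, 1);
	mem_cgroup_end_update_page_stat(page, &locked, &flags);
}
==
Because pc->mem_cgroup cannot change between begin and end, no PCG_dirty
copy of the flag is needed.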

This patch is slow because we need to call begin_update_page_stat()/
end_update_page_stat() regardless of whether the accounted state will
change or not. A following patch adds an easy optimization to reduce
the cost.

Changes since v3
 - fixed typos

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h |   35 +++++++++++++++++++++++++++
 mm/memcontrol.c            |   57 ++++++++++++++++++++++++++++---------------
 mm/rmap.c                  |   15 ++++++++++-
 3 files changed, 85 insertions(+), 22 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index bf4e1f4..3f3ef33 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -141,6 +141,31 @@ static inline bool mem_cgroup_disabled(void)
 	return false;
 }
 
+void __mem_cgroup_begin_update_page_stat(struct page *page,
+					bool *lock, unsigned long *flags);
+
+static inline void mem_cgroup_begin_update_page_stat(struct page *page,
+					bool *lock, unsigned long *flags)
+{
+	if (mem_cgroup_disabled())
+		return;
+	rcu_read_lock();
+	*lock = false;
+	return __mem_cgroup_begin_update_page_stat(page, lock, flags);
+}
+
+void __mem_cgroup_end_update_page_stat(struct page *page, bool *lock,
+				unsigned long *flags);
+static inline void mem_cgroup_end_update_page_stat(struct page *page,
+					bool *lock, unsigned long *flags)
+{
+	if (mem_cgroup_disabled())
+		return;
+	if (*lock)
+		__mem_cgroup_end_update_page_stat(page, lock, flags);
+	rcu_read_unlock();
+}
+
 void mem_cgroup_update_page_stat(struct page *page,
 				 enum mem_cgroup_page_stat_item idx,
 				 int val);
@@ -356,6 +381,16 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
 {
 }
 
+static inline void mem_cgroup_begin_update_page_stat(struct page *page,
+					bool *lock, unsigned long *flags)
+{
+}
+
+static inline void mem_cgroup_end_update_page_stat(struct page *page,
+					bool *lock, unsigned long *flags)
+{
+}
+
 static inline void mem_cgroup_inc_page_stat(struct page *page,
 					    enum mem_cgroup_page_stat_item idx)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ecf8856..30afea5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1877,32 +1877,54 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
  * If there is, we take a lock.
  */
 
+void __mem_cgroup_begin_update_page_stat(struct page *page,
+                               bool *lock, unsigned long *flags)
+{
+	struct mem_cgroup *memcg;
+	struct page_cgroup *pc;
+
+	pc = lookup_page_cgroup(page);
+again:
+	memcg = pc->mem_cgroup;
+	if (unlikely(!memcg || !PageCgroupUsed(pc)))
+		return;
+	if (!mem_cgroup_stealed(memcg))
+		return;
+
+	move_lock_mem_cgroup(memcg, flags);
+	if (memcg != pc->mem_cgroup || !PageCgroupUsed(pc)) {
+		move_unlock_mem_cgroup(memcg, flags);
+		goto again;
+	}
+	*lock = true;
+}
+
+void __mem_cgroup_end_update_page_stat(struct page *page,
+                               bool *lock, unsigned long *flags)
+{
+	struct page_cgroup *pc = lookup_page_cgroup(page);
+
+	/*
+	 * It's guaranteed that pc->mem_cgroup never changes while
+	 * lock is held
+	 */
+	move_unlock_mem_cgroup(pc->mem_cgroup, flags);
+}
+
+
 void mem_cgroup_update_page_stat(struct page *page,
 				 enum mem_cgroup_page_stat_item idx, int val)
 {
 	struct mem_cgroup *memcg;
 	struct page_cgroup *pc = lookup_page_cgroup(page);
-	bool need_unlock = false;
 	unsigned long uninitialized_var(flags);
 
 	if (mem_cgroup_disabled())
 		return;
-again:
-	rcu_read_lock();
+
 	memcg = pc->mem_cgroup;
 	if (unlikely(!memcg || !PageCgroupUsed(pc)))
-		goto out;
-	/* pc->mem_cgroup is unstable ? */
-	if (unlikely(mem_cgroup_stealed(memcg))) {
-		/* take a lock against to access pc->mem_cgroup */
-		move_lock_mem_cgroup(memcg, &flags);
-		if (memcg != pc->mem_cgroup || !PageCgroupUsed(pc)) {
-			move_unlock_mem_cgroup(memcg, &flags);
-			rcu_read_unlock();
-			goto again;
-		}
-		need_unlock = true;
-	}
+		return;
 
 	switch (idx) {
 	case MEMCG_NR_FILE_MAPPED:
@@ -1917,11 +1939,6 @@ again:
 	}
 
 	this_cpu_add(memcg->stat->count[idx], val);
-
-out:
-	if (unlikely(need_unlock))
-		move_unlock_mem_cgroup(memcg, &flags);
-	rcu_read_unlock();
 }
 
 /*
diff --git a/mm/rmap.c b/mm/rmap.c
index aa547d4..dd193db 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1150,10 +1150,15 @@ void page_add_new_anon_rmap(struct page *page,
  */
 void page_add_file_rmap(struct page *page)
 {
+	bool locked;
+	unsigned long flags;
+
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	if (atomic_inc_and_test(&page->_mapcount)) {
 		__inc_zone_page_state(page, NR_FILE_MAPPED);
 		mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
 	}
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 }
 
 /**
@@ -1164,9 +1169,13 @@ void page_add_file_rmap(struct page *page)
  */
 void page_remove_rmap(struct page *page)
 {
+	bool locked;
+	unsigned long flags;
+
+	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
 	/* page still mapped by someone else? */
 	if (!atomic_add_negative(-1, &page->_mapcount))
-		return;
+		goto out;
 
 	/*
 	 * Now that the last pte has gone, s390 must transfer dirty
@@ -1183,7 +1192,7 @@ void page_remove_rmap(struct page *page)
 	 * and not charged by memcg for now.
 	 */
 	if (unlikely(PageHuge(page)))
-		return;
+		goto out;
 	if (PageAnon(page)) {
 		mem_cgroup_uncharge_page(page);
 		if (!PageTransHuge(page))
@@ -1204,6 +1213,8 @@ void page_remove_rmap(struct page *page)
 	 * Leaving it set also helps swapoff to reinstate ptes
 	 * faster for those pages still in swapcache.
 	 */
+out:
+	mem_cgroup_end_update_page_stat(page, &locked, &flags);
 }
 
 /*
-- 
1.7.4.1




* [PATCH 5/6 v4] memcg: remove PCG_FILE_MAPPED
  2012-02-14  3:04 [PATCH 0/6 v4] memcg: page cgroup diet KAMEZAWA Hiroyuki
                   ` (3 preceding siblings ...)
  2012-02-14  3:14 ` [PATCH 4/6 v4] memcg: use new logic for page stat accounting KAMEZAWA Hiroyuki
@ 2012-02-14  3:15 ` KAMEZAWA Hiroyuki
  2012-02-14  7:22   ` Greg Thelen
  2012-02-14  3:16 ` [PATCH 6/6 v4] memcg: fix performance of mem_cgroup_begin_update_page_stat() KAMEZAWA Hiroyuki
  5 siblings, 1 reply; 16+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-02-14  3:15 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, hannes, Michal Hocko, akpm, Ying Han,
	Hugh Dickins

From 96407a510d5212179a4768f28591b35d5c17d403 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Date: Thu, 2 Feb 2012 15:02:18 +0900
Subject: [PATCH 5/6] memcg: remove PCG_FILE_MAPPED

With the new locking scheme for updating memcg's page stats, we don't need
the PCG_FILE_MAPPED flag, which duplicated the information of page_mapped().

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/page_cgroup.h |    6 ------
 mm/memcontrol.c             |    6 +-----
 2 files changed, 1 insertions(+), 11 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 7a3af74..a88cdba 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -6,8 +6,6 @@ enum {
 	PCG_LOCK,  /* Lock for pc->mem_cgroup and following bits. */
 	PCG_USED, /* this object is in use. */
 	PCG_MIGRATION, /* under page migration */
-	/* flags for mem_cgroup and file and I/O status */
-	PCG_FILE_MAPPED, /* page is accounted as "mapped" */
 	__NR_PCG_FLAGS,
 };
 
@@ -66,10 +64,6 @@ TESTPCGFLAG(Used, USED)
 CLEARPCGFLAG(Used, USED)
 SETPCGFLAG(Used, USED)
 
-SETPCGFLAG(FileMapped, FILE_MAPPED)
-CLEARPCGFLAG(FileMapped, FILE_MAPPED)
-TESTPCGFLAG(FileMapped, FILE_MAPPED)
-
 SETPCGFLAG(Migration, MIGRATION)
 CLEARPCGFLAG(Migration, MIGRATION)
 TESTPCGFLAG(Migration, MIGRATION)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 30afea5..ec55880 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1928,10 +1928,6 @@ void mem_cgroup_update_page_stat(struct page *page,
 
 	switch (idx) {
 	case MEMCG_NR_FILE_MAPPED:
-		if (val > 0)
-			SetPageCgroupFileMapped(pc);
-		else if (!page_mapped(page))
-			ClearPageCgroupFileMapped(pc);
 		idx = MEM_CGROUP_STAT_FILE_MAPPED;
 		break;
 	default:
@@ -2552,7 +2548,7 @@ static int mem_cgroup_move_account(struct page *page,
 
 	move_lock_mem_cgroup(from, &flags);
 
-	if (PageCgroupFileMapped(pc)) {
+	if (page_mapped(page)) {
 		/* Update mapped_file data for mem_cgroup */
 		preempt_disable();
 		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-- 
1.7.4.1




* [PATCH 6/6 v4] memcg: fix performance of mem_cgroup_begin_update_page_stat()
  2012-02-14  3:04 [PATCH 0/6 v4] memcg: page cgroup diet KAMEZAWA Hiroyuki
                   ` (4 preceding siblings ...)
  2012-02-14  3:15 ` [PATCH 5/6 v4] memcg: remove PCG_FILE_MAPPED KAMEZAWA Hiroyuki
@ 2012-02-14  3:16 ` KAMEZAWA Hiroyuki
  2012-02-14  7:22   ` Greg Thelen
  5 siblings, 1 reply; 16+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-02-14  3:16 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, hannes, Michal Hocko, akpm, Ying Han,
	Hugh Dickins

From 3377fd7b6e23a5d2a368c078eae27e2b49c4f4aa Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Date: Mon, 6 Feb 2012 12:14:47 +0900
Subject: [PATCH 6/6] memcg: fix performance of mem_cgroup_begin_update_page_stat()

mem_cgroup_begin_update_page_stat() should be very fast because
it's called very frequently. Now, it needs to look up page_cgroup
and its memcg....this is slow.

This patch adds a global variable to check "a memcg is moving or not".
By this, the caller doesn't need to visit page_cgroup and memcg.

Here is a test result. A test program makes page faults onto a file
mapped MAP_SHARED, makes each page's page_mapcount(page) > 1, frees the
range by madvise(), and faults the pages in again.  This program causes
26214400 page faults onto a 1G file and shows the cost of
mem_cgroup_begin_update_page_stat().

Before this patch for mem_cgroup_begin_update_page_stat()
[kamezawa@bluextal test]$ time ./mmap 1G

real    0m21.765s
user    0m5.999s
sys     0m15.434s

    27.46%     mmap  mmap               [.] reader
    21.15%     mmap  [kernel.kallsyms]  [k] page_fault
     9.17%     mmap  [kernel.kallsyms]  [k] filemap_fault
     2.96%     mmap  [kernel.kallsyms]  [k] __do_fault
     2.83%     mmap  [kernel.kallsyms]  [k] __mem_cgroup_begin_update_page_stat

After this patch
[root@bluextal test]# time ./mmap 1G

real    0m21.373s
user    0m6.113s
sys     0m15.016s

In the usual path, calls to __mem_cgroup_begin_update_page_stat() go away.

Note: we may be able to remove this optimization in future if
      we can get pointer to memcg directly from struct page.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h |    5 +++--
 mm/memcontrol.c            |    7 ++++++-
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3f3ef33..3df9979 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -143,7 +143,7 @@ static inline bool mem_cgroup_disabled(void)
 
 void __mem_cgroup_begin_update_page_stat(struct page *page,
 					bool *lock, unsigned long *flags);
-
+extern atomic_t memcg_moving;
 static inline void mem_cgroup_begin_update_page_stat(struct page *page,
 					bool *lock, unsigned long *flags)
 {
@@ -151,7 +151,8 @@ static inline void mem_cgroup_begin_update_page_stat(struct page *page,
 		return;
 	rcu_read_lock();
 	*lock = false;
-	return __mem_cgroup_begin_update_page_stat(page, lock, flags);
+	if (atomic_read(&memcg_moving))
+		return __mem_cgroup_begin_update_page_stat(page, lock, flags);
 }
 
 void __mem_cgroup_end_update_page_stat(struct page *page, bool *lock,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ec55880..fd02fab 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1273,8 +1273,11 @@ int mem_cgroup_swappiness(struct mem_cgroup *memcg)
  *         start move here.
  */
 
+/* for quick checking without looking up memcg */
+atomic_t memcg_moving __read_mostly;
 static void mem_cgroup_start_move(struct mem_cgroup *memcg)
 {
+	atomic_inc(&memcg_moving);
 	atomic_inc(&memcg->moving_account);
 	synchronize_rcu();
 }
@@ -1285,8 +1288,10 @@ static void mem_cgroup_end_move(struct mem_cgroup *memcg)
 	 * Now, mem_cgroup_clear_mc() may call this function with NULL.
 	 * We check NULL in callee rather than caller.
 	 */
-	if (memcg)
+	if (memcg) {
+		atomic_dec(&memcg_moving);
 		atomic_dec(&memcg->moving_account);
+	}
 }
 
 /*
-- 
1.7.4.1




* Re: [PATCH 1/6 v4] memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat)
  2012-02-14  3:06 ` [PATCH 1/6 v4] memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat) KAMEZAWA Hiroyuki
@ 2012-02-14  7:21   ` Greg Thelen
  0 siblings, 0 replies; 16+ messages in thread
From: Greg Thelen @ 2012-02-14  7:21 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, hannes, Michal Hocko, akpm, Ying Han,
	Hugh Dickins

On Mon, Feb 13, 2012 at 7:06 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> This is just a cleanup.
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Date: Thu, 2 Feb 2012 12:05:41 +0900
> Subject: [PATCH 1/6] memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat)
>
> From the log, I guess EXPORT was for preparing dirty accounting.
> But _now_, we don't need to export this. Remove this for now.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Looks good to me.

Reviewed-by: Greg Thelen <gthelen@google.com>

> ---
>  mm/memcontrol.c |    1 -
>  1 files changed, 0 insertions(+), 1 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ab315ab..4c2b759 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1897,7 +1897,6 @@ out:
>                move_unlock_page_cgroup(pc, &flags);
>        rcu_read_unlock();
>  }
> -EXPORT_SYMBOL(mem_cgroup_update_page_stat);
>
>  /*
>  * size of first charge trial. "32" comes from vmscan.c's magic value.
> --
> 1.7.4.1


* Re: [PATCH 2/6 v4] memcg: simplify move_account() check
  2012-02-14  3:07 ` [PATCH 2/6 v4] memcg: simplify move_account() check KAMEZAWA Hiroyuki
@ 2012-02-14  7:21   ` Greg Thelen
  2012-02-14  8:44     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 16+ messages in thread
From: Greg Thelen @ 2012-02-14  7:21 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, hannes, Michal Hocko, akpm, Ying Han,
	Hugh Dickins

On Mon, Feb 13, 2012 at 7:07 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> From 9cdb3b63dc8d08cc2220c54c80438c13433a0d12 Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Date: Thu, 2 Feb 2012 10:02:39 +0900
> Subject: [PATCH 2/6] memcg: simplify move_account() check.
>
> In memcg, to avoid taking a lock with IRQs disabled when accessing
> page_cgroup, a flag + rcu_read_lock() scheme is used. It works as follows:
>
>     CPU-A                     CPU-B
>                             rcu_read_lock()
>    set flag
>                             if(flag is set)
>                                   take heavy lock
>                             do job.
>    synchronize_rcu()        rcu_read_unlock()

I assume that CPU-A will take heavy lock after synchronize_rcu() when
updating variables read by CPU-B.

>  memcontrol.c |   65 ++++++++++++++++++++++-------------------------------------
>  1 file changed, 25 insertions(+), 40 deletions(-)
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Acked-by: Greg Thelen <gthelen@google.com>

> ---
>  mm/memcontrol.c |   70 +++++++++++++++++++++++-------------------------------
>  1 files changed, 30 insertions(+), 40 deletions(-)

> @@ -2089,11 +2082,8 @@ static int __cpuinit memcg_cpu_hotplug_callback(struct notifier_block *nb,
>        struct memcg_stock_pcp *stock;
>        struct mem_cgroup *iter;
>
> -       if ((action == CPU_ONLINE)) {
> -               for_each_mem_cgroup(iter)
> -                       synchronize_mem_cgroup_on_move(iter, cpu);
> +       if ((action == CPU_ONLINE))

Extra parenthesis.  I recommend:
+       if (action == CPU_ONLINE)


* Re: [PATCH 3/6 v4] memcg: remove PCG_MOVE_LOCK flag from page_cgroup.
  2012-02-14  3:13 ` [PATCH 3/6 v4] memcg: remove PCG_MOVE_LOCK flag from page_cgroup KAMEZAWA Hiroyuki
@ 2012-02-14  7:21   ` Greg Thelen
  0 siblings, 0 replies; 16+ messages in thread
From: Greg Thelen @ 2012-02-14  7:21 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, hannes, Michal Hocko, akpm, Ying Han,
	Hugh Dickins

On Mon, Feb 13, 2012 at 7:13 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> From ffd1b013fe294a80c12e3f30e85386135a6fb284 Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Date: Thu, 2 Feb 2012 11:49:59 +0900
> Subject: [PATCH 3/6] memcg: remove PCG_MOVE_LOCK flag from page_cgroup.
>
> PCG_MOVE_LOCK is used as a bit spinlock to avoid a race between overwriting
> pc->mem_cgroup and per-memcg page statistics accounting.
> The lock helps to avoid the race, but the race is very rare because moving
> tasks between cgroups is not a common operation.
> So, it seems spending 1 bit per page on it is too costly.
>
> This patch turns this lock into a per-memcg spinlock and removes
> PCG_MOVE_LOCK.
>
> If a finer-grained lock is required, we'll be able to add some hashing, but
> I'd like to start from this.
>
> Changelog:
>  - fixed to pass memcg as an argument rather than page_cgroup,
>    and renamed move_lock_page_cgroup() to move_lock_mem_cgroup().
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Seems good.  Thanks.

Acked-by: Greg Thelen <gthelen@google.com>


* Re: [PATCH 4/6 v4] memcg: use new logic for page stat accounting
  2012-02-14  3:14 ` [PATCH 4/6 v4] memcg: use new logic for page stat accounting KAMEZAWA Hiroyuki
@ 2012-02-14  7:22   ` Greg Thelen
  2012-02-14  8:43     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 16+ messages in thread
From: Greg Thelen @ 2012-02-14  7:22 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, hannes, Michal Hocko, akpm, Ying Han,
	Hugh Dickins

On Mon, Feb 13, 2012 at 7:14 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> From ad2905362ef58a44d96a325193ab384739418050 Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Date: Thu, 2 Feb 2012 11:49:59 +0900
> Subject: [PATCH 4/6] memcg: use new logic for page stat accounting.
>
> Now, page-stat-per-memcg is recorded into per page_cgroup flag by
> duplicating page's status into the flag. The reason is that memcg
> has a feature to move a page from a group to another group and we
> have race between "move" and "page stat accounting",
>
> Under current logic, assume CPU-A and CPU-B. CPU-A does "move"
> and CPU-B does "page stat accounting".
>
> When CPU-A goes 1st,
>
>            CPU-A                           CPU-B
>                                    update "struct page" info.
>    move_lock_mem_cgroup(memcg)
>    see flags

pc->flags?

>    copy page stat to new group
>    overwrite pc->mem_cgroup.
>    move_unlock_mem_cgroup(memcg)
>                                    move_lock_mem_cgroup(mem)
>                                    set pc->flags
>                                    update page stat accounting
>                                    move_unlock_mem_cgroup(mem)
>
> stat accounting is guarded by move_lock_mem_cgroup() and "move"
> logic (CPU-A) doesn't see changes in "struct page" information.
>
> But it's costly to have the same information both in 'struct page' and
> 'struct page_cgroup'. And, there is a potential problem.
>
> For example, assume we have PG_dirty accounting in memcg.
> PG_..is a flag for struct page.
> PCG_ is a flag for struct page_cgroup.
> (This is just an example. The same problem can be found in any
>  kind of page stat accounting.)
>
>          CPU-A                               CPU-B
>      TestSet PG_dirty
>      (delay)                        TestClear PG_dirty_

PG_dirty

>                                     if (TestClear(PCG_dirty))
>                                          memcg->nr_dirty--
>      if (TestSet(PCG_dirty))
>          memcg->nr_dirty++
>

> @@ -141,6 +141,31 @@ static inline bool mem_cgroup_disabled(void)
>        return false;
>  }
>
> +void __mem_cgroup_begin_update_page_stat(struct page *page,
> +                                       bool *lock, unsigned long *flags);
> +
> +static inline void mem_cgroup_begin_update_page_stat(struct page *page,
> +                                       bool *lock, unsigned long *flags)
> +{
> +       if (mem_cgroup_disabled())
> +               return;
> +       rcu_read_lock();
> +       *lock = false;

This seems like a strange place to set *lock=false.  I think it's
clearer if __mem_cgroup_begin_update_page_stat() is the only routine
that sets or clears *lock.  But I do see that in patch 6/6 'memcg: fix
performance of mem_cgroup_begin_update_page_stat()' this position is
required.

> +       return __mem_cgroup_begin_update_page_stat(page, lock, flags);
> +}

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ecf8856..30afea5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1877,32 +1877,54 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
>  * If there is, we take a lock.
>  */
>
> +void __mem_cgroup_begin_update_page_stat(struct page *page,
> +                               bool *lock, unsigned long *flags)
> +{
> +       struct mem_cgroup *memcg;
> +       struct page_cgroup *pc;
> +
> +       pc = lookup_page_cgroup(page);
> +again:
> +       memcg = pc->mem_cgroup;
> +       if (unlikely(!memcg || !PageCgroupUsed(pc)))
> +               return;
> +       if (!mem_cgroup_stealed(memcg))
> +               return;
> +
> +       move_lock_mem_cgroup(memcg, flags);
> +       if (memcg != pc->mem_cgroup || !PageCgroupUsed(pc)) {
> +               move_unlock_mem_cgroup(memcg, flags);
> +               goto again;
> +       }
> +       *lock = true;
> +}
> +
> +void __mem_cgroup_end_update_page_stat(struct page *page,
> +                               bool *lock, unsigned long *flags)

'lock' looks like an unused parameter.  If so, then remove it.

> +{
> +       struct page_cgroup *pc = lookup_page_cgroup(page);
> +
> +       /*
> +        * It's guaranteed that pc->mem_cgroup never changes while
> +        * lock is held

Please continue comment describing what provides this guarantee.  I
assume it is because rcu_read_lock() is held by
mem_cgroup_begin_update_page_stat().  Maybe it's best to just make a
small reference to the locking protocol description in
mem_cgroup_start_move().

> +        */
> +       move_unlock_mem_cgroup(pc->mem_cgroup, flags);
> +}
> +
> +

I think it would be useful to add a small comment here declaring that
all callers of this routine must be in a
mem_cgroup_begin_update_page_stat(), mem_cgroup_end_update_page_stat()
critical section to keep pc->mem_cgroup stable.

>  void mem_cgroup_update_page_stat(struct page *page,
>                                 enum mem_cgroup_page_stat_item idx, int val)
>  {


* Re: [PATCH 5/6 v4] memcg: remove PCG_FILE_MAPPED
  2012-02-14  3:15 ` [PATCH 5/6 v4] memcg: remove PCG_FILE_MAPPED KAMEZAWA Hiroyuki
@ 2012-02-14  7:22   ` Greg Thelen
  0 siblings, 0 replies; 16+ messages in thread
From: Greg Thelen @ 2012-02-14  7:22 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, hannes, Michal Hocko, akpm, Ying Han,
	Hugh Dickins

On Mon, Feb 13, 2012 at 7:15 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> From 96407a510d5212179a4768f28591b35d5c17d403 Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Date: Thu, 2 Feb 2012 15:02:18 +0900
> Subject: [PATCH 5/6] memcg: remove PCG_FILE_MAPPED
>
> With the new locking scheme for updating memcg's page stats, we don't need
> the PCG_FILE_MAPPED flag, which duplicated the information of page_mapped().
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Thanks!

Acked-by: Greg Thelen <gthelen@google.com>


* Re: [PATCH 6/6 v4] memcg: fix performance of mem_cgroup_begin_update_page_stat()
  2012-02-14  3:16 ` [PATCH 6/6 v4] memcg: fix performance of mem_cgroup_begin_update_page_stat() KAMEZAWA Hiroyuki
@ 2012-02-14  7:22   ` Greg Thelen
  2012-02-14 11:08     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 16+ messages in thread
From: Greg Thelen @ 2012-02-14  7:22 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, hannes, Michal Hocko, akpm, Ying Han,
	Hugh Dickins

On Mon, Feb 13, 2012 at 7:16 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> From 3377fd7b6e23a5d2a368c078eae27e2b49c4f4aa Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Date: Mon, 6 Feb 2012 12:14:47 +0900
> Subject: [PATCH 6/6] memcg: fix performance of mem_cgroup_begin_update_page_stat()
>
> mem_cgroup_begin_update_page_stat() should be very fast because
> it's called very frequently. Now, it needs to look up page_cgroup
> and its memcg....this is slow.
>
> This patch adds a global variable to check "a memcg is moving or not".

s/a memcg/any memcg/

> By this, the caller doesn't need to visit page_cgroup and memcg.

s/By/With/

> Here is a test result. A test program makes page faults onto a file
> mapped MAP_SHARED, makes each page's page_mapcount(page) > 1, frees the
> range by madvise(), and faults the pages in again.  This program causes
> 26214400 page faults onto a 1G file and shows the cost of
> mem_cgroup_begin_update_page_stat().

Out of curiosity, what is the performance of the mmap program before
this series?

> Before this patch for mem_cgroup_begin_update_page_stat()
> [kamezawa@bluextal test]$ time ./mmap 1G
>
> real    0m21.765s
> user    0m5.999s
> sys     0m15.434s
>
>    27.46%     mmap  mmap               [.] reader
>    21.15%     mmap  [kernel.kallsyms]  [k] page_fault
>     9.17%     mmap  [kernel.kallsyms]  [k] filemap_fault
>     2.96%     mmap  [kernel.kallsyms]  [k] __do_fault
>     2.83%     mmap  [kernel.kallsyms]  [k] __mem_cgroup_begin_update_page_stat
>
> After this patch
> [root@bluextal test]# time ./mmap 1G
>
> real    0m21.373s
> user    0m6.113s
> sys     0m15.016s
>
> In the usual path, calls to __mem_cgroup_begin_update_page_stat() go away.
>
> Note: we may be able to remove this optimization in future if
>      we can get pointer to memcg directly from struct page.
>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Acked-by: Greg Thelen <gthelen@google.com>


* Re: [PATCH 4/6 v4] memcg: use new logic for page stat accounting
  2012-02-14  7:22   ` Greg Thelen
@ 2012-02-14  8:43     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 16+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-02-14  8:43 UTC (permalink / raw)
  To: Greg Thelen
  Cc: linux-mm, linux-kernel, hannes, Michal Hocko, akpm, Ying Han,
	Hugh Dickins

On Mon, 13 Feb 2012 23:22:22 -0800
Greg Thelen <gthelen@google.com> wrote:

> On Mon, Feb 13, 2012 at 7:14 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > From ad2905362ef58a44d96a325193ab384739418050 Mon Sep 17 00:00:00 2001
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Date: Thu, 2 Feb 2012 11:49:59 +0900
> > Subject: [PATCH 4/6] memcg: use new logic for page stat accounting.
> >
> > Now, page-stat-per-memcg is recorded into per page_cgroup flag by
> > duplicating page's status into the flag. The reason is that memcg
> > has a feature to move a page from a group to another group and we
> > have race between "move" and "page stat accounting",
> >
> > Under current logic, assume CPU-A and CPU-B. CPU-A does "move"
> > and CPU-B does "page stat accounting".
> >
> > When CPU-A goes 1st,
> >
> >            CPU-A                           CPU-B
> >                                    update "struct page" info.
> >    move_lock_mem_cgroup(memcg)
> >    see flags
> 
> pc->flags?
> 
yes.


> >    copy page stat to new group
> >    overwrite pc->mem_cgroup.
> >    move_unlock_mem_cgroup(memcg)
> >                                    move_lock_mem_cgroup(mem)
> >                                    set pc->flags
> >                                    update page stat accounting
> >                                    move_unlock_mem_cgroup(mem)
> >
> > stat accounting is guarded by move_lock_mem_cgroup() and "move"
> > logic (CPU-A) doesn't see changes in "struct page" information.
> >
> > But it's costly to have the same information both in 'struct page' and
> > 'struct page_cgroup'. And, there is a potential problem.
> >
> > For example, assume we have PG_dirty accounting in memcg.
> > PG_..is a flag for struct page.
> > PCG_ is a flag for struct page_cgroup.
> > (This is just an example. The same problem can be found in any
> >  kind of page stat accounting.)
> >
> >          CPU-A                               CPU-B
> >      TestSet PG_dirty
> >      (delay)                        TestClear PG_dirty_
> 
> PG_dirty
> 
> >                                     if (TestClear(PCG_dirty))
> >                                          memcg->nr_dirty--
> >      if (TestSet(PCG_dirty))
> >          memcg->nr_dirty++
> >
> 
> > @@ -141,6 +141,31 @@ static inline bool mem_cgroup_disabled(void)
> >        return false;
> >  }
> >
> > +void __mem_cgroup_begin_update_page_stat(struct page *page,
> > +                                       bool *lock, unsigned long *flags);
> > +
> > +static inline void mem_cgroup_begin_update_page_stat(struct page *page,
> > +                                       bool *lock, unsigned long *flags)
> > +{
> > +       if (mem_cgroup_disabled())
> > +               return;
> > +       rcu_read_lock();
> > +       *lock = false;
> 
> This seems like a strange place to set *lock=false.  I think it's
> clearer if __mem_cgroup_begin_update_page_stat() is the only routine
> that sets or clears *lock.  But I do see that in patch 6/6 'memcg: fix
> performance of mem_cgroup_begin_update_page_stat()' this position is
> required.
> 

Ah, yes. Hmm, it would be better to move this into the body of the function.



> > +       return __mem_cgroup_begin_update_page_stat(page, lock, flags);
> > +}
> 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index ecf8856..30afea5 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1877,32 +1877,54 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
> >  * If there is, we take a lock.
> >  */
> >
> > +void __mem_cgroup_begin_update_page_stat(struct page *page,
> > +                               bool *lock, unsigned long *flags)
> > +{
> > +       struct mem_cgroup *memcg;
> > +       struct page_cgroup *pc;
> > +
> > +       pc = lookup_page_cgroup(page);
> > +again:
> > +       memcg = pc->mem_cgroup;
> > +       if (unlikely(!memcg || !PageCgroupUsed(pc)))
> > +               return;
> > +       if (!mem_cgroup_stealed(memcg))
> > +               return;
> > +
> > +       move_lock_mem_cgroup(memcg, flags);
> > +       if (memcg != pc->mem_cgroup || !PageCgroupUsed(pc)) {
> > +               move_unlock_mem_cgroup(memcg, flags);
> > +               goto again;
> > +       }
> > +       *lock = true;
> > +}
> > +
> > +void __mem_cgroup_end_update_page_stat(struct page *page,
> > +                               bool *lock, unsigned long *flags)
> 
> 'lock' looks like an unused parameter.  If so, then remove it.
> 

Ok.

> > +{
> > +       struct page_cgroup *pc = lookup_page_cgroup(page);
> > +
> > +       /*
> > +        * It's guaranteed that pc->mem_cgroup never changes while
> > +        * lock is held
> 
> Please continue comment describing what provides this guarantee.  I
> assume it is because rcu_read_lock() is held by
> mem_cgroup_begin_update_page_stat().  Maybe it's best to just make a
> small reference to the locking protocol description in
> mem_cgroup_start_move().
> 
Ok, I will update this.


> > +        */
> > +       move_unlock_mem_cgroup(pc->mem_cgroup, flags);
> > +}
> > +
> > +
> 
> I think it would be useful to add a small comment here declaring that
> all callers of this routine must be in a
> mem_cgroup_begin_update_page_stat(), mem_cgroup_end_update_page_stat()
> critical section to keep pc->mem_cgroup stable.
> 

Sure, will do.

Thank you for review.
-Kame


> >  void mem_cgroup_update_page_stat(struct page *page,
> >                                 enum mem_cgroup_page_stat_item idx, int val)
> >  {
> 



* Re: [PATCH 2/6 v4] memcg: simplify move_account() check
  2012-02-14  7:21   ` Greg Thelen
@ 2012-02-14  8:44     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 16+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-02-14  8:44 UTC (permalink / raw)
  To: Greg Thelen
  Cc: linux-mm, linux-kernel, hannes, Michal Hocko, akpm, Ying Han,
	Hugh Dickins

On Mon, 13 Feb 2012 23:21:34 -0800
Greg Thelen <gthelen@google.com> wrote:

> On Mon, Feb 13, 2012 at 7:07 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > From 9cdb3b63dc8d08cc2220c54c80438c13433a0d12 Mon Sep 17 00:00:00 2001
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Date: Thu, 2 Feb 2012 10:02:39 +0900
> > Subject: [PATCH 2/6] memcg: simplify move_account() check.
> >
> > In memcg, to avoid taking a lock with IRQs disabled when accessing
> > page_cgroup, a flag + rcu_read_lock() scheme is used. It works as follows:
> >
> >     CPU-A                     CPU-B
> >                             rcu_read_lock()
> >    set flag
> >                             if(flag is set)
> >                                   take heavy lock
> >                             do job.
> >    synchronize_rcu()        rcu_read_unlock()
> 
> I assume that CPU-A will take heavy lock after synchronize_rcu() when
> updating variables read by CPU-B.
> 
Ah, yes. I should have written that.

> >  memcontrol.c |   65 ++++++++++++++++++++++-------------------------------------
> >  1 file changed, 25 insertions(+), 40 deletions(-)
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Acked-by: Greg Thelen <gthelen@google.com>
> 

Thank you!

-Kame



* Re: [PATCH 6/6 v4] memcg: fix performance of mem_cgroup_begin_update_page_stat()
  2012-02-14  7:22   ` Greg Thelen
@ 2012-02-14 11:08     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 16+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-02-14 11:08 UTC (permalink / raw)
  To: Greg Thelen
  Cc: linux-mm, linux-kernel, hannes, Michal Hocko, akpm, Ying Han,
	Hugh Dickins

On Mon, 13 Feb 2012 23:22:50 -0800
Greg Thelen <gthelen@google.com> wrote:

> On Mon, Feb 13, 2012 at 7:16 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > From 3377fd7b6e23a5d2a368c078eae27e2b49c4f4aa Mon Sep 17 00:00:00 2001
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Date: Mon, 6 Feb 2012 12:14:47 +0900
> > Subject: [PATCH 6/6] memcg: fix performance of mem_cgroup_begin_update_page_stat()
> >
> > mem_cgroup_begin_update_page_stat() should be very fast because
> > it's called very frequently. Now, it needs to look up page_cgroup
> > and its memcg....this is slow.
> >
> > This patch adds a global variable to check "a memcg is moving or not".
> 
> s/a memcg/any memcg/
> 
yes.

> > By this, the caller doesn't need to visit page_cgroup and memcg.
> 
> s/By/With/
> 
ok.

> > Here is a test result. A test program makes page faults onto a file
> > mapped MAP_SHARED, makes each page's page_mapcount(page) > 1, frees the
> > range by madvise(), and faults the pages in again.  This program causes
> > 26214400 page faults onto a 1G file and shows the cost of
> > mem_cgroup_begin_update_page_stat().
> 
> Out of curiosity, what is the performance of the mmap program before
> this series?
> 

Scores of 3 runs under linux-next.
==
[root@bluextal test]# time ./mmap 1G

real    0m21.041s
user    0m6.146s
sys     0m14.625s
[root@bluextal test]# time ./mmap 1G

real    0m21.063s
user    0m6.019s
sys     0m14.776s
[root@bluextal test]# time ./mmap 1G

real    0m21.178s
user    0m6.000s
sys     0m14.849s
==

My program is attached. It checks the cost of updating FILE_MAPPED.
I guess the sys time scores' error will be around 0.2-0.3 sec.

Thanks,
-Kame
==
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <sys/mman.h>
#include <fcntl.h>

/* Touch every page of the mapping 100 times; madvise(MADV_DONTNEED)
 * drops the pages after each pass so that they fault in again. */
void reader(int fd, int size)
{
	int i, off, x = 0;	/* x just forces the reads */
	char *addr;

	addr = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
	for (i = 0; i < 100; i++) {
		for(off = 0; off < size; off += 4096) {
			x += *(addr + off);
		}
		madvise(addr, size, MADV_DONTNEED);
	}
}

int main(int argc, char *argv[])
{
	int fd;
	char *addr;
	unsigned long size;
	struct stat statbuf;

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("cannot open file");
		return 1;
	}

	if (fstat(fd, &statbuf)) {
		perror("fstat failed");
		return 1;
	}
	size = statbuf.st_size;
	/* keep a 2nd, mlocked mapping so each page's page_mapcount() > 1 */
	addr = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
	mlock(addr, size);
	reader(fd, size);
}


