From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754197Ab0DTJX3 (ORCPT <rfc822;w@1wt.eu>);
	Tue, 20 Apr 2010 05:23:29 -0400
Received: from fgwmail5.fujitsu.co.jp ([192.51.44.35]:49798 "EHLO
	fgwmail5.fujitsu.co.jp" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1754153Ab0DTJX0 (ORCPT
	<rfc822;linux-kernel@vger.kernel.org>);
	Tue, 20 Apr 2010 05:23:26 -0400
X-SecurityPolicyCheck-FJ: OK by FujitsuOutboundMailChecker v1.3.1
Date: Tue, 20 Apr 2010 18:19:25 +0900
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
To: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: LKML <linux-kernel@vger.kernel.org>, linux-mm <linux-mm@kvack.org>,
       Mel Gorman <mel@csn.ul.ie>, Rik van Riel <riel@redhat.com>,
       Minchan Kim <minchan.kim@gmail.com>,
       Balbir Singh <balbir@linux.vnet.ibm.com>,
       KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
       Christoph Lameter <cl@linux-foundation.org>,
       Andrea Arcangeli <aarcange@redhat.com>,
       Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [RFC][BUGFIX][PATCH 2/2] memcg: fix file mapped underflow at
 migration (v3)
Message-Id: <20100420181925.ed881e7a.kamezawa.hiroyu@jp.fujitsu.com>
In-Reply-To: <20100420132050.3477a717.nishimura@mxp.nes.nec.co.jp>
References: <20100413134207.f12cdc9c.nishimura@mxp.nes.nec.co.jp>
	<20100415120516.3891ce46.kamezawa.hiroyu@jp.fujitsu.com>
	<20100415120652.c577846f.kamezawa.hiroyu@jp.fujitsu.com>
	<20100416193143.5807d114.kamezawa.hiroyu@jp.fujitsu.com>
	<20100419124225.91f3110b.nishimura@mxp.nes.nec.co.jp>
	<20100419131817.f263d93c.kamezawa.hiroyu@jp.fujitsu.com>
	<20100419170701.3864992e.nishimura@mxp.nes.nec.co.jp>
	<20100419172629.dbf65e18.kamezawa.hiroyu@jp.fujitsu.com>
	<20100420132050.3477a717.nishimura@mxp.nes.nec.co.jp>
Organization: FUJITSU Co. LTD.
X-Mailer: Sylpheed 3.0.2 (GTK+ 2.10.14; i686-pc-mingw32)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, 20 Apr 2010 13:20:50 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> > It will have no meanings for migrating
> > file caches, but it may have some meanings for easy debugging. 
> > I think "mark it always but it's used only for anonymous page" is reasonable
> > (if it causes no bug.)
> > 
> Anyway, I don't have any strong objection.
> It's all right for me as long as it is well documented or commented.
> 
Okay, before posting as v4, here is draft version.

I'll check again.
==

At page migration, the new page is unlocked before calling end_migration().
This is mis-understanding with page-migration code of memcg.
(it worked but...)
This has small race and we need to fix this race. By this,
FILE_MAPPED of migrated file cache is not properly updated, now.

This patch is for fixing the race by changing algorithm.

At migrating mapped file, events happens in following sequence.

 1. allocate a new page.
 2. get memcg of an old page.
 3. charge ageinst a new page before migration. But at this point,
    no changes to new page's page_cgroup, no commit-charge.
 4. page migration replaces radix-tree, old-page and new-page.
 5. page migration remaps the new page if the old page was mapped.
 6. Here, the new page is unlocked.
 7. memcg commits the charge for newpage, Mark page cgroup as USED.

Because "commit" happens after page-remap, we can count FILE_MAPPED
at "5", we can lose file_mapped accounting information within this
small race. FILE_MAPPED is updated only when mapcount changes 0->1.
So, if we don't catch this 0->1 event, we can underflow FILE_MAPPED
at 1->0 event.

We may be able to avoid underflow by some small technique but
we should catch mapcount 0->1 event. To catch this, we have to
make page_cgroup of new page as "USED".

BTW, historically, above implemntation comes from migration-failure
of anonymous page. Because we charge both of old page and new page
with mapcount=0, we can't catch
  - the page is really freed before remap.
  - migration fails but it's freed before remap
or .....corner cases.

For fixing all, this changes parepare/end migration.

New migration sequence with memcg is:

 1. allocate a new page.
 2. mark PageCgroupMigration to the old page.
 3. charge against a new page onto the old page's memcg. (here, new page's pc
    is marked as PageCgroupUsed.)
 4. mark PageCgroupMigration to the old page.

If page_cgroup is marked as PageCgroupMigration, it's uncahrged until
it's cleared.

 5. page migration replaces radix-tree, page table, etc...
 6. At remapping, new page's page_cgroup is now makrked as "USED"
    We can catch 0->1 event and FILE_MAPPED will be properly updated.
    And we can catch SWAPOUT event after unlock this and freeing this
    page by unmap() can be caught.

 7. Clear PageCgroupMigration of the old page.

By this:
At migration success of Anon:
 - The new page is properly charged. If not-mapped after remap,
   uncharge() will be called.
 - The file cache is properly charged. FILE_MAPPED event can be caught.

At migration failure of Anon:
 - The old page stays as charged. If not mapped after remap,
   uncharge() will called. The corner case is SWAPOUT. But, while migraion,
   it's locked. So, we have no race with it.

Then, for what MIGRATION flag is ?
  Without it, at migration failure, we may have to charge old page again
  because it may be fully unmapped. "charge" means that we have to dive into
  memory reclaim or something complated. So, it's better to avoid
  charge it again. Before this patch, __commit_charge() was working for
  both of the old/new page and fixed up all. But this technique has some
  racy condtion around FILE_MAPPED and SWAPOUT etc...
  Now, the kernel use MIGRATION flag and don't uncharge old page until
  the end of migration.


I hope this change will make memcg's page migration much simpler.
This page migration has caused several troubles. Worth to add
a flag for simplification.

Changelog: 2010/04/20
 - fixed SWAPOUT case.
 - added texts for explanation.
 - removed MIGRAION flag onto new page.

Changelog: 2010/04/15
 - updated against  mm-of-the-moment snapshot 2010-04-15-14-42
   + Nishimura's fix. memcg-fix-prepare-migration-fix.patch
 - fixed some typos.
 - handle migration failure of anon page.

Changelog: 2010/04/14
 - updated onto the latest mmotm + page-compaction, etc...
 - fixed __try_charge() bypass case.
 - use CHARGE_TYPE_FORCE for uncharging an unused page.

Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h  |    6 +
 include/linux/page_cgroup.h |    5 +
 mm/memcontrol.c             |  145 +++++++++++++++++++++++++++++++-------------
 mm/migrate.c                |    2 
 4 files changed, 115 insertions(+), 43 deletions(-)

Index: linux-2.6.34-rc4-mm1/mm/memcontrol.c
===================================================================
--- linux-2.6.34-rc4-mm1.orig/mm/memcontrol.c
+++ linux-2.6.34-rc4-mm1/mm/memcontrol.c
@@ -2278,7 +2278,8 @@ __mem_cgroup_uncharge_common(struct page
 	switch (ctype) {
 	case MEM_CGROUP_CHARGE_TYPE_MAPPED:
 	case MEM_CGROUP_CHARGE_TYPE_DROP:
-		if (page_mapped(page))
+		/* See mem_cgroup_prepare_migration() */
+		if (page_mapped(page) || PageCgroupMigration(pc))
 			goto unlock_out;
 		break;
 	case MEM_CGROUP_CHARGE_TYPE_SWAPOUT:
@@ -2501,10 +2502,12 @@ static inline int mem_cgroup_move_swap_a
  * Before starting migration, account PAGE_SIZE to mem_cgroup that the old
  * page belongs to.
  */
-int mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr)
+int mem_cgroup_prepare_migration(struct page *page,
+	struct page *newpage, struct mem_cgroup **ptr)
 {
 	struct page_cgroup *pc;
 	struct mem_cgroup *mem = NULL;
+	enum charge_type ctype;
 	int ret = 0;
 
 	if (mem_cgroup_disabled())
@@ -2515,69 +2518,131 @@ int mem_cgroup_prepare_migration(struct 
 	if (PageCgroupUsed(pc)) {
 		mem = pc->mem_cgroup;
 		css_get(&mem->css);
+		/*
+		 * At migrationg an anonymous page, its mapcount goes down
+		 * to 0 and uncharge() will be called. But, even if it's fully
+		 * unmapped, migration may fail and this page has to be
+		 * charged again. We set MIGRATION flag here and delay uncharge
+		 * until end_migration() is called
+		 *
+		 * Corner Case Thinking
+		 * A)
+		 * When the old page was mapped as Anon and it's unmap-and-freed
+		 * while migration was ongoing, it's not uncharged until
+		 * end_migration(). Both of old and new page will be uncharged
+		 * at end_migration() because it's not mapped and not SwapCache.
+		 *
+		 * B)
+		 * When the old page was mapped but migraion fails, the kernel
+		 * remap it. The charge for it is kept by MIGRATION flag even
+		 * if mapcount goes down to 0, we can do remap successfully
+		 * without charging it again.
+		 * If the kernel doesn't remap it because it's unmapped,
+		 * we can check it at end_migration(), no new charge happens
+		 * at end_migration().
+		 *
+		 * C)
+		 * The "old" page is under lock_page() until the end of
+		 * migration, so, the old page itself will not be swapped-out.
+		 * But the new page can be.
+		 * etc...
+		 */
+		if (PageAnon(page))
+			SetPageCgroupMigration(pc);
 	}
 	unlock_page_cgroup(pc);
+	/*
+	 * If the page is not charged at this point,
+	 * we return here.
+	 */
+	if (!mem)
+		return 0;
 
 	*ptr = mem;
-	if (mem) {
-		ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, ptr, false);
-		css_put(&mem->css);
+	ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, ptr, false);
+	css_put(&mem->css);/* drop extra refcnt */
+	if (ret || *ptr == NULL) {
+		if (PageAnon(page)) {
+			lock_page_cgroup(pc);
+			ClearPageCgroupMigration(pc);
+			unlock_page_cgroup(pc);
+			/*
+		 	 * The old page may be fully unmapped while we kept it.
+		 	 */
+			mem_cgroup_uncharge_page(page);
+		}
+		return -ENOMEM;
 	}
+	/*
+ 	 * We charge new page before it's mapped. So, even if unlock_page()
+ 	 * is called far before end_migration, we can catch all events on
+ 	 * this new page. In the case new page is migrated but not remapped,
+ 	 * new page's mapcount will be finally 0 and we call uncharge in
+ 	 * end_migration().
+  	 */
+	pc = lookup_page_cgroup(newpage);
+	if (PageAnon(page))
+		ctype = MEM_CGROUP_CHARGE_TYPE_MAPPED;
+	else if (page_is_file_cache(page))
+		ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
+	else
+		ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;
+	__mem_cgroup_commit_charge(mem, pc, ctype);
 	return ret;
 }
 
 /* remove redundant charge if migration failed*/
 void mem_cgroup_end_migration(struct mem_cgroup *mem,
-		struct page *oldpage, struct page *newpage)
+	struct page *oldpage, struct page *newpage)
 {
-	struct page *target, *unused;
+	struct page *used, *unused;
 	struct page_cgroup *pc;
-	enum charge_type ctype;
 
 	if (!mem)
 		return;
+	/* blocks rmdir() */
 	cgroup_exclude_rmdir(&mem->css);
 	/* at migration success, oldpage->mapping is NULL. */
 	if (oldpage->mapping) {
-		target = oldpage;
-		unused = NULL;
+		used = oldpage;
+		unused = newpage;
 	} else {
-		target = newpage;
+		used = newpage;
 		unused = oldpage;
 	}
-
-	if (PageAnon(target))
-		ctype = MEM_CGROUP_CHARGE_TYPE_MAPPED;
-	else if (page_is_file_cache(target))
-		ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
-	else
-		ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;
-
-	/* unused page is not on radix-tree now. */
-	if (unused)
-		__mem_cgroup_uncharge_common(unused, ctype);
-
-	pc = lookup_page_cgroup(target);
 	/*
-	 * __mem_cgroup_commit_charge() check PCG_USED bit of page_cgroup.
-	 * So, double-counting is effectively avoided.
+	 * We disallowed uncharge of pages under migration because mapcount
+	 * of the page goes down to zero, temporarly.
+	 * Clear the flag and check the page should be charged.
 	 */
-	__mem_cgroup_commit_charge(mem, pc, ctype);
-
+	pc = lookup_page_cgroup(unused);
+	/* This flag itself is not racy, so, check it before lock */
+	if (PageCgroupMigration(pc)) {
+		lock_page_cgroup(pc);
+		ClearPageCgroupMigration(pc);
+		unlock_page_cgroup(pc);
+	}
+	__mem_cgroup_uncharge_common(unused, MEM_CGROUP_CHARGE_TYPE_FORCE);
+
+	pc = lookup_page_cgroup(used);
+	if (PageCgroupMigration(pc)) {
+		lock_page_cgroup(pc);
+		ClearPageCgroupMigration(pc);
+		unlock_page_cgroup(pc);
+	}
 	/*
-	 * Both of oldpage and newpage are still under lock_page().
-	 * Then, we don't have to care about race in radix-tree.
-	 * But we have to be careful that this page is unmapped or not.
-	 *
-	 * There is a case for !page_mapped(). At the start of
-	 * migration, oldpage was mapped. But now, it's zapped.
-	 * But we know *target* page is not freed/reused under us.
-	 * mem_cgroup_uncharge_page() does all necessary checks.
-	 */
-	if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
-		mem_cgroup_uncharge_page(target);
+	 * If a page is a file cache, radix-tree replacement is very atomic
+ 	 * and we can skip this check. When it was an Anon page, its mapcount
+ 	 * goes down to 0. But because we added MIGRATION flage, it's not
+ 	 * uncharged yet. There are several case but page->mapcount check
+ 	 * and USED bit check in mem_cgroup_uncharge_page() will do enough
+ 	 * check. (see prepare_charge() also)
+ 	 */
+	if (PageAnon(used))
+		mem_cgroup_uncharge_page(used);
 	/*
-	 * At migration, we may charge account against cgroup which has no tasks
+	 * At migration, we may charge account against cgroup which has no
+	 * tasks.
 	 * So, rmdir()->pre_destroy() can be called while we do this charge.
 	 * In that case, we need to call pre_destroy() again. check it here.
 	 */
Index: linux-2.6.34-rc4-mm1/mm/migrate.c
===================================================================
--- linux-2.6.34-rc4-mm1.orig/mm/migrate.c
+++ linux-2.6.34-rc4-mm1/mm/migrate.c
@@ -581,7 +581,7 @@ static int unmap_and_move(new_page_t get
 	}
 
 	/* charge against new page */
-	charge = mem_cgroup_prepare_migration(page, &mem);
+	charge = mem_cgroup_prepare_migration(page, newpage, &mem);
 	if (charge == -ENOMEM) {
 		rc = -ENOMEM;
 		goto unlock;
Index: linux-2.6.34-rc4-mm1/include/linux/memcontrol.h
===================================================================
--- linux-2.6.34-rc4-mm1.orig/include/linux/memcontrol.h
+++ linux-2.6.34-rc4-mm1/include/linux/memcontrol.h
@@ -89,7 +89,8 @@ int mm_match_cgroup(const struct mm_stru
 extern struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *mem);
 
 extern int
-mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr);
+mem_cgroup_prepare_migration(struct page *page,
+	struct page *newpage, struct mem_cgroup **ptr);
 extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
 	struct page *oldpage, struct page *newpage);
 
@@ -228,7 +229,8 @@ static inline struct cgroup_subsys_state
 }
 
 static inline int
-mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr)
+mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
+	struct mem_cgroup **ptr)
 {
 	return 0;
 }
Index: linux-2.6.34-rc4-mm1/include/linux/page_cgroup.h
===================================================================
--- linux-2.6.34-rc4-mm1.orig/include/linux/page_cgroup.h
+++ linux-2.6.34-rc4-mm1/include/linux/page_cgroup.h
@@ -40,6 +40,7 @@ enum {
 	PCG_USED, /* this object is in use. */
 	PCG_ACCT_LRU, /* page has been accounted for */
 	PCG_FILE_MAPPED, /* page is accounted as "mapped" */
+	PCG_MIGRATION, /* under page migration */
 };
 
 #define TESTPCGFLAG(uname, lname)			\
@@ -79,6 +80,10 @@ SETPCGFLAG(FileMapped, FILE_MAPPED)
 CLEARPCGFLAG(FileMapped, FILE_MAPPED)
 TESTPCGFLAG(FileMapped, FILE_MAPPED)
 
+SETPCGFLAG(Migration, MIGRATION)
+CLEARPCGFLAG(Migration, MIGRATION)
+TESTPCGFLAG(Migration, MIGRATION)
+
 static inline int page_cgroup_nid(struct page_cgroup *pc)
 {
 	return page_to_nid(pc->page);