* [PATCH 0/6] memcg update v6 (for review and discuss)
@ 2008-10-01  7:52 KAMEZAWA Hiroyuki
  2008-10-01  7:55 ` [PATCH 1/6] atomic page_cgroup flags KAMEZAWA Hiroyuki
                   ` (6 more replies)
  0 siblings, 7 replies; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-10-01  7:52 UTC (permalink / raw)
  To: linux-mm; +Cc: LKML, balbir, nishimura

This series is an update of v5.

Four easy patches have already been posted as the ready-to-go series.

This is the set that still needs more discussion.

It includes the following 6 patches (reduced from v5).
The whole series has been reordered.

[1/6] make page_cgroup->flags to be atomic.
[2/6] allocate all page_cgroup at boot.
[3/6] rewrite charge path by charge/commit/cancel
[4/6] new force_empty and move_account
[5/6] lazy lru free
[6/6] lazy lru add.

Patches [3/6] and [4/6] are totally rewritten.
The races in patch [6/6] are fixed... I think.

Patches [1-4] look big, but there are no complicated operations.
Patches [5-6] are more race-prone; checking by regression tests is necessary.
(Of course, I ran some.)

If the ready-to-go series is merged, patches 1 and 2 are next.

Thanks,
-Kame



* [PATCH 1/6] atomic page_cgroup flags
  2008-10-01  7:52 [PATCH 0/6] memcg update v6 (for review and discuss) KAMEZAWA Hiroyuki
@ 2008-10-01  7:55 ` KAMEZAWA Hiroyuki
  2008-10-06  7:42   ` Balbir Singh
  2008-10-01  7:56 ` [PATCH 2/6] memcg: allocate page_cgroup at boot KAMEZAWA Hiroyuki
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-10-01  7:55 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, LKML, balbir, nishimura

This patch makes page_cgroup->flags use atomic bit operations and defines
functions (and macros) to access them.

This atomic handling of the flags is necessary before further modifications to
the memory resource controller. Most of the flags in this patch are for LRU
management and are modified under mz->lru_lock, but we will soon add other
flags that are not LRU-related (lock_page_cgroup() will use a LOCK bit on
page_cgroup->flags), so we use the atomic versions here.
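
As an illustration of what the macros below generate, a caller can use the
accessors like this (a hypothetical helper, not part of the patch):

	/* PageCgroupActive()/SetPageCgroupActive()/... come from the macros below */
	static void example_mark_active(struct page_cgroup *pc)
	{
		if (!PageCgroupActive(pc))		/* test_bit(PCG_ACTIVE, &pc->flags) */
			SetPageCgroupActive(pc);	/* set_bit(PCG_ACTIVE, &pc->flags) */
		ClearPageCgroupUnevictable(pc);		/* clear_bit(PCG_UNEVICTABLE, &pc->flags) */
	}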

 
Changelog: (v5) -> (v6)
 - no changes.

Changelog: (v4) -> (v5)
 - removed unused operations.
 - adjusted to new ctype MEM_CGROUP_CHARGE_TYPE_SHMEM

Changelog: (v3) -> (v4)
 - no changes.

Changelog:  (v2) -> (v3)
 - renamed macros and flags to use longer names.
 - added comments.
 - added "default bit set" for File, Shmem, Anon.

Changelog:  (preview) -> (v1):
 - patch ordering is changed.
 - Added macro for defining functions for Test/Set/Clear bit.
 - made the names of flags shorter.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 mm/memcontrol.c |  122 +++++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 82 insertions(+), 40 deletions(-)

Index: mmotm-2.6.27-rc7+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc7+/mm/memcontrol.c
@@ -157,12 +157,46 @@ struct page_cgroup {
 	struct list_head lru;		/* per cgroup LRU list */
 	struct page *page;
 	struct mem_cgroup *mem_cgroup;
-	int flags;
+	unsigned long flags;
 };
-#define PAGE_CGROUP_FLAG_CACHE	   (0x1)	/* charged as cache */
-#define PAGE_CGROUP_FLAG_ACTIVE    (0x2)	/* page is active in this cgroup */
-#define PAGE_CGROUP_FLAG_FILE	   (0x4)	/* page is file system backed */
-#define PAGE_CGROUP_FLAG_UNEVICTABLE (0x8)	/* page is unevictableable */
+
+enum {
+	/* flags for mem_cgroup */
+	PCG_CACHE, /* charged as cache */
+	/* flags for LRU placement */
+	PCG_ACTIVE, /* page is active in this cgroup */
+	PCG_FILE, /* page is file system backed */
+	PCG_UNEVICTABLE, /* page is unevictableable */
+};
+
+#define TESTPCGFLAG(uname, lname)			\
+static inline int PageCgroup##uname(struct page_cgroup *pc)	\
+	{ return test_bit(PCG_##lname, &pc->flags); }
+
+#define SETPCGFLAG(uname, lname)			\
+static inline void SetPageCgroup##uname(struct page_cgroup *pc)\
+	{ set_bit(PCG_##lname, &pc->flags);  }
+
+#define CLEARPCGFLAG(uname, lname)			\
+static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
+	{ clear_bit(PCG_##lname, &pc->flags);  }
+
+
+/* Cache flag is set only once (at allocation) */
+TESTPCGFLAG(Cache, CACHE)
+
+/* LRU management flags (from global-lru definition) */
+TESTPCGFLAG(File, FILE)
+SETPCGFLAG(File, FILE)
+CLEARPCGFLAG(File, FILE)
+
+TESTPCGFLAG(Active, ACTIVE)
+SETPCGFLAG(Active, ACTIVE)
+CLEARPCGFLAG(Active, ACTIVE)
+
+TESTPCGFLAG(Unevictable, UNEVICTABLE)
+SETPCGFLAG(Unevictable, UNEVICTABLE)
+CLEARPCGFLAG(Unevictable, UNEVICTABLE)
 
 static int page_cgroup_nid(struct page_cgroup *pc)
 {
@@ -177,15 +211,25 @@ static enum zone_type page_cgroup_zid(st
 enum charge_type {
 	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
 	MEM_CGROUP_CHARGE_TYPE_MAPPED,
-	MEM_CGROUP_CHARGE_TYPE_FORCE,	/* used by force_empty */
 	MEM_CGROUP_CHARGE_TYPE_SHMEM,	/* used by page migration of shmem */
+	MEM_CGROUP_CHARGE_TYPE_FORCE,	/* used by force_empty */
+	NR_CHARGE_TYPE,
+};
+
+static const unsigned long
+pcg_default_flags[NR_CHARGE_TYPE] = {
+	((1 << PCG_CACHE) | (1 << PCG_FILE)),
+	((1 << PCG_ACTIVE)),
+	((1 << PCG_ACTIVE) | (1 << PCG_CACHE)),
+	0,
 };
 
 /*
  * Always modified under lru lock. Then, not necessary to preempt_disable()
  */
-static void mem_cgroup_charge_statistics(struct mem_cgroup *mem, int flags,
-					bool charge)
+static void mem_cgroup_charge_statistics(struct mem_cgroup *mem,
+					 struct page_cgroup *pc,
+					 bool charge)
 {
 	int val = (charge)? 1 : -1;
 	struct mem_cgroup_stat *stat = &mem->stat;
@@ -194,7 +238,7 @@ static void mem_cgroup_charge_statistics
 	VM_BUG_ON(!irqs_disabled());
 
 	cpustat = &stat->cpustat[smp_processor_id()];
-	if (flags & PAGE_CGROUP_FLAG_CACHE)
+	if (PageCgroupCache(pc))
 		__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_CACHE, val);
 	else
 		__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_RSS, val);
@@ -295,18 +339,18 @@ static void __mem_cgroup_remove_list(str
 {
 	int lru = LRU_BASE;
 
-	if (pc->flags & PAGE_CGROUP_FLAG_UNEVICTABLE)
+	if (PageCgroupUnevictable(pc))
 		lru = LRU_UNEVICTABLE;
 	else {
-		if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
+		if (PageCgroupActive(pc))
 			lru += LRU_ACTIVE;
-		if (pc->flags & PAGE_CGROUP_FLAG_FILE)
+		if (PageCgroupFile(pc))
 			lru += LRU_FILE;
 	}
 
 	MEM_CGROUP_ZSTAT(mz, lru) -= 1;
 
-	mem_cgroup_charge_statistics(pc->mem_cgroup, pc->flags, false);
+	mem_cgroup_charge_statistics(pc->mem_cgroup, pc, false);
 	list_del(&pc->lru);
 }
 
@@ -315,27 +359,27 @@ static void __mem_cgroup_add_list(struct
 {
 	int lru = LRU_BASE;
 
-	if (pc->flags & PAGE_CGROUP_FLAG_UNEVICTABLE)
+	if (PageCgroupUnevictable(pc))
 		lru = LRU_UNEVICTABLE;
 	else {
-		if (pc->flags & PAGE_CGROUP_FLAG_ACTIVE)
+		if (PageCgroupActive(pc))
 			lru += LRU_ACTIVE;
-		if (pc->flags & PAGE_CGROUP_FLAG_FILE)
+		if (PageCgroupFile(pc))
 			lru += LRU_FILE;
 	}
 
 	MEM_CGROUP_ZSTAT(mz, lru) += 1;
 	list_add(&pc->lru, &mz->lists[lru]);
 
-	mem_cgroup_charge_statistics(pc->mem_cgroup, pc->flags, true);
+	mem_cgroup_charge_statistics(pc->mem_cgroup, pc, true);
 }
 
 static void __mem_cgroup_move_lists(struct page_cgroup *pc, enum lru_list lru)
 {
 	struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
-	int active    = pc->flags & PAGE_CGROUP_FLAG_ACTIVE;
-	int file      = pc->flags & PAGE_CGROUP_FLAG_FILE;
-	int unevictable = pc->flags & PAGE_CGROUP_FLAG_UNEVICTABLE;
+	int active    = PageCgroupActive(pc);
+	int file      = PageCgroupFile(pc);
+	int unevictable = PageCgroupUnevictable(pc);
 	enum lru_list from = unevictable ? LRU_UNEVICTABLE :
 				(LRU_FILE * !!file + !!active);
 
@@ -343,16 +387,20 @@ static void __mem_cgroup_move_lists(stru
 		return;
 
 	MEM_CGROUP_ZSTAT(mz, from) -= 1;
-
+	/*
+	 * However this is done under mz->lru_lock, another flags, which
+	 * are not related to LRU, will be modified from out-of-lock.
+	 * We have to use atomic set/clear flags.
+	 */
 	if (is_unevictable_lru(lru)) {
-		pc->flags &= ~PAGE_CGROUP_FLAG_ACTIVE;
-		pc->flags |= PAGE_CGROUP_FLAG_UNEVICTABLE;
+		ClearPageCgroupActive(pc);
+		SetPageCgroupUnevictable(pc);
 	} else {
 		if (is_active_lru(lru))
-			pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
+			SetPageCgroupActive(pc);
 		else
-			pc->flags &= ~PAGE_CGROUP_FLAG_ACTIVE;
-		pc->flags &= ~PAGE_CGROUP_FLAG_UNEVICTABLE;
+			ClearPageCgroupActive(pc);
+		ClearPageCgroupUnevictable(pc);
 	}
 
 	MEM_CGROUP_ZSTAT(mz, lru) += 1;
@@ -589,16 +637,7 @@ static int mem_cgroup_charge_common(stru
 	 * If a page is accounted as a page cache, insert to inactive list.
 	 * If anon, insert to active list.
 	 */
-	if (ctype == MEM_CGROUP_CHARGE_TYPE_CACHE) {
-		pc->flags = PAGE_CGROUP_FLAG_CACHE;
-		if (page_is_file_cache(page))
-			pc->flags |= PAGE_CGROUP_FLAG_FILE;
-		else
-			pc->flags |= PAGE_CGROUP_FLAG_ACTIVE;
-	} else if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
-		pc->flags = PAGE_CGROUP_FLAG_ACTIVE;
-	else /* MEM_CGROUP_CHARGE_TYPE_SHMEM */
-		pc->flags = PAGE_CGROUP_FLAG_CACHE | PAGE_CGROUP_FLAG_ACTIVE;
+	pc->flags = pcg_default_flags[ctype];
 
 	lock_page_cgroup(page);
 	if (unlikely(page_get_page_cgroup(page))) {
@@ -677,8 +716,12 @@ int mem_cgroup_cache_charge(struct page 
 	if (unlikely(!mm))
 		mm = &init_mm;
 
-	return mem_cgroup_charge_common(page, mm, gfp_mask,
+	if (page_is_file_cache(page))
+		return mem_cgroup_charge_common(page, mm, gfp_mask,
 				MEM_CGROUP_CHARGE_TYPE_CACHE, NULL);
+	else
+		return mem_cgroup_charge_common(page, mm, gfp_mask,
+				MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
 }
 
 /*
@@ -706,8 +749,7 @@ __mem_cgroup_uncharge_common(struct page
 	VM_BUG_ON(pc->page != page);
 
 	if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
-	    && ((pc->flags & PAGE_CGROUP_FLAG_CACHE)
-		|| page_mapped(page)))
+	    && ((PageCgroupCache(pc) || page_mapped(page))))
 		goto unlock;
 
 	mz = page_cgroup_zoneinfo(pc);
@@ -758,7 +800,7 @@ int mem_cgroup_prepare_migration(struct 
 	if (pc) {
 		mem = pc->mem_cgroup;
 		css_get(&mem->css);
-		if (pc->flags & PAGE_CGROUP_FLAG_CACHE) {
+		if (PageCgroupCache(pc)) {
 			if (page_is_file_cache(page))
 				ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
 			else



* [PATCH 2/6] memcg: allocate page_cgroup at boot
  2008-10-01  7:52 [PATCH 0/6] memcg update v6 (for review and discuss) KAMEZAWA Hiroyuki
  2008-10-01  7:55 ` [PATCH 1/6] atomic page_cgroup flags KAMEZAWA Hiroyuki
@ 2008-10-01  7:56 ` KAMEZAWA Hiroyuki
  2008-10-02  8:49   ` [PATCH 2/6] memcg: allocate page_cgroup at boot (hunk fix) KAMEZAWA Hiroyuki
  2008-10-06 10:11   ` [PATCH 2/6] memcg: allocate page_cgroup at boot Balbir Singh
  2008-10-01  7:57 ` [PATCH 3/6] memcg: charge-commit-cancel protocol KAMEZAWA Hiroyuki
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-10-01  7:56 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, LKML, balbir, nishimura

Allocate all page_cgroup structures at boot and remove the page_cgroup pointer
from struct page. This patch adds the interface

 struct page_cgroup *lookup_page_cgroup(struct page *)

FLATMEM, DISCONTIGMEM, SPARSEMEM and MEMORY_HOTPLUG are all supported.
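
A typical lookup pattern, as used in the memcontrol.c hunks below (shown here
only as an illustrative fragment):

	struct page_cgroup *pc;

	pc = lookup_page_cgroup(page);
	if (unlikely(!pc))		/* can happen early at boot */
		return;
	lock_page_cgroup(pc);
	if (PageCgroupUsed(pc)) {
		/* pc->mem_cgroup is stable while the lock is held */
	}
	unlock_page_cgroup(pc);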

Removing the page_cgroup pointer reduces memory usage by
 - 4 bytes per page (32-bit), or
 - 8 bytes per page (64-bit)
when the memory controller is disabled (even if it is configured in).

On a typical 8GB x86-32 server, this saves 8MB of NORMAL_ZONE memory.
On my x86-64 server with 48GB of memory, this saves 96MB of memory.
I think this reduction makes sense.
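
For reference, the arithmetic behind those numbers:

	 8 GB / 4 KB per page =  2,097,152 pages * 4 bytes =  8 MB   (x86-32)
	48 GB / 4 KB per page = 12,582,912 pages * 8 bytes = 96 MB   (x86-64)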

By pre-allocating, the kmalloc/kfree calls in charge/uncharge are removed.
This means
  - we no longer have to worry about kmalloc failure
    (which can happen depending on the gfp_mask type).
  - we avoid calling kmalloc/kfree on every charge/uncharge.
  - we avoid allocating tons of small objects, which can cause fragmentation.
  - we know exactly how much memory will be used for this extra LRU handling.

I added printk messages:

	"allocated %ld bytes of page_cgroup"
        "please try cgroup_disable=memory option if you don't want"

which should be informative enough for users.

Changelog: v5 -> v6
 * reflected review comments.
 * coding style fixes.
 * removed "ctype" from uncharge.
 * improved the comment to show that FLAT_NODE_MEM_MAP == !SPARSEMEM.
 * fixed errors in the !SPARSEMEM code.
 * removed an unused function from the !SPARSEMEM code.
(The changelog starts from v5 because of the series reordering.)

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 include/linux/memcontrol.h  |   11 -
 include/linux/mm_types.h    |    4 
 include/linux/mmzone.h      |   14 ++
 include/linux/page_cgroup.h |   90 ++++++++++++++++
 mm/Makefile                 |    2 
 mm/memcontrol.c             |  246 ++++++++++++++------------------------------
 mm/page_alloc.c             |   12 --
 mm/page_cgroup.c            |  237 ++++++++++++++++++++++++++++++++++++++++++
 8 files changed, 424 insertions(+), 192 deletions(-)

Index: mmotm-2.6.27-rc7+/mm/page_cgroup.c
===================================================================
--- /dev/null
+++ mmotm-2.6.27-rc7+/mm/page_cgroup.c
@@ -0,0 +1,237 @@
+#include <linux/mm.h>
+#include <linux/mmzone.h>
+#include <linux/bootmem.h>
+#include <linux/bit_spinlock.h>
+#include <linux/page_cgroup.h>
+#include <linux/hash.h>
+#include <linux/memory.h>
+
+static void __meminit
+__init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
+{
+	pc->flags = 0;
+	pc->mem_cgroup = NULL;
+	pc->page = pfn_to_page(pfn);
+}
+static unsigned long total_usage;
+
+#if !defined(CONFIG_SPARSEMEM)
+
+
+void __init pgdat_page_cgroup_init(struct pglist_data *pgdat)
+{
+	pgdat->node_page_cgroup = NULL;
+}
+
+struct page_cgroup *lookup_page_cgroup(struct page *page)
+{
+	unsigned long pfn = page_to_pfn(page);
+	unsigned long offset;
+	struct page_cgroup *base;
+
+	base = NODE_DATA(page_to_nid(page))->node_page_cgroup;
+	if (unlikely(!base))
+		return NULL;
+
+	offset = pfn - NODE_DATA(page_to_nid(page))->node_start_pfn;
+	return base + offset;
+}
+
+static int __init alloc_node_page_cgroup(int nid)
+{
+	struct page_cgroup *base, *pc;
+	unsigned long table_size;
+	unsigned long start_pfn, nr_pages, index;
+
+	start_pfn = NODE_DATA(nid)->node_start_pfn;
+	nr_pages = NODE_DATA(nid)->node_spanned_pages;
+
+	table_size = sizeof(struct page_cgroup) * nr_pages;
+
+	base = __alloc_bootmem_node_nopanic(NODE_DATA(nid),
+			table_size, PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
+	if (!base)
+		return -ENOMEM;
+	for (index = 0; index < nr_pages; index++) {
+		pc = base + index;
+		__init_page_cgroup(pc, start_pfn + index);
+	}
+	NODE_DATA(nid)->node_page_cgroup = base;
+	total_usage += table_size;
+	return 0;
+}
+
+void __init page_cgroup_init(void)
+{
+
+	int nid, fail;
+
+	for_each_online_node(nid)  {
+		fail = alloc_node_page_cgroup(nid);
+		if (fail)
+			goto fail;
+	}
+	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
+	printk(KERN_INFO "please try cgroup_disable=memory option if you"
+	" don't want\n");
+	return;
+fail:
+	printk(KERN_CRIT "allocation of page_cgroup was failed.\n");
+	printk(KERN_CRIT "please try cgroup_disable=memory boot option\n");
+	panic("Out of memory");
+}
+
+#else /* CONFIG_FLAT_NODE_MEM_MAP */
+
+struct page_cgroup *lookup_page_cgroup(struct page *page)
+{
+	unsigned long pfn = page_to_pfn(page);
+	struct mem_section *section = __pfn_to_section(pfn);
+
+	return section->page_cgroup + pfn;
+}
+
+int __meminit init_section_page_cgroup(unsigned long pfn)
+{
+	struct mem_section *section;
+	struct page_cgroup *base, *pc;
+	unsigned long table_size;
+	int nid, index;
+
+	section = __pfn_to_section(pfn);
+
+	if (section->page_cgroup)
+		return 0;
+
+	nid = page_to_nid(pfn_to_page(pfn));
+
+	table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
+	base = kmalloc_node(table_size, GFP_KERNEL, nid);
+	if (!base)
+		base = vmalloc_node(table_size, nid);
+
+	if (!base) {
+		printk(KERN_ERR "page cgroup allocation failure\n");
+		return -ENOMEM;
+	}
+
+	for (index = 0; index < PAGES_PER_SECTION; index++) {
+		pc = base + index;
+		__init_page_cgroup(pc, pfn + index);
+	}
+
+	section = __pfn_to_section(pfn);
+	section->page_cgroup = base - pfn;
+	total_usage += table_size;
+	return 0;
+}
+#ifdef CONFIG_MEMORY_HOTPLUG
+void __free_page_cgroup(unsigned long pfn)
+{
+	struct mem_section *ms;
+	struct page_cgroup *base;
+
+	ms = __pfn_to_section(pfn);
+	if (!ms || !ms->page_cgroup)
+		return;
+	base = ms->page_cgroup + pfn;
+	ms->page_cgroup = NULL;
+	if (is_vmalloc_addr(base))
+		vfree(base);
+	else
+		kfree(base);
+}
+
+int online_page_cgroup(unsigned long start_pfn,
+			unsigned long nr_pages,
+			int nid)
+{
+	unsigned long start, end, pfn;
+	int fail = 0;
+
+	start = start_pfn & (PAGES_PER_SECTION - 1);
+	end = ALIGN(start_pfn + nr_pages, PAGES_PER_SECTION);
+
+	for (pfn = start; !fail && pfn < end; pfn += PAGES_PER_SECTION) {
+		if (!pfn_present(pfn))
+			continue;
+		fail = init_section_page_cgroup(pfn);
+	}
+	if (!fail)
+		return 0;
+
+	/* rollback */
+	for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION)
+		__free_page_cgroup(pfn);
+
+	return -ENOMEM;
+}
+
+int offline_page_cgroup(unsigned long start_pfn,
+		unsigned long nr_pages, int nid)
+{
+	unsigned long start, end, pfn;
+
+	start = start_pfn & (PAGES_PER_SECTION - 1);
+	end = ALIGN(start_pfn + nr_pages, PAGES_PER_SECTION);
+
+	for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION)
+		__free_page_cgroup(pfn);
+	return 0;
+
+}
+
+static int page_cgroup_callback(struct notifier_block *self,
+			       unsigned long action, void *arg)
+{
+	struct memory_notify *mn = arg;
+	int ret = 0;
+	switch (action) {
+	case MEM_GOING_ONLINE:
+		ret = online_page_cgroup(mn->start_pfn,
+				   mn->nr_pages, mn->status_change_nid);
+		break;
+	case MEM_CANCEL_ONLINE:
+	case MEM_OFFLINE:
+		offline_page_cgroup(mn->start_pfn,
+				mn->nr_pages, mn->status_change_nid);
+		break;
+	case MEM_GOING_OFFLINE:
+		break;
+	case MEM_ONLINE:
+	case MEM_CANCEL_OFFLINE:
+		break;
+	}
+	ret = notifier_from_errno(ret);
+	return ret;
+}
+
+#endif
+
+void __init page_cgroup_init(void)
+{
+	unsigned long pfn;
+	int fail = 0;
+
+	for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
+		if (!pfn_present(pfn))
+			continue;
+		fail = init_section_page_cgroup(pfn);
+	}
+	if (fail) {
+		printk(KERN_CRIT "try cgroup_disable=memory boot option\n");
+		panic("Out of memory");
+	} else {
+		hotplug_memory_notifier(page_cgroup_callback, 0);
+	}
+	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
+	printk(KERN_INFO "please try cgroup_disable=memory option if you don't"
+	" want\n");
+}
+
+void __init pgdat_page_cgroup_init(struct pglist_data *pgdat)
+{
+	return;
+}
+
+#endif
Index: mmotm-2.6.27-rc7+/include/linux/mm_types.h
===================================================================
--- mmotm-2.6.27-rc7+.orig/include/linux/mm_types.h
+++ mmotm-2.6.27-rc7+/include/linux/mm_types.h
@@ -94,10 +94,6 @@ struct page {
 	void *virtual;			/* Kernel virtual address (NULL if
 					   not kmapped, ie. highmem) */
 #endif /* WANT_PAGE_VIRTUAL */
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
-	unsigned long page_cgroup;
-#endif
-
 #ifdef CONFIG_KMEMCHECK
 	void *shadow;
 #endif
Index: mmotm-2.6.27-rc7+/mm/Makefile
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/Makefile
+++ mmotm-2.6.27-rc7+/mm/Makefile
@@ -34,6 +34,6 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
-obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
 obj-$(CONFIG_CGROUP_MEMRLIMIT_CTLR) += memrlimitcgroup.o
 obj-$(CONFIG_KMEMTRACE) += kmemtrace.o
Index: mmotm-2.6.27-rc7+/include/linux/page_cgroup.h
===================================================================
--- /dev/null
+++ mmotm-2.6.27-rc7+/include/linux/page_cgroup.h
@@ -0,0 +1,90 @@
+#ifndef __LINUX_PAGE_CGROUP_H
+#define __LINUX_PAGE_CGROUP_H
+#include <linux/bit_spinlock.h>
+/*
+ * Page Cgroup can be considered as an extended mem_map.
+ * A page_cgroup page is associated with every page descriptor. The
+ * page_cgroup helps us identify information about the cgroup
+ * All page cgroups are allocated at boot or memory hotplug event,
+ * then the page cgroup for pfn always exists.
+ */
+struct page_cgroup {
+	unsigned long flags;
+	struct mem_cgroup *mem_cgroup;
+	struct page *page;
+	struct list_head lru;		/* per cgroup LRU list */
+};
+
+void __init pgdat_page_cgroup_init(struct pglist_data *pgdat);
+void __init page_cgroup_init(void);
+struct page_cgroup *lookup_page_cgroup(struct page *page);
+
+enum {
+	/* flags for mem_cgroup */
+	PCG_LOCK,  /* page cgroup is locked */
+	PCG_CACHE, /* charged as cache */
+	PCG_USED, /* this object is in use. */
+	/* flags for LRU placement */
+	PCG_ACTIVE, /* page is active in this cgroup */
+	PCG_FILE, /* page is file system backed */
+	PCG_UNEVICTABLE, /* page is unevictableable */
+};
+
+#define TESTPCGFLAG(uname, lname)			\
+static inline int PageCgroup##uname(struct page_cgroup *pc)	\
+	{ return test_bit(PCG_##lname, &pc->flags); }
+
+#define SETPCGFLAG(uname, lname)			\
+static inline void SetPageCgroup##uname(struct page_cgroup *pc)\
+	{ set_bit(PCG_##lname, &pc->flags);  }
+
+#define CLEARPCGFLAG(uname, lname)			\
+static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
+	{ clear_bit(PCG_##lname, &pc->flags);  }
+
+/* Cache flag is set only once (at allocation) */
+TESTPCGFLAG(Cache, CACHE)
+
+TESTPCGFLAG(Used, USED)
+CLEARPCGFLAG(Used, USED)
+
+/* LRU management flags (from global-lru definition) */
+TESTPCGFLAG(File, FILE)
+SETPCGFLAG(File, FILE)
+CLEARPCGFLAG(File, FILE)
+
+TESTPCGFLAG(Active, ACTIVE)
+SETPCGFLAG(Active, ACTIVE)
+CLEARPCGFLAG(Active, ACTIVE)
+
+TESTPCGFLAG(Unevictable, UNEVICTABLE)
+SETPCGFLAG(Unevictable, UNEVICTABLE)
+CLEARPCGFLAG(Unevictable, UNEVICTABLE)
+
+static inline int page_cgroup_nid(struct page_cgroup *pc)
+{
+	return page_to_nid(pc->page);
+}
+
+static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
+{
+	return page_zonenum(pc->page);
+}
+
+static inline void lock_page_cgroup(struct page_cgroup *pc)
+{
+	bit_spin_lock(PCG_LOCK, &pc->flags);
+}
+
+static inline int trylock_page_cgroup(struct page_cgroup *pc)
+{
+	return bit_spin_trylock(PCG_LOCK, &pc->flags);
+}
+
+static inline void unlock_page_cgroup(struct page_cgroup *pc)
+{
+	bit_spin_unlock(PCG_LOCK, &pc->flags);
+}
+
+
+#endif
Index: mmotm-2.6.27-rc7+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc7+/mm/memcontrol.c
@@ -33,11 +33,11 @@
 #include <linux/seq_file.h>
 #include <linux/vmalloc.h>
 #include <linux/mm_inline.h>
+#include <linux/page_cgroup.h>
 
 #include <asm/uaccess.h>
 
 struct cgroup_subsys mem_cgroup_subsys __read_mostly;
-static struct kmem_cache *page_cgroup_cache __read_mostly;
 #define MEM_CGROUP_RECLAIM_RETRIES	5
 
 /*
@@ -135,79 +135,6 @@ struct mem_cgroup {
 };
 static struct mem_cgroup init_mem_cgroup;
 
-/*
- * We use the lower bit of the page->page_cgroup pointer as a bit spin
- * lock.  We need to ensure that page->page_cgroup is at least two
- * byte aligned (based on comments from Nick Piggin).  But since
- * bit_spin_lock doesn't actually set that lock bit in a non-debug
- * uniprocessor kernel, we should avoid setting it here too.
- */
-#define PAGE_CGROUP_LOCK_BIT 	0x0
-#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
-#define PAGE_CGROUP_LOCK 	(1 << PAGE_CGROUP_LOCK_BIT)
-#else
-#define PAGE_CGROUP_LOCK	0x0
-#endif
-
-/*
- * A page_cgroup page is associated with every page descriptor. The
- * page_cgroup helps us identify information about the cgroup
- */
-struct page_cgroup {
-	struct list_head lru;		/* per cgroup LRU list */
-	struct page *page;
-	struct mem_cgroup *mem_cgroup;
-	unsigned long flags;
-};
-
-enum {
-	/* flags for mem_cgroup */
-	PCG_CACHE, /* charged as cache */
-	/* flags for LRU placement */
-	PCG_ACTIVE, /* page is active in this cgroup */
-	PCG_FILE, /* page is file system backed */
-	PCG_UNEVICTABLE, /* page is unevictableable */
-};
-
-#define TESTPCGFLAG(uname, lname)			\
-static inline int PageCgroup##uname(struct page_cgroup *pc)	\
-	{ return test_bit(PCG_##lname, &pc->flags); }
-
-#define SETPCGFLAG(uname, lname)			\
-static inline void SetPageCgroup##uname(struct page_cgroup *pc)\
-	{ set_bit(PCG_##lname, &pc->flags);  }
-
-#define CLEARPCGFLAG(uname, lname)			\
-static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
-	{ clear_bit(PCG_##lname, &pc->flags);  }
-
-
-/* Cache flag is set only once (at allocation) */
-TESTPCGFLAG(Cache, CACHE)
-
-/* LRU management flags (from global-lru definition) */
-TESTPCGFLAG(File, FILE)
-SETPCGFLAG(File, FILE)
-CLEARPCGFLAG(File, FILE)
-
-TESTPCGFLAG(Active, ACTIVE)
-SETPCGFLAG(Active, ACTIVE)
-CLEARPCGFLAG(Active, ACTIVE)
-
-TESTPCGFLAG(Unevictable, UNEVICTABLE)
-SETPCGFLAG(Unevictable, UNEVICTABLE)
-CLEARPCGFLAG(Unevictable, UNEVICTABLE)
-
-static int page_cgroup_nid(struct page_cgroup *pc)
-{
-	return page_to_nid(pc->page);
-}
-
-static enum zone_type page_cgroup_zid(struct page_cgroup *pc)
-{
-	return page_zonenum(pc->page);
-}
-
 enum charge_type {
 	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
 	MEM_CGROUP_CHARGE_TYPE_MAPPED,
@@ -216,12 +143,18 @@ enum charge_type {
 	NR_CHARGE_TYPE,
 };
 
+/* only for here (for easy reading.) */
+#define PCGF_CACHE	(1UL << PCG_CACHE)
+#define PCGF_USED	(1UL << PCG_USED)
+#define PCGF_ACTIVE	(1UL << PCG_ACTIVE)
+#define PCGF_LOCK	(1UL << PCG_LOCK)
+#define PCGF_FILE	(1UL << PCG_FILE)
 static const unsigned long
 pcg_default_flags[NR_CHARGE_TYPE] = {
-	((1 << PCG_CACHE) | (1 << PCG_FILE)),
-	((1 << PCG_ACTIVE)),
-	((1 << PCG_ACTIVE) | (1 << PCG_CACHE)),
-	0,
+	PCGF_CACHE | PCGF_FILE | PCGF_USED | PCGF_LOCK, /* File Cache */
+	PCGF_ACTIVE | PCGF_USED | PCGF_LOCK, /* Anon */
+	PCGF_ACTIVE | PCGF_CACHE | PCGF_USED | PCGF_LOCK, /* Shmem */
+	0, /* FORCE */
 };
 
 /*
@@ -303,37 +236,6 @@ struct mem_cgroup *mem_cgroup_from_task(
 				struct mem_cgroup, css);
 }
 
-static inline int page_cgroup_locked(struct page *page)
-{
-	return bit_spin_is_locked(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-static void page_assign_page_cgroup(struct page *page, struct page_cgroup *pc)
-{
-	VM_BUG_ON(!page_cgroup_locked(page));
-	page->page_cgroup = ((unsigned long)pc | PAGE_CGROUP_LOCK);
-}
-
-struct page_cgroup *page_get_page_cgroup(struct page *page)
-{
-	return (struct page_cgroup *) (page->page_cgroup & ~PAGE_CGROUP_LOCK);
-}
-
-static void lock_page_cgroup(struct page *page)
-{
-	bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-static int try_lock_page_cgroup(struct page *page)
-{
-	return bit_spin_trylock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-static void unlock_page_cgroup(struct page *page)
-{
-	bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
 static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
 			struct page_cgroup *pc)
 {
@@ -436,17 +338,16 @@ void mem_cgroup_move_lists(struct page *
 	 * safely get to page_cgroup without it, so just try_lock it:
 	 * mem_cgroup_isolate_pages allows for page left on wrong list.
 	 */
-	if (!try_lock_page_cgroup(page))
+	pc = lookup_page_cgroup(page);
+	if (!trylock_page_cgroup(pc))
 		return;
-
-	pc = page_get_page_cgroup(page);
-	if (pc) {
+	if (pc && PageCgroupUsed(pc)) {
 		mz = page_cgroup_zoneinfo(pc);
 		spin_lock_irqsave(&mz->lru_lock, flags);
 		__mem_cgroup_move_lists(pc, lru);
 		spin_unlock_irqrestore(&mz->lru_lock, flags);
 	}
-	unlock_page_cgroup(page);
+	unlock_page_cgroup(pc);
 }
 
 /*
@@ -533,6 +434,8 @@ unsigned long mem_cgroup_isolate_pages(u
 	list_for_each_entry_safe_reverse(pc, tmp, src, lru) {
 		if (scan >= nr_to_scan)
 			break;
+		if (unlikely(!PageCgroupUsed(pc)))
+			continue;
 		page = pc->page;
 
 		if (unlikely(!PageLRU(page)))
@@ -576,26 +479,27 @@ static int mem_cgroup_charge_common(stru
 {
 	struct mem_cgroup *mem;
 	struct page_cgroup *pc;
-	unsigned long flags;
 	unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 	struct mem_cgroup_per_zone *mz;
+	unsigned long flags;
 
-	pc = kmem_cache_alloc(page_cgroup_cache, gfp_mask);
-	if (unlikely(pc == NULL))
-		goto err;
-
+	pc = lookup_page_cgroup(page);
+	/* can happen at boot */
+	if (unlikely(!pc))
+		return 0;
+	prefetchw(pc);
 	/*
 	 * We always charge the cgroup the mm_struct belongs to.
 	 * The mm_struct's mem_cgroup changes on task migration if the
 	 * thread group leader migrates. It's possible that mm is not
 	 * set, if so charge the init_mm (happens for pagecache usage).
 	 */
+
 	if (likely(!memcg)) {
 		rcu_read_lock();
 		mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
 		if (unlikely(!mem)) {
 			rcu_read_unlock();
-			kmem_cache_free(page_cgroup_cache, pc);
 			return 0;
 		}
 		/*
@@ -631,36 +535,34 @@ static int mem_cgroup_charge_common(stru
 		}
 	}
 
+
+	lock_page_cgroup(pc);
+
+	if (unlikely(PageCgroupUsed(pc))) {
+		unlock_page_cgroup(pc);
+		res_counter_uncharge(&mem->res, PAGE_SIZE);
+		css_put(&mem->css);
+
+		goto done;
+	}
 	pc->mem_cgroup = mem;
-	pc->page = page;
 	/*
 	 * If a page is accounted as a page cache, insert to inactive list.
 	 * If anon, insert to active list.
 	 */
 	pc->flags = pcg_default_flags[ctype];
 
-	lock_page_cgroup(page);
-	if (unlikely(page_get_page_cgroup(page))) {
-		unlock_page_cgroup(page);
-		res_counter_uncharge(&mem->res, PAGE_SIZE);
-		css_put(&mem->css);
-		kmem_cache_free(page_cgroup_cache, pc);
-		goto done;
-	}
-	page_assign_page_cgroup(page, pc);
-
 	mz = page_cgroup_zoneinfo(pc);
+
 	spin_lock_irqsave(&mz->lru_lock, flags);
 	__mem_cgroup_add_list(mz, pc);
 	spin_unlock_irqrestore(&mz->lru_lock, flags);
+	unlock_page_cgroup(pc);
 
-	unlock_page_cgroup(page);
 done:
 	return 0;
 out:
 	css_put(&mem->css);
-	kmem_cache_free(page_cgroup_cache, pc);
-err:
 	return -ENOMEM;
 }
 
@@ -668,7 +570,8 @@ int mem_cgroup_charge(struct page *page,
 {
 	if (mem_cgroup_subsys.disabled)
 		return 0;
-
+	if (PageCompound(page))
+		return 0;
 	/*
 	 * If already mapped, we don't have to account.
 	 * If page cache, page->mapping has address_space.
@@ -689,7 +592,8 @@ int mem_cgroup_cache_charge(struct page 
 {
 	if (mem_cgroup_subsys.disabled)
 		return 0;
-
+	if (PageCompound(page))
+		return 0;
 	/*
 	 * Corner case handling. This is called from add_to_page_cache()
 	 * in usual. But some FS (shmem) precharges this page before calling it
@@ -702,15 +606,16 @@ int mem_cgroup_cache_charge(struct page 
 	if (!(gfp_mask & __GFP_WAIT)) {
 		struct page_cgroup *pc;
 
-		lock_page_cgroup(page);
-		pc = page_get_page_cgroup(page);
-		if (pc) {
-			VM_BUG_ON(pc->page != page);
-			VM_BUG_ON(!pc->mem_cgroup);
-			unlock_page_cgroup(page);
+
+		pc = lookup_page_cgroup(page);
+		if (!pc)
+			return 0;
+		lock_page_cgroup(pc);
+		if (PageCgroupUsed(pc)) {
+			unlock_page_cgroup(pc);
 			return 0;
 		}
-		unlock_page_cgroup(page);
+		unlock_page_cgroup(pc);
 	}
 
 	if (unlikely(!mm))
@@ -741,37 +646,39 @@ __mem_cgroup_uncharge_common(struct page
 	/*
 	 * Check if our page_cgroup is valid
 	 */
-	lock_page_cgroup(page);
-	pc = page_get_page_cgroup(page);
-	if (unlikely(!pc))
-		goto unlock;
-
-	VM_BUG_ON(pc->page != page);
+	pc = lookup_page_cgroup(page);
+	if (unlikely(!pc || !PageCgroupUsed(pc)))
+		return;
 
-	if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
-	    && ((PageCgroupCache(pc) || page_mapped(page))))
-		goto unlock;
+	lock_page_cgroup(pc);
+	if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED && page_mapped(page))
+	     || !PageCgroupUsed(pc)) {
+		/* This happens at race in zap_pte_range() and do_swap_page()*/
+		unlock_page_cgroup(pc);
+		return;
+	}
+	ClearPageCgroupUsed(pc);
+	mem = pc->mem_cgroup;
 
 	mz = page_cgroup_zoneinfo(pc);
 	spin_lock_irqsave(&mz->lru_lock, flags);
 	__mem_cgroup_remove_list(mz, pc);
 	spin_unlock_irqrestore(&mz->lru_lock, flags);
+	unlock_page_cgroup(pc);
 
-	page_assign_page_cgroup(page, NULL);
-	unlock_page_cgroup(page);
-
-	mem = pc->mem_cgroup;
 	res_counter_uncharge(&mem->res, PAGE_SIZE);
 	css_put(&mem->css);
 
-	kmem_cache_free(page_cgroup_cache, pc);
 	return;
-unlock:
-	unlock_page_cgroup(page);
 }
 
 void mem_cgroup_uncharge_page(struct page *page)
 {
+	/* early check. */
+	if (page_mapped(page))
+		return;
+	if (page->mapping && !PageAnon(page))
+		return;
 	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_MAPPED);
 }
 
@@ -795,9 +702,9 @@ int mem_cgroup_prepare_migration(struct 
 	if (mem_cgroup_subsys.disabled)
 		return 0;
 
-	lock_page_cgroup(page);
-	pc = page_get_page_cgroup(page);
-	if (pc) {
+	pc = lookup_page_cgroup(page);
+	lock_page_cgroup(pc);
+	if (PageCgroupUsed(pc)) {
 		mem = pc->mem_cgroup;
 		css_get(&mem->css);
 		if (PageCgroupCache(pc)) {
@@ -807,7 +714,7 @@ int mem_cgroup_prepare_migration(struct 
 				ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;
 		}
 	}
-	unlock_page_cgroup(page);
+	unlock_page_cgroup(pc);
 	if (mem) {
 		ret = mem_cgroup_charge_common(newpage, NULL, GFP_KERNEL,
 			ctype, mem);
@@ -832,7 +739,7 @@ void mem_cgroup_end_migration(struct pag
 	 */
 	if (!newpage->mapping)
 		__mem_cgroup_uncharge_common(newpage,
-					 MEM_CGROUP_CHARGE_TYPE_FORCE);
+				MEM_CGROUP_CHARGE_TYPE_FORCE);
 	else if (PageAnon(newpage))
 		mem_cgroup_uncharge_page(newpage);
 }
@@ -918,6 +825,8 @@ static void mem_cgroup_force_empty_list(
 	while (!list_empty(list)) {
 		pc = list_entry(list->prev, struct page_cgroup, lru);
 		page = pc->page;
+		if (!PageCgroupUsed(pc))
+			break;
 		get_page(page);
 		spin_unlock_irqrestore(&mz->lru_lock, flags);
 		/*
@@ -932,8 +841,10 @@ static void mem_cgroup_force_empty_list(
 				count = FORCE_UNCHARGE_BATCH;
 				cond_resched();
 			}
-		} else
-			cond_resched();
+		} else {
+			spin_lock_irqsave(&mz->lru_lock, flags);
+			break;
+		}
 		spin_lock_irqsave(&mz->lru_lock, flags);
 	}
 	spin_unlock_irqrestore(&mz->lru_lock, flags);
@@ -965,6 +876,7 @@ static int mem_cgroup_force_empty(struct
 				for_each_lru(l)
 					mem_cgroup_force_empty_list(mem, mz, l);
 			}
+		cond_resched();
 	}
 	ret = 0;
 out:
@@ -1175,8 +1087,8 @@ mem_cgroup_create(struct cgroup_subsys *
 	int node;
 
 	if (unlikely((cont->parent) == NULL)) {
+		page_cgroup_init();
 		mem = &init_mem_cgroup;
-		page_cgroup_cache = KMEM_CACHE(page_cgroup, SLAB_PANIC);
 	} else {
 		mem = mem_cgroup_alloc();
 		if (!mem)
Index: mmotm-2.6.27-rc7+/mm/page_alloc.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/page_alloc.c
+++ mmotm-2.6.27-rc7+/mm/page_alloc.c
@@ -44,7 +44,7 @@
 #include <linux/backing-dev.h>
 #include <linux/fault-inject.h>
 #include <linux/page-isolation.h>
-#include <linux/memcontrol.h>
+#include <linux/page_cgroup.h>
 #include <linux/debugobjects.h>
 
 #include <asm/tlbflush.h>
@@ -223,17 +223,12 @@ static inline int bad_range(struct zone 
 
 static void bad_page(struct page *page)
 {
-	void *pc = page_get_page_cgroup(page);
-
 	printk(KERN_EMERG "Bad page state in process '%s'\n" KERN_EMERG
 		"page:%p flags:0x%0*lx mapping:%p mapcount:%d count:%d\n",
 		current->comm, page, (int)(2*sizeof(unsigned long)),
 		(unsigned long)page->flags, page->mapping,
 		page_mapcount(page), page_count(page));
-	if (pc) {
-		printk(KERN_EMERG "cgroup:%p\n", pc);
-		page_reset_bad_cgroup(page);
-	}
+
 	printk(KERN_EMERG "Trying to fix it up, but a reboot is needed\n"
 		KERN_EMERG "Backtrace:\n");
 	dump_stack();
@@ -472,7 +467,6 @@ static inline void free_pages_check(stru
 	free_page_mlock(page);
 	if (unlikely(page_mapcount(page) |
 		(page->mapping != NULL)  |
-		(page_get_page_cgroup(page) != NULL) |
 		(page_count(page) != 0)  |
 		(page->flags & PAGE_FLAGS_CHECK_AT_FREE)))
 		bad_page(page);
@@ -609,7 +603,6 @@ static void prep_new_page(struct page *p
 {
 	if (unlikely(page_mapcount(page) |
 		(page->mapping != NULL)  |
-		(page_get_page_cgroup(page) != NULL) |
 		(page_count(page) != 0)  |
 		(page->flags & PAGE_FLAGS_CHECK_AT_PREP)))
 		bad_page(page);
@@ -3495,6 +3488,7 @@ static void __paginginit free_area_init_
 	pgdat->nr_zones = 0;
 	init_waitqueue_head(&pgdat->kswapd_wait);
 	pgdat->kswapd_max_order = 0;
+	pgdat_page_cgroup_init(pgdat);
 	
 	for (j = 0; j < MAX_NR_ZONES; j++) {
 		struct zone *zone = pgdat->node_zones + j;
Index: mmotm-2.6.27-rc7+/include/linux/mmzone.h
===================================================================
--- mmotm-2.6.27-rc7+.orig/include/linux/mmzone.h
+++ mmotm-2.6.27-rc7+/include/linux/mmzone.h
@@ -602,8 +602,11 @@ typedef struct pglist_data {
 	struct zone node_zones[MAX_NR_ZONES];
 	struct zonelist node_zonelists[MAX_ZONELISTS];
 	int nr_zones;
-#ifdef CONFIG_FLAT_NODE_MEM_MAP
+#ifdef CONFIG_FLAT_NODE_MEM_MAP	/* means !SPARSEMEM */
 	struct page *node_mem_map;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+	struct page_cgroup *node_page_cgroup;
+#endif
 #endif
 	struct bootmem_data *bdata;
 #ifdef CONFIG_MEMORY_HOTPLUG
@@ -932,6 +935,7 @@ static inline unsigned long early_pfn_to
 #endif
 
 struct page;
+struct page_cgroup;
 struct mem_section {
 	/*
 	 * This is, logically, a pointer to an array of struct
@@ -949,6 +953,14 @@ struct mem_section {
 
 	/* See declaration of similar field in struct zone */
 	unsigned long *pageblock_flags;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+	/*
+	 * If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use
+	 * section. (see memcontrol.h/page_cgroup.h about this.)
+	 */
+	struct page_cgroup *page_cgroup;
+	unsigned long pad;
+#endif
 };
 
 #ifdef CONFIG_SPARSEMEM_EXTREME
Index: mmotm-2.6.27-rc7+/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.27-rc7+.orig/include/linux/memcontrol.h
+++ mmotm-2.6.27-rc7+/include/linux/memcontrol.h
@@ -29,7 +29,6 @@ struct mm_struct;
 
 #define page_reset_bad_cgroup(page)	((page)->page_cgroup = 0)
 
-extern struct page_cgroup *page_get_page_cgroup(struct page *page);
 extern int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask);
 extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
@@ -72,16 +71,8 @@ extern void mem_cgroup_record_reclaim_pr
 extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
 					int priority, enum lru_list lru);
 
-#else /* CONFIG_CGROUP_MEM_RES_CTLR */
-static inline void page_reset_bad_cgroup(struct page *page)
-{
-}
-
-static inline struct page_cgroup *page_get_page_cgroup(struct page *page)
-{
-	return NULL;
-}
 
+#else /* CONFIG_CGROUP_MEM_RES_CTLR */
 static inline int mem_cgroup_charge(struct page *page,
 					struct mm_struct *mm, gfp_t gfp_mask)
 {



* [PATCH 3/6] memcg: charge-commit-cancel protocol
  2008-10-01  7:52 [PATCH 0/6] memcg update v6 (for review and discuss) KAMEZAWA Hiroyuki
  2008-10-01  7:55 ` [PATCH 1/6] atomic page_cgroup flags KAMEZAWA Hiroyuki
  2008-10-01  7:56 ` [PATCH 2/6] memcg: allocate page_cgroup at boot KAMEZAWA Hiroyuki
@ 2008-10-01  7:57 ` KAMEZAWA Hiroyuki
  2008-10-01  8:33   ` Daisuke Nishimura
                     ` (4 more replies)
  2008-10-01  7:59 ` [PATCH 4/6] memcg: new force_empty and move_account KAMEZAWA Hiroyuki
                   ` (3 subsequent siblings)
  6 siblings, 5 replies; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-10-01  7:57 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, LKML, balbir, nishimura

There is a small race in do_swap_page(). When the swapped-in page is charged,
its mapcount can be greater than 0. But at the same time, some process which
shares the page can unmap it, dropping the mapcount from 1 to 0, so that the
page is uncharged.

      CPUA 			CPUB
       mapcount == 1.
   (1) charge if mapcount==0     zap_pte_range()
                                (2) mapcount 1 => 0.
			        (3) uncharge(). (success)
   (4) set page's rmap()
       mapcount 0 => 1

As a result, the accounting for this swap page is lost: the page ends up
mapped but uncharged.

To fix this, I added a new interface:
  - precharge
    account PAGE_SIZE to the res_counter and try to free pages if necessary.
  - commit
    register the page_cgroup and add it to the LRU if necessary.
  - cancel
    uncharge PAGE_SIZE when do_swap_page() fails.


     CPUA
  (1) charge (always)
  (2) set the page's rmap (mapcount > 0)
  (3) commit the charge after set_pte(); the commit decides whether the charge
      was actually needed.

This protocol uses the PCG_USED bit on page_cgroup to avoid over-accounting.
The usual mem_cgroup_charge_common() does precharge -> commit in one step.
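
In do_swap_page(), the three steps end up looking roughly like this (condensed
from the mm/memory.c hunk below; error handling trimmed):

	struct mem_cgroup *ptr = NULL;

	if (mem_cgroup_try_charge(mm, GFP_KERNEL, &ptr) == -ENOMEM)	/* precharge */
		return VM_FAULT_OOM;
	...
	set_pte_at(mm, address, page_table, pte);
	page_add_anon_rmap(page, vma, address);
	mem_cgroup_commit_charge_swapin(page, ptr);	/* commit: sets PCG_USED, adds to LRU */
	...
out_nomap:
	mem_cgroup_cancel_charge_swapin(ptr);		/* cancel: return the PAGE_SIZE charge */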

This patch also adds the following functions to clarify all charge paths:

  - mem_cgroup_newpage_charge() .... replacement for mem_cgroup_charge(),
	called for newly allocated anonymous pages.

  - mem_cgroup_charge_migrate_fixup()
        called only from remove_migration_ptes().
	We'll have to rewrite this later (this patch just keeps the old behavior).

This is good for clarifying "what we do".

We then have the following 4 charge points:
  - newpage
  - swapin
  - add-to-cache
  - migration

precharge/commit/cancel can be used in other places as well,
 - shmem (and other places that need precharge)
 - move_account (force_empty), etc.
We'll revisit these later.

Changelog v5 -> v6:
 - added newpage_charge() and migrate_fixup().
 - renamed the swap-in functions from "swap" to "swapin".
 - added a more precise description.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 include/linux/memcontrol.h |   35 +++++++++-
 mm/memcontrol.c            |  151 +++++++++++++++++++++++++++++++++++----------
 mm/memory.c                |   12 ++-
 mm/migrate.c               |    2 
 mm/swapfile.c              |    6 +
 5 files changed, 165 insertions(+), 41 deletions(-)

Index: mmotm-2.6.27-rc7+/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.27-rc7+.orig/include/linux/memcontrol.h
+++ mmotm-2.6.27-rc7+/include/linux/memcontrol.h
@@ -29,8 +29,17 @@ struct mm_struct;
 
 #define page_reset_bad_cgroup(page)	((page)->page_cgroup = 0)
 
-extern int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
+extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask);
+extern int mem_cgroup_charge_migrate_fixup(struct page *page,
+				struct mm_struct *mm, gfp_t gfp_mask);
+/* for swap handling */
+extern int mem_cgroup_try_charge(struct mm_struct *mm,
+		gfp_t gfp_mask, struct mem_cgroup **ptr);
+extern void mem_cgroup_commit_charge_swapin(struct page *page,
+					struct mem_cgroup *ptr);
+extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr);
+
 extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 					gfp_t gfp_mask);
 extern void mem_cgroup_move_lists(struct page *page, enum lru_list lru);
@@ -73,7 +82,9 @@ extern long mem_cgroup_calc_reclaim(stru
 
 
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
-static inline int mem_cgroup_charge(struct page *page,
+struct mem_cgroup;
+
+static inline int mem_cgroup_newpage_charge(struct page *page,
 					struct mm_struct *mm, gfp_t gfp_mask)
 {
 	return 0;
@@ -85,6 +96,26 @@ static inline int mem_cgroup_cache_charg
 	return 0;
 }
 
+static inline int mem_cgroup_charge_migrate_fixup(struct page *page,
+					struct mm_struct *mm, gfp_t gfp_mask)
+{
+	return 0;
+}
+
+static int mem_cgroup_try_charge(struct mm_struct *mm,
+				gfp_t gfp_mask, struct mem_cgroup **ptr)
+{
+	return 0;
+}
+
+static void mem_cgroup_commit_charge_swapin(struct page *page,
+					  struct mem_cgroup *ptr)
+{
+}
+static void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr)
+{
+}
+
 static inline void mem_cgroup_uncharge_page(struct page *page)
 {
 }
Index: mmotm-2.6.27-rc7+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc7+/mm/memcontrol.c
@@ -467,35 +467,31 @@ unsigned long mem_cgroup_isolate_pages(u
 	return nr_taken;
 }
 
-/*
- * Charge the memory controller for page usage.
- * Return
- * 0 if the charge was successful
- * < 0 if the cgroup is over its limit
+
+/**
+ * mem_cgroup_try_charge - get charge of PAGE_SIZE.
+ * @mm: an mm_struct which is charged against. (when *memcg is NULL)
+ * @gfp_mask: gfp_mask for reclaim.
+ * @memcg: a pointer to memory cgroup which is charged against.
+ *
+ * charge aginst memory cgroup pointed by *memcg. if *memcg == NULL, estimated
+ * memory cgroup from @mm is got and stored in *memcg.
+ *
+ * Retruns 0 if success. -ENOMEM at failure.
  */
-static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
-				gfp_t gfp_mask, enum charge_type ctype,
-				struct mem_cgroup *memcg)
+
+int mem_cgroup_try_charge(struct mm_struct *mm,
+			gfp_t gfp_mask, struct mem_cgroup **memcg)
 {
 	struct mem_cgroup *mem;
-	struct page_cgroup *pc;
-	unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
-	struct mem_cgroup_per_zone *mz;
-	unsigned long flags;
-
-	pc = lookup_page_cgroup(page);
-	/* can happen at boot */
-	if (unlikely(!pc))
-		return 0;
-	prefetchw(pc);
+	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 	/*
 	 * We always charge the cgroup the mm_struct belongs to.
 	 * The mm_struct's mem_cgroup changes on task migration if the
 	 * thread group leader migrates. It's possible that mm is not
 	 * set, if so charge the init_mm (happens for pagecache usage).
 	 */
-
-	if (likely(!memcg)) {
+	if (likely(!*memcg)) {
 		rcu_read_lock();
 		mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
 		if (unlikely(!mem)) {
@@ -506,15 +502,17 @@ static int mem_cgroup_charge_common(stru
 		 * For every charge from the cgroup, increment reference count
 		 */
 		css_get(&mem->css);
+		*memcg = mem;
 		rcu_read_unlock();
 	} else {
-		mem = memcg;
-		css_get(&memcg->css);
+		mem = *memcg;
+		css_get(&mem->css);
 	}
 
+
 	while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE))) {
 		if (!(gfp_mask & __GFP_WAIT))
-			goto out;
+			goto nomem;
 
 		if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
 			continue;
@@ -531,18 +529,33 @@ static int mem_cgroup_charge_common(stru
 
 		if (!nr_retries--) {
 			mem_cgroup_out_of_memory(mem, gfp_mask);
-			goto out;
+			goto nomem;
 		}
 	}
+	return 0;
+nomem:
+	css_put(&mem->css);
+	return -ENOMEM;
+}
 
+/*
+ * commit a charge got by mem_cgroup_try_charge() and makes page_cgroup to be
+ * USED state. If already USED, uncharge and return.
+ */
+
+static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
+				     struct page_cgroup *pc,
+				     enum charge_type ctype)
+{
+	struct mem_cgroup_per_zone *mz;
+	unsigned long flags;
 
 	lock_page_cgroup(pc);
 	if (unlikely(PageCgroupUsed(pc))) {
 		unlock_page_cgroup(pc);
 		res_counter_uncharge(&mem->res, PAGE_SIZE);
 		css_put(&mem->css);
-
-		goto done;
+		return;
 	}
 	pc->mem_cgroup = mem;
 	/*
@@ -557,15 +570,39 @@ static int mem_cgroup_charge_common(stru
 	__mem_cgroup_add_list(mz, pc);
 	spin_unlock_irqrestore(&mz->lru_lock, flags);
 	unlock_page_cgroup(pc);
+}
 
-done:
+/*
+ * Charge the memory controller for page usage.
+ * Return
+ * 0 if the charge was successful
+ * < 0 if the cgroup is over its limit
+ */
+static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
+				gfp_t gfp_mask, enum charge_type ctype,
+				struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *mem;
+	struct page_cgroup *pc;
+	int ret;
+
+	pc = lookup_page_cgroup(page);
+	/* can happen at boot */
+	if (unlikely(!pc))
+		return 0;
+	prefetchw(pc);
+
+	mem = memcg;
+	ret = mem_cgroup_try_charge(mm, gfp_mask, &mem);
+	if (ret)
+		return ret;
+
+	__mem_cgroup_commit_charge(mem, pc, ctype);
 	return 0;
-out:
-	css_put(&mem->css);
-	return -ENOMEM;
 }
 
-int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
+int mem_cgroup_newpage_charge(struct page *page,
+			      struct mm_struct *mm, gfp_t gfp_mask)
 {
 	if (mem_cgroup_subsys.disabled)
 		return 0;
@@ -586,6 +623,34 @@ int mem_cgroup_charge(struct page *page,
 				MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
 }
 
+/*
+ * same as mem_cgroup_newpage_charge(), now.
+ * But what we assume is different from newpage, and this is special case.
+ * treat this in special function. easy for maintainance.
+ */
+
+int mem_cgroup_charge_migrate_fixup(struct page *page,
+				struct mm_struct *mm, gfp_t gfp_mask)
+{
+	if (mem_cgroup_subsys.disabled)
+		return 0;
+
+	if (PageCompound(page))
+		return 0;
+
+	if (page_mapped(page) || (page->mapping && !PageAnon(page)))
+		return 0;
+
+	if (unlikely(!mm))
+		mm = &init_mm;
+
+	return mem_cgroup_charge_common(page, mm, gfp_mask,
+				MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
+}
+
+
+
+
 int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask)
 {
@@ -628,6 +693,30 @@ int mem_cgroup_cache_charge(struct page 
 				MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
 }
 
+
+void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
+{
+	struct page_cgroup *pc;
+
+	if (mem_cgroup_subsys.disabled)
+		return;
+	if (!ptr)
+		return;
+	pc = lookup_page_cgroup(page);
+	__mem_cgroup_commit_charge(ptr, pc, MEM_CGROUP_CHARGE_TYPE_MAPPED);
+}
+
+void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
+{
+	if (mem_cgroup_subsys.disabled)
+		return;
+	if (!mem)
+		return;
+	res_counter_uncharge(&mem->res, PAGE_SIZE);
+	css_put(&mem->css);
+}
+
+
 /*
  * uncharge if !page_mapped(page)
  */
Index: mmotm-2.6.27-rc7+/mm/memory.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memory.c
+++ mmotm-2.6.27-rc7+/mm/memory.c
@@ -1891,7 +1891,7 @@ gotten:
 	cow_user_page(new_page, old_page, address, vma);
 	__SetPageUptodate(new_page);
 
-	if (mem_cgroup_charge(new_page, mm, GFP_KERNEL))
+	if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))
 		goto oom_free_new;
 
 	/*
@@ -2287,6 +2287,7 @@ static int do_swap_page(struct mm_struct
 	struct page *page;
 	swp_entry_t entry;
 	pte_t pte;
+	struct mem_cgroup *ptr = NULL;
 	int ret = 0;
 
 	if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
@@ -2325,7 +2326,7 @@ static int do_swap_page(struct mm_struct
 	lock_page(page);
 	delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
 
-	if (mem_cgroup_charge(page, mm, GFP_KERNEL)) {
+	if (mem_cgroup_try_charge(mm, GFP_KERNEL, &ptr) == -ENOMEM) {
 		ret = VM_FAULT_OOM;
 		unlock_page(page);
 		goto out;
@@ -2355,6 +2356,7 @@ static int do_swap_page(struct mm_struct
 	flush_icache_page(vma, page);
 	set_pte_at(mm, address, page_table, pte);
 	page_add_anon_rmap(page, vma, address);
+	mem_cgroup_commit_charge_swapin(page, ptr);
 
 	swap_free(entry);
 	if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
@@ -2375,7 +2377,7 @@ unlock:
 out:
 	return ret;
 out_nomap:
-	mem_cgroup_uncharge_page(page);
+	mem_cgroup_cancel_charge_swapin(ptr);
 	pte_unmap_unlock(page_table, ptl);
 	unlock_page(page);
 	page_cache_release(page);
@@ -2405,7 +2407,7 @@ static int do_anonymous_page(struct mm_s
 		goto oom;
 	__SetPageUptodate(page);
 
-	if (mem_cgroup_charge(page, mm, GFP_KERNEL))
+	if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))
 		goto oom_free_page;
 
 	entry = mk_pte(page, vma->vm_page_prot);
@@ -2498,7 +2500,7 @@ static int __do_fault(struct mm_struct *
 				ret = VM_FAULT_OOM;
 				goto out;
 			}
-			if (mem_cgroup_charge(page, mm, GFP_KERNEL)) {
+			if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL)) {
 				ret = VM_FAULT_OOM;
 				page_cache_release(page);
 				goto out;
Index: mmotm-2.6.27-rc7+/mm/migrate.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/migrate.c
+++ mmotm-2.6.27-rc7+/mm/migrate.c
@@ -133,7 +133,7 @@ static void remove_migration_pte(struct 
 	 * be reliable, and this charge can actually fail: oh well, we don't
 	 * make the situation any worse by proceeding as if it had succeeded.
 	 */
-	mem_cgroup_charge(new, mm, GFP_ATOMIC);
+	mem_cgroup_charge_migrate_fixup(new, mm, GFP_ATOMIC);
 
 	get_page(new);
 	pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
Index: mmotm-2.6.27-rc7+/mm/swapfile.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/swapfile.c
+++ mmotm-2.6.27-rc7+/mm/swapfile.c
@@ -530,17 +530,18 @@ unsigned int count_swap_pages(int type, 
 static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, swp_entry_t entry, struct page *page)
 {
+	struct mem_cgroup *ptr;
 	spinlock_t *ptl;
 	pte_t *pte;
 	int ret = 1;
 
-	if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL))
+	if (mem_cgroup_try_charge(vma->vm_mm, GFP_KERNEL, &ptr))
 		ret = -ENOMEM;
 
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	if (unlikely(!pte_same(*pte, swp_entry_to_pte(entry)))) {
 		if (ret > 0)
-			mem_cgroup_uncharge_page(page);
+			mem_cgroup_cancel_charge_swapin(ptr);
 		ret = 0;
 		goto out;
 	}
@@ -550,6 +551,7 @@ static int unuse_pte(struct vm_area_stru
 	set_pte_at(vma->vm_mm, addr, pte,
 		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
 	page_add_anon_rmap(page, vma, addr);
+	mem_cgroup_commit_charge_swapin(page, ptr);
 	swap_free(entry);
 	/*
 	 * Move the page to the active list so it is not



* [PATCH 4/6] memcg: new force_empty and move_account
  2008-10-01  7:52 [PATCH 0/6] memcg update v6 (for review and discuss) KAMEZAWA Hiroyuki
                   ` (2 preceding siblings ...)
  2008-10-01  7:57 ` [PATCH 3/6] memcg: charge-commit-cancel protocol KAMEZAWA Hiroyuki
@ 2008-10-01  7:59 ` KAMEZAWA Hiroyuki
  2008-10-01 16:38   ` Randy Dunlap
  2008-10-01  8:00 ` [PATCH 5/6] memcg: lazy lru freeing KAMEZAWA Hiroyuki
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-10-01  7:59 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, LKML, balbir, nishimura

This patch provides a function to move the account information of a page
between mem_cgroups, and rewrites force_empty to make use of it.

This move of a page_cgroup is done with
 - the lru_lock of the source/destination mem_cgroup held, and
 - lock_page_cgroup() held.

Therefore, a routine which touches pc->mem_cgroup without lock_page_cgroup()
must confirm whether pc->mem_cgroup is still valid. Typical code looks like
the following.

(while the page is not under lock_page())
	mem = pc->mem_cgroup;
	mz = page_cgroup_zoneinfo(pc);
	spin_lock_irqsave(&mz->lru_lock, flags);
	if (pc->mem_cgroup == mem)
		... /* some list handling */
	spin_unlock_irqrestore(&mz->lru_lock, flags);

Of course, the better way is
	lock_page_cgroup(pc);
	....
	unlock_page_cgroup(pc);
but then you must check the lock nesting and avoid deadlock.

If you handle page_cgroups taken from a mem_cgroup's LRU under mz->lru_lock,
you don't have to worry about what pc->mem_cgroup points to.
Moved pages are added to the head of the LRU, not to the tail.
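
The calling convention for mem_cgroup_move_account(), sketched from its
kerneldoc in the diff below (the pre-charge of the destination and the local
variables from_mz/flags are assumptions, not code from the patch):

	/* caller has already charged "to" by PAGE_SIZE */
	from_mz = mem_cgroup_zoneinfo(from, page_cgroup_nid(pc), page_cgroup_zid(pc));

	spin_lock_irqsave(&from_mz->lru_lock, flags);	/* irqs off + from's lru_lock held */
	ret = mem_cgroup_move_account(pc, from, to);	/* 0 on success, -EBUSY if pc is unstable */
	spin_unlock_irqrestore(&from_mz->lru_lock, flags);

	if (ret)
		res_counter_uncharge(&to->res, PAGE_SIZE);	/* give back the pre-charge */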

Expected users of this routine are:
  - force_empty (rmdir)
  - moving tasks between cgroups (to move account information)
  - hierarchy (maybe useful)

force_empty (rmdir) uses this move_account to move pages to the cgroup's
parent. This "move" will not cause OOM (I added an "oom" parameter to
try_charge()).

If the parent is busy (not enough memory), force_empty calls
try_to_free_page() to reduce usage.

The purpose of this behavior is to
  - fix the "forget all" behavior of force_empty and avoid leaking accounting, and
  - keep pages in memory as much as possible by "moving first, freeing only if
    necessary".

Adding a switch to change the behavior of force_empty to
  - free first, move if necessary, or
  - free everything, returning -EBUSY if there are mlocked/busy pages
is under consideration.

Changelog: (v5) -> (v6)
  - removed an unnecessary check.
  - do everything under lock_page_cgroup().
  - removed res_counter_charge() from the move function itself
    (and modified the try_charge() function).
  - added an argument to add_list() to specify whether to add the page_cgroup
    at the head or the tail.
  - merged with the force_empty patch (to answer the "who is the user?" question).

Changelog: (v4) -> (v5)
  - removed the check for lock_page().
  - rewrote the description.

Changelog: (v2) -> (v4)
  - added lock_page_cgroup().
  - split out from the new-force-empty patch.
  - added how-to-use text.
  - fixed a race in __mem_cgroup_uncharge_common().

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 Documentation/controllers/memory.txt |   10 -
 mm/memcontrol.c                      |  267 +++++++++++++++++++++++++++--------
 2 files changed, 216 insertions(+), 61 deletions(-)

Index: mmotm-2.6.27-rc7+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc7+/mm/memcontrol.c
@@ -257,7 +257,7 @@ static void __mem_cgroup_remove_list(str
 }
 
 static void __mem_cgroup_add_list(struct mem_cgroup_per_zone *mz,
-				struct page_cgroup *pc)
+				struct page_cgroup *pc, bool hot)
 {
 	int lru = LRU_BASE;
 
@@ -271,7 +271,10 @@ static void __mem_cgroup_add_list(struct
 	}
 
 	MEM_CGROUP_ZSTAT(mz, lru) += 1;
-	list_add(&pc->lru, &mz->lists[lru]);
+	if (hot)
+		list_add(&pc->lru, &mz->lists[lru]);
+	else
+		list_add_tail(&pc->lru, &mz->lists[lru]);
 
 	mem_cgroup_charge_statistics(pc->mem_cgroup, pc, true);
 }
@@ -467,21 +470,12 @@ unsigned long mem_cgroup_isolate_pages(u
 	return nr_taken;
 }
 
-
-/**
- * mem_cgroup_try_charge - get charge of PAGE_SIZE.
- * @mm: an mm_struct which is charged against. (when *memcg is NULL)
- * @gfp_mask: gfp_mask for reclaim.
- * @memcg: a pointer to memory cgroup which is charged against.
- *
- * charge aginst memory cgroup pointed by *memcg. if *memcg == NULL, estimated
- * memory cgroup from @mm is got and stored in *memcg.
- *
- * Retruns 0 if success. -ENOMEM at failure.
+/*
+ * Unlike exported interface, "oom" parameter is added. if oom==true,
+ * oom-killer can be invoked.
  */
-
-int mem_cgroup_try_charge(struct mm_struct *mm,
-			gfp_t gfp_mask, struct mem_cgroup **memcg)
+static int __mem_cgroup_try_charge(struct mm_struct *mm,
+			gfp_t gfp_mask, struct mem_cgroup **memcg, bool oom)
 {
 	struct mem_cgroup *mem;
 	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
@@ -528,7 +522,8 @@ int mem_cgroup_try_charge(struct mm_stru
 			continue;
 
 		if (!nr_retries--) {
-			mem_cgroup_out_of_memory(mem, gfp_mask);
+			if (oom)
+				mem_cgroup_out_of_memory(mem, gfp_mask);
 			goto nomem;
 		}
 	}
@@ -538,6 +533,24 @@ nomem:
 	return -ENOMEM;
 }
 
+/**
+ * mem_cgroup_try_charge - get charge of PAGE_SIZE.
+ * @mm: an mm_struct which is charged against. (when *memcg is NULL)
+ * @gfp_mask: gfp_mask for reclaim.
+ * @memcg: a pointer to memory cgroup which is charged against.
+ *
+ * charge aginst memory cgroup pointed by *memcg. if *memcg == NULL, estimated
+ * memory cgroup from @mm is got and stored in *memcg.
+ *
+ * Retruns 0 if success. -ENOMEM at failure.
+ */
+
+int mem_cgroup_try_charge(struct mm_struct *mm,
+			  gfp_t mask, struct mem_cgroup **memcg)
+{
+	return __mem_cgroup_try_charge(mm, mask, memcg, false);
+}
+
 /*
  * commit a charge got by mem_cgroup_try_charge() and makes page_cgroup to be
  * USED state. If already USED, uncharge and return.
@@ -567,11 +580,109 @@ static void __mem_cgroup_commit_charge(s
 	mz = page_cgroup_zoneinfo(pc);
 
 	spin_lock_irqsave(&mz->lru_lock, flags);
-	__mem_cgroup_add_list(mz, pc);
+	__mem_cgroup_add_list(mz, pc, true);
 	spin_unlock_irqrestore(&mz->lru_lock, flags);
 	unlock_page_cgroup(pc);
 }
 
+/**
+ * mem_cgroup_move_account - move account of the page
+ * @pc   ... page_cgroup of the page.
+ * @from ... mem_cgroup which the page is moved from.
+ * @to   ... mem_cgroup which the page is moved to. @from != @to.
+ *
+ * The caller must confirm following.
+ * 1. disable irq.
+ * 2. lru_lock of old mem_cgroup(@from) should be held.
+ *
+ * returns 0 at success,
+ * returns -EBUSY when lock is busy or "pc" is unstable.
+ *
+ * This function do "uncharge" from old cgroup but doesn't do "charge" to
+ * new cgroup. It should be done by a caller.
+ */
+
+static int mem_cgroup_move_account(struct page_cgroup *pc,
+	struct mem_cgroup *from, struct mem_cgroup *to)
+{
+	struct mem_cgroup_per_zone *from_mz, *to_mz;
+	int nid, zid;
+	int ret = -EBUSY;
+
+	VM_BUG_ON(!irqs_disabled());
+	VM_BUG_ON(from == to);
+
+	nid = page_cgroup_nid(pc);
+	zid = page_cgroup_zid(pc);
+	from_mz =  mem_cgroup_zoneinfo(from, nid, zid);
+	to_mz =  mem_cgroup_zoneinfo(to, nid, zid);
+
+
+	if (!trylock_page_cgroup(pc))
+		return ret;
+
+	if (!PageCgroupUsed(pc))
+		goto out;
+
+	if (pc->mem_cgroup != from)
+		goto out;
+
+	if (spin_trylock(&to_mz->lru_lock)) {
+		__mem_cgroup_remove_list(from_mz, pc);
+		css_put(&from->css);
+		res_counter_uncharge(&from->res, PAGE_SIZE);
+		pc->mem_cgroup = to;
+		css_get(&to->css);
+		__mem_cgroup_add_list(to_mz, pc, false);
+		ret = 0;
+		spin_unlock(&to_mz->lru_lock);
+	}
+out:
+	unlock_page_cgroup(pc);
+	return ret;
+}
+
+/*
+ * move charges to its parent.
+ */
+
+static int mem_cgroup_move_parent(struct page_cgroup *pc,
+				  struct mem_cgroup *child,
+				  gfp_t gfp_mask)
+{
+	struct cgroup *cg = child->css.cgroup;
+	struct cgroup *pcg = cg->parent;
+	struct mem_cgroup *parent;
+	struct mem_cgroup_per_zone *mz;
+	unsigned long flags;
+	int ret;
+
+	/* Is ROOT ? */
+	if (!pcg)
+		return -EINVAL;
+
+	parent = mem_cgroup_from_cont(pcg);
+
+	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false);
+	if (ret)
+		return ret;
+
+	mz = mem_cgroup_zoneinfo(child,
+			page_cgroup_nid(pc), page_cgroup_zid(pc));
+
+	spin_lock_irqsave(&mz->lru_lock, flags);
+	ret = mem_cgroup_move_account(pc, child, parent);
+	spin_unlock_irqrestore(&mz->lru_lock, flags);
+
+	/* drop extra refcnt */
+	css_put(&parent->css);
+	/* uncharge if move fails */
+	if (ret)
+		res_counter_uncharge(&parent->res, PAGE_SIZE);
+
+	return ret;
+}
+
 /*
  * Charge the memory controller for page usage.
  * Return
@@ -593,7 +704,7 @@ static int mem_cgroup_charge_common(stru
 	prefetchw(pc);
 
 	mem = memcg;
-	ret = mem_cgroup_try_charge(mm, gfp_mask, &mem);
+	ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true);
 	if (ret)
 		return ret;
 
@@ -896,46 +1007,52 @@ int mem_cgroup_resize_limit(struct mem_c
  * This routine traverse page_cgroup in given list and drop them all.
  * *And* this routine doesn't reclaim page itself, just removes page_cgroup.
  */
-#define FORCE_UNCHARGE_BATCH	(128)
-static void mem_cgroup_force_empty_list(struct mem_cgroup *mem,
+static int mem_cgroup_force_empty_list(struct mem_cgroup *mem,
 			    struct mem_cgroup_per_zone *mz,
 			    enum lru_list lru)
 {
-	struct page_cgroup *pc;
-	struct page *page;
-	int count = FORCE_UNCHARGE_BATCH;
+	struct page_cgroup *pc, *busy;
 	unsigned long flags;
+	unsigned long loop;
 	struct list_head *list;
+	int ret = 0;
 
 	list = &mz->lists[lru];
 
-	spin_lock_irqsave(&mz->lru_lock, flags);
-	while (!list_empty(list)) {
-		pc = list_entry(list->prev, struct page_cgroup, lru);
-		page = pc->page;
-		if (!PageCgroupUsed(pc))
+	loop = MEM_CGROUP_ZSTAT(mz, lru);
+	/* give some margin against EBUSY etc...*/
+	loop += 256;
+	busy = NULL;
+	while (loop--) {
+		ret = 0;
+		spin_lock_irqsave(&mz->lru_lock, flags);
+		if (list_empty(list)) {
+			spin_unlock_irqrestore(&mz->lru_lock, flags);
 			break;
-		get_page(page);
+		}
+		pc = list_entry(list->prev, struct page_cgroup, lru);
+		if (busy == pc) {
+			list_move(&pc->lru, list);
+			busy = 0;
+			spin_unlock_irqrestore(&mz->lru_lock, flags);
+			continue;
+		}
 		spin_unlock_irqrestore(&mz->lru_lock, flags);
-		/*
-		 * Check if this page is on LRU. !LRU page can be found
-		 * if it's under page migration.
-		 */
-		if (PageLRU(page)) {
-			__mem_cgroup_uncharge_common(page,
-					MEM_CGROUP_CHARGE_TYPE_FORCE);
-			put_page(page);
-			if (--count <= 0) {
-				count = FORCE_UNCHARGE_BATCH;
-				cond_resched();
-			}
-		} else {
-			spin_lock_irqsave(&mz->lru_lock, flags);
+
+		ret = mem_cgroup_move_parent(pc, mem, GFP_HIGHUSER_MOVABLE);
+		if (ret == -ENOMEM)
 			break;
-		}
-		spin_lock_irqsave(&mz->lru_lock, flags);
+
+		if (ret == -EBUSY || ret == -EINVAL) {
+			/* found lock contention or "pc" is obsolete. */
+			busy = pc;
+			cond_resched();
+		} else
+			busy = NULL;
 	}
-	spin_unlock_irqrestore(&mz->lru_lock, flags);
+	if (!ret && !list_empty(list))
+		return -EBUSY;
+	return ret;
 }
 
 /*
@@ -944,32 +1061,66 @@ static void mem_cgroup_force_empty_list(
  */
 static int mem_cgroup_force_empty(struct mem_cgroup *mem)
 {
-	int ret = -EBUSY;
-	int node, zid;
+	int ret;
+	int node, zid, shrink;
+	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 
 	css_get(&mem->css);
-	/*
-	 * page reclaim code (kswapd etc..) will move pages between
-	 * active_list <-> inactive_list while we don't take a lock.
-	 * So, we have to do loop here until all lists are empty.
-	 */
+
+	shrink = 0;
+move_account:
 	while (mem->res.usage > 0) {
+		ret = -EBUSY;
 		if (atomic_read(&mem->css.cgroup->count) > 0)
 			goto out;
-		for_each_node_state(node, N_POSSIBLE)
-			for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+
+		ret = 0;
+		for_each_node_state(node, N_POSSIBLE) {
+			for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
 				struct mem_cgroup_per_zone *mz;
 				enum lru_list l;
 				mz = mem_cgroup_zoneinfo(mem, node, zid);
-				for_each_lru(l)
-					mem_cgroup_force_empty_list(mem, mz, l);
+				for_each_lru(l) {
+					ret = mem_cgroup_force_empty_list(mem,
+								  mz, l);
+					if (ret)
+						break;
+				}
 			}
+			if (ret)
+				break;
+		}
+		/* it seems parent cgroup doesn't have enough mem */
+		if (ret == -ENOMEM)
+			goto try_to_free;
 		cond_resched();
 	}
 	ret = 0;
 out:
 	css_put(&mem->css);
 	return ret;
+
+try_to_free:
+	/* returns EBUSY if we come here twice. */
+	if (shrink)  {
+		ret = -EBUSY;
+		goto out;
+	}
+	/* try to free all pages in this cgroup */
+	shrink = 1;
+	while (nr_retries && mem->res.usage > 0) {
+		int progress;
+		progress = try_to_free_mem_cgroup_pages(mem,
+						  GFP_HIGHUSER_MOVABLE);
+		if (!progress)
+			nr_retries--;
+
+	}
+	/* try move_account...there may be some *locked* pages. */
+	if (mem->res.usage)
+		goto move_account;
+	ret = 0;
+	goto out;
 }
 
 static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
Index: mmotm-2.6.27-rc7+/Documentation/controllers/memory.txt
===================================================================
--- mmotm-2.6.27-rc7+.orig/Documentation/controllers/memory.txt
+++ mmotm-2.6.27-rc7+/Documentation/controllers/memory.txt
@@ -211,7 +211,9 @@ The memory.force_empty gives an interfac
 
 # echo 1 > memory.force_empty
 
-will drop all charges in cgroup. Currently, this is maintained for test.
+Will move account to parent. if parenet is full, will try to free pages.
+If both of a parent and a child are busy, return -EBUSY;
+This file, memory.force_empty, is just for debug purpose.
 
 4. Testing
 
@@ -242,8 +244,10 @@ reclaimed.
 
 A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
 cgroup might have some charge associated with it, even though all
-tasks have migrated away from it. Such charges are automatically dropped at
-rmdir() if there are no tasks.
+tasks have migrated away from it.
+Such charges are moved to its parent as mush as possible and freed if parent
+seems to be full. (see force_empty)
+If both of them are busy, rmdir() returns -EBUSY.
 
 5. TODO
 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 5/6] memcg: lazy lru freeing
  2008-10-01  7:52 [PATCH 0/6] memcg update v6 (for review and discuss) KAMEZAWA Hiroyuki
                   ` (3 preceding siblings ...)
  2008-10-01  7:59 ` [PATCH 4/6] memcg: new force_empty and move_account KAMEZAWA Hiroyuki
@ 2008-10-01  8:00 ` KAMEZAWA Hiroyuki
  2008-10-09  5:39   ` Daisuke Nishimura
  2008-10-01  8:01 ` [PATCH 6/6] memcg: lazy lru addition KAMEZAWA Hiroyuki
  2008-10-02  9:02 ` [PATCH 0/6] memcg update v6 (for review and discuss) KAMEZAWA Hiroyuki
  6 siblings, 1 reply; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-10-01  8:00 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, LKML, balbir, nishimura

Free page_cgroups from their LRU in a batched manner.

When uncharge() is called, the page_cgroup is pushed onto a per-cpu vector and
removed from the LRU later. This routine resembles the global LRU's pagevec.
This patch is one half of the whole work and forms a set with the following
lazy LRU add patch. A minimal sketch of the batching pattern is shown below.
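
A minimal sketch (names taken from the patch below; the real
__release_page_cgroup() also batches the lru_lock across entries that belong
to the same zone):

        /* uncharge side: queue the page_cgroup on a per-cpu vector */
        mpv = &get_cpu_var(memcg_free_vec);
        mpv->vec[mpv->nr++] = pc;
        if (mpv->nr >= mpv->limit)
                __release_page_cgroup(mpv);     /* drain under mz->lru_lock */
        put_cpu_var(memcg_free_vec);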

After this, a pc with PageCgroupLRU(pc) == true is on the LRU.
This LRU bit is guarded by the lru_lock.

 PageCgroupUsed(pc) && PageCgroupLRU(pc) means "pc" is used and on the LRU.
 This check makes sense only when both locks, lock_page_cgroup() and the
 lru_lock, are acquired.

 PageCgroupUsed(pc) && !PageCgroupLRU(pc) means "pc" is used but not on the LRU.
 !PageCgroupUsed(pc) && PageCgroupLRU(pc) means "pc" is unused but still on the
 LRU. LRU walk routines should avoid touching such a pc, as in the example below.
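
For example, the LRU scan added in patch 2/6 simply skips such stale entries:

        /* in mem_cgroup_isolate_pages(), while walking mz->lists[lru] */
        if (unlikely(!PageCgroupUsed(pc)))
                continue;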

Changelog (v5) => (v6):
 - Fixed a race and added the PCG_LRU bit

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 include/linux/page_cgroup.h |    5 +
 mm/memcontrol.c             |  201 ++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 189 insertions(+), 17 deletions(-)

Index: mmotm-2.6.27-rc7+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc7+/mm/memcontrol.c
@@ -34,6 +34,7 @@
 #include <linux/vmalloc.h>
 #include <linux/mm_inline.h>
 #include <linux/page_cgroup.h>
+#include <linux/cpu.h>
 
 #include <asm/uaccess.h>
 
@@ -344,7 +345,7 @@ void mem_cgroup_move_lists(struct page *
 	pc = lookup_page_cgroup(page);
 	if (!trylock_page_cgroup(pc))
 		return;
-	if (pc && PageCgroupUsed(pc)) {
+	if (pc && PageCgroupUsed(pc) && PageCgroupLRU(pc)) {
 		mz = page_cgroup_zoneinfo(pc);
 		spin_lock_irqsave(&mz->lru_lock, flags);
 		__mem_cgroup_move_lists(pc, lru);
@@ -470,6 +471,122 @@ unsigned long mem_cgroup_isolate_pages(u
 	return nr_taken;
 }
 
+
+#define MEMCG_PCPVEC_SIZE	(14)	/* size of pagevec */
+struct memcg_percpu_vec {
+	int nr;
+	int limit;
+	struct page_cgroup *vec[MEMCG_PCPVEC_SIZE];
+};
+static DEFINE_PER_CPU(struct memcg_percpu_vec, memcg_free_vec);
+
+static void
+__release_page_cgroup(struct memcg_percpu_vec *mpv)
+{
+	unsigned long flags;
+	struct mem_cgroup_per_zone *mz, *prev_mz;
+	struct page_cgroup *pc;
+	int i, nr;
+
+	local_irq_save(flags);
+	nr = mpv->nr;
+	mpv->nr = 0;
+	prev_mz = NULL;
+	for (i = nr - 1; i >= 0; i--) {
+		pc = mpv->vec[i];
+		mz = page_cgroup_zoneinfo(pc);
+		if (prev_mz != mz) {
+			if (prev_mz)
+				spin_unlock(&prev_mz->lru_lock);
+			prev_mz = mz;
+			spin_lock(&mz->lru_lock);
+		}
+		/*
+		 * this "pc" may be charge()->uncharge() while we are waiting
+		 * for this. But charge() path check LRU bit and remove this
+		 * from LRU if necessary.
+		 */
+		if (!PageCgroupUsed(pc) && PageCgroupLRU(pc)) {
+			ClearPageCgroupLRU(pc);
+			__mem_cgroup_remove_list(mz, pc);
+			css_put(&pc->mem_cgroup->css);
+		}
+	}
+	if (prev_mz)
+		spin_unlock(&prev_mz->lru_lock);
+	local_irq_restore(flags);
+
+}
+
+static void
+release_page_cgroup(struct page_cgroup *pc)
+{
+	struct memcg_percpu_vec *mpv;
+
+	mpv = &get_cpu_var(memcg_free_vec);
+	mpv->vec[mpv->nr++] = pc;
+	if (mpv->nr >= mpv->limit)
+		__release_page_cgroup(mpv);
+	put_cpu_var(memcg_free_vec);
+}
+
+static void page_cgroup_start_cache_cpu(int cpu)
+{
+	struct memcg_percpu_vec *mpv;
+	mpv = &per_cpu(memcg_free_vec, cpu);
+	mpv->limit = MEMCG_PCPVEC_SIZE;
+}
+
+#ifdef CONFIG_HOTPLUG_CPU
+static void page_cgroup_stop_cache_cpu(int cpu)
+{
+	struct memcg_percpu_vec *mpv;
+	mpv = &per_cpu(memcg_free_vec, cpu);
+	mpv->limit = 0;
+}
+#endif
+
+
+/*
+ * Used when freeing memory resource controller to remove all
+ * page_cgroup (in obsolete list).
+ */
+static DEFINE_MUTEX(memcg_force_drain_mutex);
+
+static void drain_page_cgroup_local(struct work_struct *work)
+{
+	struct memcg_percpu_vec *mpv;
+	mpv = &get_cpu_var(memcg_free_vec);
+	__release_page_cgroup(mpv);
+	put_cpu_var(mpv);
+}
+
+static void drain_page_cgroup_cpu(int cpu)
+{
+	int local_cpu;
+	struct work_struct work;
+
+	local_cpu = get_cpu();
+	if (local_cpu == cpu) {
+		drain_page_cgroup_local(NULL);
+		put_cpu();
+		return;
+	}
+	put_cpu();
+
+	INIT_WORK(&work, drain_page_cgroup_local);
+	schedule_work_on(cpu, &work);
+	flush_work(&work);
+}
+
+static void drain_page_cgroup_all(void)
+{
+	mutex_lock(&memcg_force_drain_mutex);
+	schedule_on_each_cpu(drain_page_cgroup_local);
+	mutex_unlock(&memcg_force_drain_mutex);
+}
+
+
 /*
  * Unlike exported interface, "oom" parameter is added. if oom==true,
  * oom-killer can be invoked.
@@ -564,25 +681,46 @@ static void __mem_cgroup_commit_charge(s
 	unsigned long flags;
 
 	lock_page_cgroup(pc);
+	/*
+	 * USED bit is set after pc->mem_cgroup has valid value.
+	 */
 	if (unlikely(PageCgroupUsed(pc))) {
 		unlock_page_cgroup(pc);
 		res_counter_uncharge(&mem->res, PAGE_SIZE);
 		css_put(&mem->css);
 		return;
 	}
+	/*
+	 * This page_cgroup is not used but may be on LRU.
+	 */
+	if (unlikely(PageCgroupLRU(pc))) {
+		/*
+		 * pc->mem_cgroup has old information. force_empty() guarantee
+		 * that we never see stale mem_cgroup here.
+		 */
+		mz = page_cgroup_zoneinfo(pc);
+		spin_lock_irqsave(&mz->lru_lock, flags);
+		if (PageCgroupLRU(pc)) {
+			ClearPageCgroupLRU(pc);
+			__mem_cgroup_remove_list(mz, pc);
+			css_put(&pc->mem_cgroup->css);
+		}
+		spin_unlock_irqrestore(&mz->lru_lock, flags);
+	}
+	/* Here, PCG_LRU bit is cleared */
 	pc->mem_cgroup = mem;
 	/*
-	 * If a page is accounted as a page cache, insert to inactive list.
-	 * If anon, insert to active list.
+	 * below pcg_default_flags includes PCG_LOCK bit.
 	 */
 	pc->flags = pcg_default_flags[ctype];
+	unlock_page_cgroup(pc);
 
 	mz = page_cgroup_zoneinfo(pc);
 
 	spin_lock_irqsave(&mz->lru_lock, flags);
 	__mem_cgroup_add_list(mz, pc, true);
+	SetPageCgroupLRU(pc);
 	spin_unlock_irqrestore(&mz->lru_lock, flags);
-	unlock_page_cgroup(pc);
 }
 
 /**
@@ -621,7 +759,7 @@ static int mem_cgroup_move_account(struc
 	if (!trylock_page_cgroup(pc))
 		return ret;
 
-	if (!PageCgroupUsed(pc))
+	if (!PageCgroupUsed(pc) || !PageCgroupLRU(pc))
 		goto out;
 
 	if (pc->mem_cgroup != from)
@@ -836,8 +974,6 @@ __mem_cgroup_uncharge_common(struct page
 {
 	struct page_cgroup *pc;
 	struct mem_cgroup *mem;
-	struct mem_cgroup_per_zone *mz;
-	unsigned long flags;
 
 	if (mem_cgroup_subsys.disabled)
 		return;
@@ -858,16 +994,13 @@ __mem_cgroup_uncharge_common(struct page
 	}
 	ClearPageCgroupUsed(pc);
 	mem = pc->mem_cgroup;
-
-	mz = page_cgroup_zoneinfo(pc);
-	spin_lock_irqsave(&mz->lru_lock, flags);
-	__mem_cgroup_remove_list(mz, pc);
-	spin_unlock_irqrestore(&mz->lru_lock, flags);
-	unlock_page_cgroup(pc);
-
+	/*
+	 * We must uncharge here because "reuse" can occur just after we
+	 * unlock this.
+	 */
 	res_counter_uncharge(&mem->res, PAGE_SIZE);
-	css_put(&mem->css);
-
+	unlock_page_cgroup(pc);
+	release_page_cgroup(pc);
 	return;
 }
 
@@ -1073,7 +1206,7 @@ move_account:
 		ret = -EBUSY;
 		if (atomic_read(&mem->css.cgroup->count) > 0)
 			goto out;
-
+		drain_page_cgroup_all();
 		ret = 0;
 		for_each_node_state(node, N_POSSIBLE) {
 			for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
@@ -1097,6 +1230,7 @@ move_account:
 	}
 	ret = 0;
 out:
+	drain_page_cgroup_all();
 	css_put(&mem->css);
 	return ret;
 
@@ -1318,6 +1452,38 @@ static void mem_cgroup_free(struct mem_c
 		vfree(mem);
 }
 
+static void mem_cgroup_init_pcp(int cpu)
+{
+	page_cgroup_start_cache_cpu(cpu);
+}
+
+static int cpu_memcgroup_callback(struct notifier_block *nb,
+			unsigned long action, void *hcpu)
+{
+	int cpu = (long)hcpu;
+
+	switch (action) {
+	case CPU_UP_PREPARE:
+	case CPU_UP_PREPARE_FROZEN:
+		mem_cgroup_init_pcp(cpu);
+		break;
+#ifdef CONFIG_HOTPLUG_CPU
+	case CPU_DOWN_PREPARE:
+	case CPU_DOWN_PREPARE_FROZEN:
+		page_cgroup_stop_cache_cpu(cpu);
+		drain_page_cgroup_cpu(cpu);
+		break;
+#endif
+	default:
+		break;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block __refdata memcgroup_nb =
+{
+	.notifier_call = cpu_memcgroup_callback,
+};
 
 static struct cgroup_subsys_state *
 mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
@@ -1328,6 +1494,10 @@ mem_cgroup_create(struct cgroup_subsys *
 	if (unlikely((cont->parent) == NULL)) {
 		page_cgroup_init();
 		mem = &init_mem_cgroup;
+		cpu_memcgroup_callback(&memcgroup_nb,
+					(unsigned long)CPU_UP_PREPARE,
+					(void *)(long)smp_processor_id());
+		register_hotcpu_notifier(&memcgroup_nb);
 	} else {
 		mem = mem_cgroup_alloc();
 		if (!mem)
Index: mmotm-2.6.27-rc7+/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.27-rc7+.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.27-rc7+/include/linux/page_cgroup.h
@@ -24,6 +24,7 @@ enum {
 	PCG_LOCK,  /* page cgroup is locked */
 	PCG_CACHE, /* charged as cache */
 	PCG_USED, /* this object is in use. */
+	PCG_LRU, /* on LRU */
 	/* flags for LRU placement */
 	PCG_ACTIVE, /* page is active in this cgroup */
 	PCG_FILE, /* page is file system backed */
@@ -48,6 +49,10 @@ TESTPCGFLAG(Cache, CACHE)
 TESTPCGFLAG(Used, USED)
 CLEARPCGFLAG(Used, USED)
 
+SETPCGFLAG(LRU, LRU)
+TESTPCGFLAG(LRU, LRU)
+CLEARPCGFLAG(LRU, LRU)
+
 /* LRU management flags (from global-lru definition) */
 TESTPCGFLAG(File, FILE)
 SETPCGFLAG(File, FILE)


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 6/6] memcg: lazy lru addition
  2008-10-01  7:52 [PATCH 0/6] memcg update v6 (for review and discuss) KAMEZAWA Hiroyuki
                   ` (4 preceding siblings ...)
  2008-10-01  8:00 ` [PATCH 5/6] memcg: lazy lru freeing KAMEZAWA Hiroyuki
@ 2008-10-01  8:01 ` KAMEZAWA Hiroyuki
  2008-10-09  6:21   ` Daisuke Nishimura
  2008-10-02  9:02 ` [PATCH 0/6] memcg update v6 (for review and discuss) KAMEZAWA Hiroyuki
  6 siblings, 1 reply; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-10-01  8:01 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, LKML, balbir, nishimura

Delay add_to_lru() and do it in a batched manner, like pagevec.
For doing that, the two flags PCG_USED and PCG_LRU are used.

Because __set_page_cgroup_lru() itself doesn't take lock_page_cgroup(),
we need a sanity check inside the lru_lock, as sketched below.
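
A minimal sketch of that check, taken from __set_page_cgroup_lru() in the
patch below:

        /* mem = pc->mem_cgroup was sampled before taking mz->lru_lock */
        if (PageCgroupUsed(pc) && !PageCgroupLRU(pc)) {
                /* an uncharge()->charge() may have happened while waiting */
                smp_rmb();
                if (likely(mem == pc->mem_cgroup)) {
                        SetPageCgroupLRU(pc);
                        __mem_cgroup_add_list(mz, pc, true);
                }
        }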

And this delaying makes css_get()/css_put() complicated.
To make it clear, the following rule is newly added:
 * css_get() is called from mem_cgroup_add_list().
 * css_put() is called from mem_cgroup_remove_list().
 * css_get()->css_put() is called during the try_charge()->commit/cancel
   sequence.

Changelog: v5 -> v6.
 - css_get()/put comes back again...it's called via add_list(), remove_list().
 - patch for PCG_LRU bit part is moved to release_page_cgroup_lru() patch.
 - Avoid TestSet and just use lock_page_cgroup() etc.
 - fixed race condition we saw in v5. (smp_wmb() and USED bit magic help us)

Changelog: v3 -> v5.
 - removed css_get/put per page_cgroup struct.
   Now, the *new* force_empty checks whether there is a page_cgroup on the memcg.
   We don't need to be afraid of leaks.

Changelog: v2 -> v3
 - added TRANSIT flag and removed lock from core logic.
Changelog: v1 -> v2:
 - renamed function name from use_page_cgroup to set_page_cgroup_lru().

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 mm/memcontrol.c |   83 +++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 73 insertions(+), 10 deletions(-)

Index: mmotm-2.6.27-rc7+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc7+/mm/memcontrol.c
@@ -255,6 +255,7 @@ static void __mem_cgroup_remove_list(str
 
 	mem_cgroup_charge_statistics(pc->mem_cgroup, pc, false);
 	list_del(&pc->lru);
+	css_put(&pc->mem_cgroup->css);
 }
 
 static void __mem_cgroup_add_list(struct mem_cgroup_per_zone *mz,
@@ -278,6 +279,7 @@ static void __mem_cgroup_add_list(struct
 		list_add_tail(&pc->lru, &mz->lists[lru]);
 
 	mem_cgroup_charge_statistics(pc->mem_cgroup, pc, true);
+	css_get(&pc->mem_cgroup->css);
 }
 
 static void __mem_cgroup_move_lists(struct page_cgroup *pc, enum lru_list lru)
@@ -479,6 +481,7 @@ struct memcg_percpu_vec {
 	struct page_cgroup *vec[MEMCG_PCPVEC_SIZE];
 };
 static DEFINE_PER_CPU(struct memcg_percpu_vec, memcg_free_vec);
+static DEFINE_PER_CPU(struct memcg_percpu_vec, memcg_add_vec);
 
 static void
 __release_page_cgroup(struct memcg_percpu_vec *mpv)
@@ -509,7 +512,6 @@ __release_page_cgroup(struct memcg_percp
 		if (!PageCgroupUsed(pc) && PageCgroupLRU(pc)) {
 			ClearPageCgroupLRU(pc);
 			__mem_cgroup_remove_list(mz, pc);
-			css_put(&pc->mem_cgroup->css);
 		}
 	}
 	if (prev_mz)
@@ -519,10 +521,53 @@ __release_page_cgroup(struct memcg_percp
 }
 
 static void
+__set_page_cgroup_lru(struct memcg_percpu_vec *mpv)
+{
+	unsigned long flags;
+	struct mem_cgroup *mem;
+	struct mem_cgroup_per_zone *mz, *prev_mz;
+	struct page_cgroup *pc;
+	int i, nr;
+
+	local_irq_save(flags);
+	nr = mpv->nr;
+	mpv->nr = 0;
+	prev_mz = NULL;
+
+	for (i = nr - 1; i >= 0; i--) {
+		pc = mpv->vec[i];
+		mem = pc->mem_cgroup;
+		mz = page_cgroup_zoneinfo(pc);
+		if (prev_mz != mz) {
+			if (prev_mz)
+				spin_unlock(&prev_mz->lru_lock);
+			prev_mz = mz;
+			spin_lock(&mz->lru_lock);
+		}
+		if (PageCgroupUsed(pc) && !PageCgroupLRU(pc)) {
+			/*
+			 * while we wait for lru_lock, uncharge()->charge()
+			 * can occur. check here pc->mem_cgroup is what
+			 * we expected or yet.
+			 */
+			smp_rmb();
+			if (likely(mem == pc->mem_cgroup)) {
+				SetPageCgroupLRU(pc);
+				__mem_cgroup_add_list(mz, pc, true);
+			}
+		}
+	}
+
+	if (prev_mz)
+		spin_unlock(&prev_mz->lru_lock);
+	local_irq_restore(flags);
+
+}
+
+static void
 release_page_cgroup(struct page_cgroup *pc)
 {
 	struct memcg_percpu_vec *mpv;
-
 	mpv = &get_cpu_var(memcg_free_vec);
 	mpv->vec[mpv->nr++] = pc;
 	if (mpv->nr >= mpv->limit)
@@ -530,11 +575,25 @@ release_page_cgroup(struct page_cgroup *
 	put_cpu_var(memcg_free_vec);
 }
 
+static void
+set_page_cgroup_lru(struct page_cgroup *pc)
+{
+	struct memcg_percpu_vec *mpv;
+
+	mpv = &get_cpu_var(memcg_add_vec);
+	mpv->vec[mpv->nr++] = pc;
+	if (mpv->nr >= mpv->limit)
+		__set_page_cgroup_lru(mpv);
+	put_cpu_var(memcg_add_vec);
+}
+
 static void page_cgroup_start_cache_cpu(int cpu)
 {
 	struct memcg_percpu_vec *mpv;
 	mpv = &per_cpu(memcg_free_vec, cpu);
 	mpv->limit = MEMCG_PCPVEC_SIZE;
+	mpv = &per_cpu(memcg_add_vec, cpu);
+	mpv->limit = MEMCG_PCPVEC_SIZE;
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
@@ -543,6 +602,8 @@ static void page_cgroup_stop_cache_cpu(i
 	struct memcg_percpu_vec *mpv;
 	mpv = &per_cpu(memcg_free_vec, cpu);
 	mpv->limit = 0;
+	mpv = &per_cpu(memcg_add_vec, cpu);
+	mpv->limit = 0;
 }
 #endif
 
@@ -556,6 +617,9 @@ static DEFINE_MUTEX(memcg_force_drain_mu
 static void drain_page_cgroup_local(struct work_struct *work)
 {
 	struct memcg_percpu_vec *mpv;
+	mpv = &get_cpu_var(memcg_add_vec);
+	__set_page_cgroup_lru(mpv);
+	put_cpu_var(mpv);
 	mpv = &get_cpu_var(memcg_free_vec);
 	__release_page_cgroup(mpv);
 	put_cpu_var(mpv);
@@ -710,17 +774,18 @@ static void __mem_cgroup_commit_charge(s
 	/* Here, PCG_LRU bit is cleared */
 	pc->mem_cgroup = mem;
 	/*
+	 * We have to set pc->mem_cgroup before set USED bit for avoiding
+	 * race with (delayed) __set_page_cgroup_lru() in other cpu.
+	 */
+	smp_wmb();
+	/*
 	 * below pcg_default_flags includes PCG_LOCK bit.
 	 */
 	pc->flags = pcg_default_flags[ctype];
 	unlock_page_cgroup(pc);
 
-	mz = page_cgroup_zoneinfo(pc);
-
-	spin_lock_irqsave(&mz->lru_lock, flags);
-	__mem_cgroup_add_list(mz, pc, true);
-	SetPageCgroupLRU(pc);
-	spin_unlock_irqrestore(&mz->lru_lock, flags);
+	set_page_cgroup_lru(pc);
+	css_put(&mem->css);
 }
 
 /**
@@ -767,10 +832,8 @@ static int mem_cgroup_move_account(struc
 
 	if (spin_trylock(&to_mz->lru_lock)) {
 		__mem_cgroup_remove_list(from_mz, pc);
-		css_put(&from->css);
 		res_counter_uncharge(&from->res, PAGE_SIZE);
 		pc->mem_cgroup = to;
-		css_get(&to->css);
 		__mem_cgroup_add_list(to_mz, pc, false);
 		ret = 0;
 		spin_unlock(&to_mz->lru_lock);


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/6] memcg: charge-commit-cancel protocl
  2008-10-01  7:57 ` [PATCH 3/6] memcg: charge-commit-cancel protocl KAMEZAWA Hiroyuki
@ 2008-10-01  8:33   ` Daisuke Nishimura
  2008-10-01 10:04   ` kamezawa.hiroyu
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 26+ messages in thread
From: Daisuke Nishimura @ 2008-10-01  8:33 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: nishimura, linux-mm, LKML, balbir

> @@ -531,18 +529,33 @@ static int mem_cgroup_charge_common(stru
>  
>  		if (!nr_retries--) {
>  			mem_cgroup_out_of_memory(mem, gfp_mask);
> -			goto out;
> +			goto nomem;
>  		}
>  	}
> +	return 0;
> +nomem:
> +	css_put(&mem->css);
> +	return -ENOMEM;
> +}
>  
> +/*
> + * commit a charge got by mem_cgroup_try_charge() and makes page_cgroup to be
> + * USED state. If already USED, uncharge and return.
> + */
> +
> +static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
> +				     struct page_cgroup *pc,
> +				     enum charge_type ctype)
> +{
> +	struct mem_cgroup_per_zone *mz;
> +	unsigned long flags;
>  
>  	lock_page_cgroup(pc);
>  	if (unlikely(PageCgroupUsed(pc))) {
>  		unlock_page_cgroup(pc);
>  		res_counter_uncharge(&mem->res, PAGE_SIZE);
>  		css_put(&mem->css);
> -
> -		goto done;
> +		return;
>  	}
>  	pc->mem_cgroup = mem;
>  	/*

Hmm, this patch cannot be applied because of this part.

After [2/6], mem_cgroup_charge_common looks like:

---
                if (!nr_retries--) {
                        mem_cgroup_out_of_memory(mem, gfp_mask);
                        goto out;
                }
        }


        lock_page_cgroup(pc);

        if (unlikely(PageCgroupUsed(pc))) {
                unlock_page_cgroup(pc);
                res_counter_uncharge(&mem->res, PAGE_SIZE);
                css_put(&mem->css);

                goto done;
        }
        pc->mem_cgroup = mem;
        /*
---

There is an empty line after lock_page_cgroup.

After removing this line, I can apply this patch (and [4-6/6]).


Thanks,
Daisuke Nishimura.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Re: [PATCH 3/6] memcg: charge-commit-cancel protocl
  2008-10-01  7:57 ` [PATCH 3/6] memcg: charge-commit-cancel protocl KAMEZAWA Hiroyuki
  2008-10-01  8:33   ` Daisuke Nishimura
@ 2008-10-01 10:04   ` kamezawa.hiroyu
  2008-10-03 10:05   ` Daisuke Nishimura
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 26+ messages in thread
From: kamezawa.hiroyu @ 2008-10-01 10:04 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: KAMEZAWA Hiroyuki, nishimura, linux-mm, LKML, balbir

----- Original Message -----
>> @@ -531,18 +529,33 @@ static int mem_cgroup_charge_common(stru
>>  
>>  		if (!nr_retries--) {
>>  			mem_cgroup_out_of_memory(mem, gfp_mask);
>> -			goto out;
>> +			goto nomem;
>>  		}
>>  	}
>> +	return 0;
>> +nomem:
>> +	css_put(&mem->css);
>> +	return -ENOMEM;
>> +}
>>  
>> +/*
>> + * commit a charge got by mem_cgroup_try_charge() and makes page_cgroup to
 be
>> + * USED state. If already USED, uncharge and return.
>> + */
>> +
>> +static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
>> +				     struct page_cgroup *pc,
>> +				     enum charge_type ctype)
>> +{
>> +	struct mem_cgroup_per_zone *mz;
>> +	unsigned long flags;
>>  
>>  	lock_page_cgroup(pc);
>>  	if (unlikely(PageCgroupUsed(pc))) {
>>  		unlock_page_cgroup(pc);
>>  		res_counter_uncharge(&mem->res, PAGE_SIZE);
>>  		css_put(&mem->css);
>> -
>> -		goto done;
>> +		return;
>>  	}
>>  	pc->mem_cgroup = mem;
>>  	/*
>
>Hmm, this patch cannot be applied because of this part.
>
>After [2/6], mem_cgroup_charge_common looks like:
>
>---
>                if (!nr_retries--) {
>                        mem_cgroup_out_of_memory(mem, gfp_mask);
>                        goto out;
>                }
>        }
>
>
>        lock_page_cgroup(pc);
>
>        if (unlikely(PageCgroupUsed(pc))) {
>                unlock_page_cgroup(pc);
>                res_counter_uncharge(&mem->res, PAGE_SIZE);
>                css_put(&mem->css);
>
>                goto done;
>        }
>        pc->mem_cgroup = mem;
>        /*
>---
>
>There is an empty line after lock_page_cgroup.
>
>After removing this line, I can appliy this patch(and [4-6/6]).
>
Ah, sorry. Maybe a refresh miss.

I'll check it again tomorrow.

Thanks,
-Kame


>
>Thanks,
>Daisuke Nishimura.
>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/6] memcg: new force_empty and move_account
  2008-10-01  7:59 ` [PATCH 4/6] memcg: new force_empty and move_account KAMEZAWA Hiroyuki
@ 2008-10-01 16:38   ` Randy Dunlap
  2008-10-02  5:13     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 26+ messages in thread
From: Randy Dunlap @ 2008-10-01 16:38 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, LKML, balbir, nishimura

On Wed, 1 Oct 2008 16:59:12 +0900 KAMEZAWA Hiroyuki wrote:

>  Documentation/controllers/memory.txt |   10 -
>  mm/memcontrol.c                      |  267 +++++++++++++++++++++++++++--------
>  2 files changed, 216 insertions(+), 61 deletions(-)
> 
> Index: mmotm-2.6.27-rc7+/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.27-rc7+.orig/mm/memcontrol.c
> +++ mmotm-2.6.27-rc7+/mm/memcontrol.c
> @@ -538,6 +533,24 @@ nomem:
>  	return -ENOMEM;
>  }
>  
> +/**
> + * mem_cgroup_try_charge - get charge of PAGE_SIZE.
> + * @mm: an mm_struct which is charged against. (when *memcg is NULL)
> + * @gfp_mask: gfp_mask for reclaim.
> + * @memcg: a pointer to memory cgroup which is charged against.
> + *
> + * charge aginst memory cgroup pointed by *memcg. if *memcg == NULL, estimated
> + * memory cgroup from @mm is got and stored in *memcg.
> + *
> + * Retruns 0 if success. -ENOMEM at failure.
> + */
> +
> +int mem_cgroup_try_charge(struct mm_struct *mm,
> +			  gfp_t mask, struct mem_cgroup **memcg)
> +{
> +	return __mem_cgroup_try_charge(mm, mask, memcg, false);
> +}
> +
>  /*
>   * commit a charge got by mem_cgroup_try_charge() and makes page_cgroup to be
>   * USED state. If already USED, uncharge and return.
> @@ -567,11 +580,109 @@ static void __mem_cgroup_commit_charge(s
>  	mz = page_cgroup_zoneinfo(pc);
>  
>  	spin_lock_irqsave(&mz->lru_lock, flags);
> -	__mem_cgroup_add_list(mz, pc);
> +	__mem_cgroup_add_list(mz, pc, true);
>  	spin_unlock_irqrestore(&mz->lru_lock, flags);
>  	unlock_page_cgroup(pc);
>  }
>  
> +/**
> + * mem_cgroup_move_account - move account of the page
> + * @pc   ... page_cgroup of the page.
> + * @from ... mem_cgroup which the page is moved from.
> + * @to   ... mem_cgroup which the page is moved to. @from != @to.

Bad kernel-doc format.  Use (e.g.)
 * @pc: page_cgroup of the page

as was done in the function above.

> + *
> + * The caller must confirm following.
> + * 1. disable irq.
> + * 2. lru_lock of old mem_cgroup(@from) should be held.
> + *
> + * returns 0 at success,
> + * returns -EBUSY when lock is busy or "pc" is unstable.
> + *
> + * This function do "uncharge" from old cgroup but doesn't do "charge" to

                    does

> + * new cgroup. It should be done by a caller.
> + */
> +
> +static int mem_cgroup_move_account(struct page_cgroup *pc,
> +	struct mem_cgroup *from, struct mem_cgroup *to)
> +{
...
> +}
> +
> Index: mmotm-2.6.27-rc7+/Documentation/controllers/memory.txt
> ===================================================================
> --- mmotm-2.6.27-rc7+.orig/Documentation/controllers/memory.txt
> +++ mmotm-2.6.27-rc7+/Documentation/controllers/memory.txt
> @@ -211,7 +211,9 @@ The memory.force_empty gives an interfac
>  
>  # echo 1 > memory.force_empty
>  
> -will drop all charges in cgroup. Currently, this is maintained for test.
> +Will move account to parent. if parenet is full, will try to free pages.

                                If parent

> +If both of a parent and a child are busy, return -EBUSY;

maybe:
   If both parent and child are busy, return -EBUSY.

> +This file, memory.force_empty, is just for debug purpose.
>  
>  4. Testing
>  
> @@ -242,8 +244,10 @@ reclaimed.
>  
>  A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
>  cgroup might have some charge associated with it, even though all
> -tasks have migrated away from it. Such charges are automatically dropped at
> -rmdir() if there are no tasks.
> +tasks have migrated away from it.
> +Such charges are moved to its parent as mush as possible and freed if parent

                                           much

> +seems to be full. (see force_empty)

seems??  Is it questionable/unsure?

> +If both of them are busy, rmdir() returns -EBUSY.
>  
>  5. TODO


---
~Randy

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/6] memcg: new force_empty and move_account
  2008-10-01 16:38   ` Randy Dunlap
@ 2008-10-02  5:13     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-10-02  5:13 UTC (permalink / raw)
  To: Randy Dunlap; +Cc: linux-mm, LKML, balbir, nishimura

Thank you for the review!

On Wed, 1 Oct 2008 09:38:04 -0700
Randy Dunlap <randy.dunlap@oracle.com> wrote:

> On Wed, 1 Oct 2008 16:59:12 +0900 KAMEZAWA Hiroyuki wrote:
> 
> >  Documentation/controllers/memory.txt |   10 -
> >  mm/memcontrol.c                      |  267 +++++++++++++++++++++++++++--------
> >  2 files changed, 216 insertions(+), 61 deletions(-)
> > 
> > Index: mmotm-2.6.27-rc7+/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-2.6.27-rc7+.orig/mm/memcontrol.c
> > +++ mmotm-2.6.27-rc7+/mm/memcontrol.c
> > @@ -538,6 +533,24 @@ nomem:
> >  	return -ENOMEM;
> >  }
> >  
> > +/**
> > + * mem_cgroup_try_charge - get charge of PAGE_SIZE.
> > + * @mm: an mm_struct which is charged against. (when *memcg is NULL)
> > + * @gfp_mask: gfp_mask for reclaim.
> > + * @memcg: a pointer to memory cgroup which is charged against.
> > + *
> > + * charge aginst memory cgroup pointed by *memcg. if *memcg == NULL, estimated
> > + * memory cgroup from @mm is got and stored in *memcg.
> > + *
> > + * Retruns 0 if success. -ENOMEM at failure.
> > + */
> > +
> > +int mem_cgroup_try_charge(struct mm_struct *mm,
> > +			  gfp_t mask, struct mem_cgroup **memcg)
> > +{
> > +	return __mem_cgroup_try_charge(mm, mask, memcg, false);
> > +}
> > +
> >  /*
> >   * commit a charge got by mem_cgroup_try_charge() and makes page_cgroup to be
> >   * USED state. If already USED, uncharge and return.
> > @@ -567,11 +580,109 @@ static void __mem_cgroup_commit_charge(s
> >  	mz = page_cgroup_zoneinfo(pc);
> >  
> >  	spin_lock_irqsave(&mz->lru_lock, flags);
> > -	__mem_cgroup_add_list(mz, pc);
> > +	__mem_cgroup_add_list(mz, pc, true);
> >  	spin_unlock_irqrestore(&mz->lru_lock, flags);
> >  	unlock_page_cgroup(pc);
> >  }
> >  
> > +/**
> > + * mem_cgroup_move_account - move account of the page
> > + * @pc   ... page_cgroup of the page.
> > + * @from ... mem_cgroup which the page is moved from.
> > + * @to   ... mem_cgroup which the page is moved to. @from != @to.
> 
> Bad kernel-doc format.  Use (e.g.)
>  * @pc: page_cgroup of the page
> 
Hmm, sorry.

> as was done in the function above.
> 
> > + *
> > + * The caller must confirm following.
> > + * 1. disable irq.
> > + * 2. lru_lock of old mem_cgroup(@from) should be held.
> > + *
> > + * returns 0 at success,
> > + * returns -EBUSY when lock is busy or "pc" is unstable.
> > + *
> > + * This function do "uncharge" from old cgroup but doesn't do "charge" to
> 
>                     does
> 
will fix.

> > + * new cgroup. It should be done by a caller.
> > + */
> > +
> > +static int mem_cgroup_move_account(struct page_cgroup *pc,
> > +	struct mem_cgroup *from, struct mem_cgroup *to)
> > +{
> ...
> > +}
> > +
> > Index: mmotm-2.6.27-rc7+/Documentation/controllers/memory.txt
> > ===================================================================
> > --- mmotm-2.6.27-rc7+.orig/Documentation/controllers/memory.txt
> > +++ mmotm-2.6.27-rc7+/Documentation/controllers/memory.txt
> > @@ -211,7 +211,9 @@ The memory.force_empty gives an interfac
> >  
> >  # echo 1 > memory.force_empty
> >  
> > -will drop all charges in cgroup. Currently, this is maintained for test.
> > +Will move account to parent. if parenet is full, will try to free pages.
> 
>                                 If parent
> 
Ah, sorry.

> > +If both of a parent and a child are busy, return -EBUSY;
> 
> maybe:
>    If both parent and child are busy, return -EBUSY.
> 
will fix.


> > +This file, memory.force_empty, is just for debug purpose.
> >  
> >  4. Testing
> >  
> > @@ -242,8 +244,10 @@ reclaimed.
> >  
> >  A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a
> >  cgroup might have some charge associated with it, even though all
> > -tasks have migrated away from it. Such charges are automatically dropped at
> > -rmdir() if there are no tasks.
> > +tasks have migrated away from it.
> > +Such charges are moved to its parent as mush as possible and freed if parent
> 
>                                            much
> 
will fix

> > +seems to be full. (see force_empty)
> 
> seems??  Is it questionable/unsure?
> 

It's an unstable state. If someone frees some pages, it's no longer full.
But "seems" is not good; I'll remove "seems".

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 2/6] memcg: allocate page_cgroup at boot (hunk fix)
  2008-10-01  7:56 ` [PATCH 2/6] memcg: allocate page_cgroup at boot KAMEZAWA Hiroyuki
@ 2008-10-02  8:49   ` KAMEZAWA Hiroyuki
  2008-10-06 16:32     ` Balbir Singh
  2008-10-06 10:11   ` [PATCH 2/6] memcg: allocate page_cgroup at boot Balbir Singh
  1 sibling, 1 reply; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-10-02  8:49 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, LKML, balbir, nishimura

[3-6/6] cannot be applied because of a hunk failure.
This is a fixed version of 2/6,
only for testing. I'll have to post this series again anyway.

=
Allocate all page_cgroups at boot and remove the page_cgroup pointer
from struct page. This patch adds an interface:

 struct page_cgroup *lookup_page_cgroup(struct page*)

All of FLATMEM/DISCONTIGMEM/SPARSEMEM and MEMORY_HOTPLUG are supported.
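
A minimal usage sketch of the new interface, assuming the lock/flag helpers
added by this patch (do_something() is just a hypothetical placeholder):

        struct page_cgroup *pc = lookup_page_cgroup(page);

        if (pc) {               /* NULL can happen only at early boot */
                lock_page_cgroup(pc);
                if (PageCgroupUsed(pc))
                        /* pc->mem_cgroup is stable while the lock is held */
                        do_something(pc->mem_cgroup);
                unlock_page_cgroup(pc);
        }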

Removing the page_cgroup pointer reduces the amount of memory by
 - 4 bytes per PAGE_SIZE (32-bit), or
 - 8 bytes per PAGE_SIZE (64-bit)
if the memory controller is disabled (even if it is configured in).

On a usual 8GB x86-32 server, this saves 8MB of NORMAL_ZONE memory.
On my x86-64 server with 48GB of memory, this saves 96MB of memory.
I think this reduction makes sense.
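
(For reference, assuming 4KB pages, the arithmetic behind those numbers is
roughly: 8GB / 4KB = 2M pages x 4 bytes = 8MB, and 48GB / 4KB = 12M pages
x 8 bytes = 96MB.)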

By pre-allocation, the kmalloc/kfree in charge/uncharge are removed.
This means
  - we no longer need to be afraid of kmalloc failure
    (which can happen because of the gfp_mask type).
  - we can avoid calling kmalloc/kfree.
  - we can avoid allocating tons of small objects which can become fragmented.
  - we can know what amount of memory will be used for this extra-LRU handling.

I added printk messages:

	"allocated %ld bytes of page_cgroup"
        "please try cgroup_disable=memory option if you don't want"

which should be informative enough for users.

Changelog: v5 -> v6.
 * reflected comments.
 * coding style fixes.
 * removed "ctype" from uncharge.
 * improved comment to show FLAT_NODE_MEM_MAP == !SPARSEMEM
 * fixed errors in !SPARSEMEM codes
 * removed unused function in !SPARSEMEM codes.
(start from v5 because of series..)

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 include/linux/memcontrol.h  |   11 -
 include/linux/mm_types.h    |    4 
 include/linux/mmzone.h      |   14 ++
 include/linux/page_cgroup.h |   90 ++++++++++++++++
 mm/Makefile                 |    2 
 mm/memcontrol.c             |  246 ++++++++++++++------------------------------
 mm/page_alloc.c             |   12 --
 mm/page_cgroup.c            |  237 ++++++++++++++++++++++++++++++++++++++++++
 8 files changed, 424 insertions(+), 192 deletions(-)

Index: mmotm-2.6.27-rc7+/mm/page_cgroup.c
===================================================================
--- /dev/null
+++ mmotm-2.6.27-rc7+/mm/page_cgroup.c
@@ -0,0 +1,237 @@
+#include <linux/mm.h>
+#include <linux/mmzone.h>
+#include <linux/bootmem.h>
+#include <linux/bit_spinlock.h>
+#include <linux/page_cgroup.h>
+#include <linux/hash.h>
+#include <linux/memory.h>
+
+static void __meminit
+__init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
+{
+	pc->flags = 0;
+	pc->mem_cgroup = NULL;
+	pc->page = pfn_to_page(pfn);
+}
+static unsigned long total_usage;
+
+#if !defined(CONFIG_SPARSEMEM)
+
+
+void __init pgdat_page_cgroup_init(struct pglist_data *pgdat)
+{
+	pgdat->node_page_cgroup = NULL;
+}
+
+struct page_cgroup *lookup_page_cgroup(struct page *page)
+{
+	unsigned long pfn = page_to_pfn(page);
+	unsigned long offset;
+	struct page_cgroup *base;
+
+	base = NODE_DATA(page_to_nid(page))->node_page_cgroup;
+	if (unlikely(!base))
+		return NULL;
+
+	offset = pfn - NODE_DATA(page_to_nid(page))->node_start_pfn;
+	return base + offset;
+}
+
+static int __init alloc_node_page_cgroup(int nid)
+{
+	struct page_cgroup *base, *pc;
+	unsigned long table_size;
+	unsigned long start_pfn, nr_pages, index;
+
+	start_pfn = NODE_DATA(nid)->node_start_pfn;
+	nr_pages = NODE_DATA(nid)->node_spanned_pages;
+
+	table_size = sizeof(struct page_cgroup) * nr_pages;
+
+	base = __alloc_bootmem_node_nopanic(NODE_DATA(nid),
+			table_size, PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
+	if (!base)
+		return -ENOMEM;
+	for (index = 0; index < nr_pages; index++) {
+		pc = base + index;
+		__init_page_cgroup(pc, start_pfn + index);
+	}
+	NODE_DATA(nid)->node_page_cgroup = base;
+	total_usage += table_size;
+	return 0;
+}
+
+void __init page_cgroup_init(void)
+{
+
+	int nid, fail;
+
+	for_each_online_node(nid)  {
+		fail = alloc_node_page_cgroup(nid);
+		if (fail)
+			goto fail;
+	}
+	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
+	printk(KERN_INFO "please try cgroup_disable=memory option if you"
+	" don't want\n");
+	return;
+fail:
+	printk(KERN_CRIT "allocation of page_cgroup was failed.\n");
+	printk(KERN_CRIT "please try cgroup_disable=memory boot option\n");
+	panic("Out of memory");
+}
+
+#else /* CONFIG_FLAT_NODE_MEM_MAP */
+
+struct page_cgroup *lookup_page_cgroup(struct page *page)
+{
+	unsigned long pfn = page_to_pfn(page);
+	struct mem_section *section = __pfn_to_section(pfn);
+
+	return section->page_cgroup + pfn;
+}
+
+int __meminit init_section_page_cgroup(unsigned long pfn)
+{
+	struct mem_section *section;
+	struct page_cgroup *base, *pc;
+	unsigned long table_size;
+	int nid, index;
+
+	section = __pfn_to_section(pfn);
+
+	if (section->page_cgroup)
+		return 0;
+
+	nid = page_to_nid(pfn_to_page(pfn));
+
+	table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
+	base = kmalloc_node(table_size, GFP_KERNEL, nid);
+	if (!base)
+		base = vmalloc_node(table_size, nid);
+
+	if (!base) {
+		printk(KERN_ERR "page cgroup allocation failure\n");
+		return -ENOMEM;
+	}
+
+	for (index = 0; index < PAGES_PER_SECTION; index++) {
+		pc = base + index;
+		__init_page_cgroup(pc, pfn + index);
+	}
+
+	section = __pfn_to_section(pfn);
+	section->page_cgroup = base - pfn;
+	total_usage += table_size;
+	return 0;
+}
+#ifdef CONFIG_MEMORY_HOTPLUG
+void __free_page_cgroup(unsigned long pfn)
+{
+	struct mem_section *ms;
+	struct page_cgroup *base;
+
+	ms = __pfn_to_section(pfn);
+	if (!ms || !ms->page_cgroup)
+		return;
+	base = ms->page_cgroup + pfn;
+	ms->page_cgroup = NULL;
+	if (is_vmalloc_addr(base))
+		vfree(base);
+	else
+		kfree(base);
+}
+
+int online_page_cgroup(unsigned long start_pfn,
+			unsigned long nr_pages,
+			int nid)
+{
+	unsigned long start, end, pfn;
+	int fail = 0;
+
+	start = start_pfn & (PAGES_PER_SECTION - 1);
+	end = ALIGN(start_pfn + nr_pages, PAGES_PER_SECTION);
+
+	for (pfn = start; !fail && pfn < end; pfn += PAGES_PER_SECTION) {
+		if (!pfn_present(pfn))
+			continue;
+		fail = init_section_page_cgroup(pfn);
+	}
+	if (!fail)
+		return 0;
+
+	/* rollback */
+	for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION)
+		__free_page_cgroup(pfn);
+
+	return -ENOMEM;
+}
+
+int offline_page_cgroup(unsigned long start_pfn,
+		unsigned long nr_pages, int nid)
+{
+	unsigned long start, end, pfn;
+
+	start = start_pfn & (PAGES_PER_SECTION - 1);
+	end = ALIGN(start_pfn + nr_pages, PAGES_PER_SECTION);
+
+	for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION)
+		__free_page_cgroup(pfn);
+	return 0;
+
+}
+
+static int page_cgroup_callback(struct notifier_block *self,
+			       unsigned long action, void *arg)
+{
+	struct memory_notify *mn = arg;
+	int ret = 0;
+	switch (action) {
+	case MEM_GOING_ONLINE:
+		ret = online_page_cgroup(mn->start_pfn,
+				   mn->nr_pages, mn->status_change_nid);
+		break;
+	case MEM_CANCEL_ONLINE:
+	case MEM_OFFLINE:
+		offline_page_cgroup(mn->start_pfn,
+				mn->nr_pages, mn->status_change_nid);
+		break;
+	case MEM_GOING_OFFLINE:
+		break;
+	case MEM_ONLINE:
+	case MEM_CANCEL_OFFLINE:
+		break;
+	}
+	ret = notifier_from_errno(ret);
+	return ret;
+}
+
+#endif
+
+void __init page_cgroup_init(void)
+{
+	unsigned long pfn;
+	int fail = 0;
+
+	for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
+		if (!pfn_present(pfn))
+			continue;
+		fail = init_section_page_cgroup(pfn);
+	}
+	if (fail) {
+		printk(KERN_CRIT "try cgroup_disable=memory boot option\n");
+		panic("Out of memory");
+	} else {
+		hotplug_memory_notifier(page_cgroup_callback, 0);
+	}
+	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
+	printk(KERN_INFO "please try cgroup_disable=memory option if you don't"
+	" want\n");
+}
+
+void __init pgdat_page_cgroup_init(struct pglist_data *pgdat)
+{
+	return;
+}
+
+#endif
Index: mmotm-2.6.27-rc7+/include/linux/mm_types.h
===================================================================
--- mmotm-2.6.27-rc7+.orig/include/linux/mm_types.h
+++ mmotm-2.6.27-rc7+/include/linux/mm_types.h
@@ -94,10 +94,6 @@ struct page {
 	void *virtual;			/* Kernel virtual address (NULL if
 					   not kmapped, ie. highmem) */
 #endif /* WANT_PAGE_VIRTUAL */
-#ifdef CONFIG_CGROUP_MEM_RES_CTLR
-	unsigned long page_cgroup;
-#endif
-
 #ifdef CONFIG_KMEMCHECK
 	void *shadow;
 #endif
Index: mmotm-2.6.27-rc7+/mm/Makefile
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/Makefile
+++ mmotm-2.6.27-rc7+/mm/Makefile
@@ -34,6 +34,6 @@ obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_SMP) += allocpercpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
-obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o
+obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
 obj-$(CONFIG_CGROUP_MEMRLIMIT_CTLR) += memrlimitcgroup.o
 obj-$(CONFIG_KMEMTRACE) += kmemtrace.o
Index: mmotm-2.6.27-rc7+/include/linux/page_cgroup.h
===================================================================
--- /dev/null
+++ mmotm-2.6.27-rc7+/include/linux/page_cgroup.h
@@ -0,0 +1,90 @@
+#ifndef __LINUX_PAGE_CGROUP_H
+#define __LINUX_PAGE_CGROUP_H
+#include <linux/bit_spinlock.h>
+/*
+ * Page Cgroup can be considered as an extended mem_map.
+ * A page_cgroup page is associated with every page descriptor. The
+ * page_cgroup helps us identify information about the cgroup
+ * All page cgroups are allocated at boot or memory hotplug event,
+ * then the page cgroup for pfn always exists.
+ */
+struct page_cgroup {
+	unsigned long flags;
+	struct mem_cgroup *mem_cgroup;
+	struct page *page;
+	struct list_head lru;		/* per cgroup LRU list */
+};
+
+void __init pgdat_page_cgroup_init(struct pglist_data *pgdat);
+void __init page_cgroup_init(void);
+struct page_cgroup *lookup_page_cgroup(struct page *page);
+
+enum {
+	/* flags for mem_cgroup */
+	PCG_LOCK,  /* page cgroup is locked */
+	PCG_CACHE, /* charged as cache */
+	PCG_USED, /* this object is in use. */
+	/* flags for LRU placement */
+	PCG_ACTIVE, /* page is active in this cgroup */
+	PCG_FILE, /* page is file system backed */
+	PCG_UNEVICTABLE, /* page is unevictableable */
+};
+
+#define TESTPCGFLAG(uname, lname)			\
+static inline int PageCgroup##uname(struct page_cgroup *pc)	\
+	{ return test_bit(PCG_##lname, &pc->flags); }
+
+#define SETPCGFLAG(uname, lname)			\
+static inline void SetPageCgroup##uname(struct page_cgroup *pc)\
+	{ set_bit(PCG_##lname, &pc->flags);  }
+
+#define CLEARPCGFLAG(uname, lname)			\
+static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
+	{ clear_bit(PCG_##lname, &pc->flags);  }
+
+/* Cache flag is set only once (at allocation) */
+TESTPCGFLAG(Cache, CACHE)
+
+TESTPCGFLAG(Used, USED)
+CLEARPCGFLAG(Used, USED)
+
+/* LRU management flags (from global-lru definition) */
+TESTPCGFLAG(File, FILE)
+SETPCGFLAG(File, FILE)
+CLEARPCGFLAG(File, FILE)
+
+TESTPCGFLAG(Active, ACTIVE)
+SETPCGFLAG(Active, ACTIVE)
+CLEARPCGFLAG(Active, ACTIVE)
+
+TESTPCGFLAG(Unevictable, UNEVICTABLE)
+SETPCGFLAG(Unevictable, UNEVICTABLE)
+CLEARPCGFLAG(Unevictable, UNEVICTABLE)
+
+static inline int page_cgroup_nid(struct page_cgroup *pc)
+{
+	return page_to_nid(pc->page);
+}
+
+static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
+{
+	return page_zonenum(pc->page);
+}
+
+static inline void lock_page_cgroup(struct page_cgroup *pc)
+{
+	bit_spin_lock(PCG_LOCK, &pc->flags);
+}
+
+static inline int trylock_page_cgroup(struct page_cgroup *pc)
+{
+	return bit_spin_trylock(PCG_LOCK, &pc->flags);
+}
+
+static inline void unlock_page_cgroup(struct page_cgroup *pc)
+{
+	bit_spin_unlock(PCG_LOCK, &pc->flags);
+}
+
+
+#endif
Index: mmotm-2.6.27-rc7+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc7+/mm/memcontrol.c
@@ -33,11 +33,11 @@
 #include <linux/seq_file.h>
 #include <linux/vmalloc.h>
 #include <linux/mm_inline.h>
+#include <linux/page_cgroup.h>
 
 #include <asm/uaccess.h>
 
 struct cgroup_subsys mem_cgroup_subsys __read_mostly;
-static struct kmem_cache *page_cgroup_cache __read_mostly;
 #define MEM_CGROUP_RECLAIM_RETRIES	5
 
 /*
@@ -135,79 +135,6 @@ struct mem_cgroup {
 };
 static struct mem_cgroup init_mem_cgroup;
 
-/*
- * We use the lower bit of the page->page_cgroup pointer as a bit spin
- * lock.  We need to ensure that page->page_cgroup is at least two
- * byte aligned (based on comments from Nick Piggin).  But since
- * bit_spin_lock doesn't actually set that lock bit in a non-debug
- * uniprocessor kernel, we should avoid setting it here too.
- */
-#define PAGE_CGROUP_LOCK_BIT 	0x0
-#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
-#define PAGE_CGROUP_LOCK 	(1 << PAGE_CGROUP_LOCK_BIT)
-#else
-#define PAGE_CGROUP_LOCK	0x0
-#endif
-
-/*
- * A page_cgroup page is associated with every page descriptor. The
- * page_cgroup helps us identify information about the cgroup
- */
-struct page_cgroup {
-	struct list_head lru;		/* per cgroup LRU list */
-	struct page *page;
-	struct mem_cgroup *mem_cgroup;
-	unsigned long flags;
-};
-
-enum {
-	/* flags for mem_cgroup */
-	PCG_CACHE, /* charged as cache */
-	/* flags for LRU placement */
-	PCG_ACTIVE, /* page is active in this cgroup */
-	PCG_FILE, /* page is file system backed */
-	PCG_UNEVICTABLE, /* page is unevictableable */
-};
-
-#define TESTPCGFLAG(uname, lname)			\
-static inline int PageCgroup##uname(struct page_cgroup *pc)	\
-	{ return test_bit(PCG_##lname, &pc->flags); }
-
-#define SETPCGFLAG(uname, lname)			\
-static inline void SetPageCgroup##uname(struct page_cgroup *pc)\
-	{ set_bit(PCG_##lname, &pc->flags);  }
-
-#define CLEARPCGFLAG(uname, lname)			\
-static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
-	{ clear_bit(PCG_##lname, &pc->flags);  }
-
-
-/* Cache flag is set only once (at allocation) */
-TESTPCGFLAG(Cache, CACHE)
-
-/* LRU management flags (from global-lru definition) */
-TESTPCGFLAG(File, FILE)
-SETPCGFLAG(File, FILE)
-CLEARPCGFLAG(File, FILE)
-
-TESTPCGFLAG(Active, ACTIVE)
-SETPCGFLAG(Active, ACTIVE)
-CLEARPCGFLAG(Active, ACTIVE)
-
-TESTPCGFLAG(Unevictable, UNEVICTABLE)
-SETPCGFLAG(Unevictable, UNEVICTABLE)
-CLEARPCGFLAG(Unevictable, UNEVICTABLE)
-
-static int page_cgroup_nid(struct page_cgroup *pc)
-{
-	return page_to_nid(pc->page);
-}
-
-static enum zone_type page_cgroup_zid(struct page_cgroup *pc)
-{
-	return page_zonenum(pc->page);
-}
-
 enum charge_type {
 	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
 	MEM_CGROUP_CHARGE_TYPE_MAPPED,
@@ -216,12 +143,18 @@ enum charge_type {
 	NR_CHARGE_TYPE,
 };
 
+/* only for here (for easy reading.) */
+#define PCGF_CACHE	(1UL << PCG_CACHE)
+#define PCGF_USED	(1UL << PCG_USED)
+#define PCGF_ACTIVE	(1UL << PCG_ACTIVE)
+#define PCGF_LOCK	(1UL << PCG_LOCK)
+#define PCGF_FILE	(1UL << PCG_FILE)
 static const unsigned long
 pcg_default_flags[NR_CHARGE_TYPE] = {
-	((1 << PCG_CACHE) | (1 << PCG_FILE)),
-	((1 << PCG_ACTIVE)),
-	((1 << PCG_ACTIVE) | (1 << PCG_CACHE)),
-	0,
+	PCGF_CACHE | PCGF_FILE | PCGF_USED | PCGF_LOCK, /* File Cache */
+	PCGF_ACTIVE | PCGF_USED | PCGF_LOCK, /* Anon */
+	PCGF_ACTIVE | PCGF_CACHE | PCGF_USED | PCGF_LOCK, /* Shmem */
+	0, /* FORCE */
 };
 
 /*
@@ -303,37 +236,6 @@ struct mem_cgroup *mem_cgroup_from_task(
 				struct mem_cgroup, css);
 }
 
-static inline int page_cgroup_locked(struct page *page)
-{
-	return bit_spin_is_locked(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-static void page_assign_page_cgroup(struct page *page, struct page_cgroup *pc)
-{
-	VM_BUG_ON(!page_cgroup_locked(page));
-	page->page_cgroup = ((unsigned long)pc | PAGE_CGROUP_LOCK);
-}
-
-struct page_cgroup *page_get_page_cgroup(struct page *page)
-{
-	return (struct page_cgroup *) (page->page_cgroup & ~PAGE_CGROUP_LOCK);
-}
-
-static void lock_page_cgroup(struct page *page)
-{
-	bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-static int try_lock_page_cgroup(struct page *page)
-{
-	return bit_spin_trylock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
-static void unlock_page_cgroup(struct page *page)
-{
-	bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
-}
-
 static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
 			struct page_cgroup *pc)
 {
@@ -436,17 +338,16 @@ void mem_cgroup_move_lists(struct page *
 	 * safely get to page_cgroup without it, so just try_lock it:
 	 * mem_cgroup_isolate_pages allows for page left on wrong list.
 	 */
-	if (!try_lock_page_cgroup(page))
+	pc = lookup_page_cgroup(page);
+	if (!trylock_page_cgroup(pc))
 		return;
-
-	pc = page_get_page_cgroup(page);
-	if (pc) {
+	if (pc && PageCgroupUsed(pc)) {
 		mz = page_cgroup_zoneinfo(pc);
 		spin_lock_irqsave(&mz->lru_lock, flags);
 		__mem_cgroup_move_lists(pc, lru);
 		spin_unlock_irqrestore(&mz->lru_lock, flags);
 	}
-	unlock_page_cgroup(page);
+	unlock_page_cgroup(pc);
 }
 
 /*
@@ -533,6 +434,8 @@ unsigned long mem_cgroup_isolate_pages(u
 	list_for_each_entry_safe_reverse(pc, tmp, src, lru) {
 		if (scan >= nr_to_scan)
 			break;
+		if (unlikely(!PageCgroupUsed(pc)))
+			continue;
 		page = pc->page;
 
 		if (unlikely(!PageLRU(page)))
@@ -576,26 +479,27 @@ static int mem_cgroup_charge_common(stru
 {
 	struct mem_cgroup *mem;
 	struct page_cgroup *pc;
-	unsigned long flags;
 	unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
 	struct mem_cgroup_per_zone *mz;
+	unsigned long flags;
 
-	pc = kmem_cache_alloc(page_cgroup_cache, gfp_mask);
-	if (unlikely(pc == NULL))
-		goto err;
-
+	pc = lookup_page_cgroup(page);
+	/* can happen at boot */
+	if (unlikely(!pc))
+		return 0;
+	prefetchw(pc);
 	/*
 	 * We always charge the cgroup the mm_struct belongs to.
 	 * The mm_struct's mem_cgroup changes on task migration if the
 	 * thread group leader migrates. It's possible that mm is not
 	 * set, if so charge the init_mm (happens for pagecache usage).
 	 */
+
 	if (likely(!memcg)) {
 		rcu_read_lock();
 		mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
 		if (unlikely(!mem)) {
 			rcu_read_unlock();
-			kmem_cache_free(page_cgroup_cache, pc);
 			return 0;
 		}
 		/*
@@ -631,36 +535,33 @@ static int mem_cgroup_charge_common(stru
 		}
 	}
 
+
+	lock_page_cgroup(pc);
+	if (unlikely(PageCgroupUsed(pc))) {
+		unlock_page_cgroup(pc);
+		res_counter_uncharge(&mem->res, PAGE_SIZE);
+		css_put(&mem->css);
+
+		goto done;
+	}
 	pc->mem_cgroup = mem;
-	pc->page = page;
 	/*
 	 * If a page is accounted as a page cache, insert to inactive list.
 	 * If anon, insert to active list.
 	 */
 	pc->flags = pcg_default_flags[ctype];
 
-	lock_page_cgroup(page);
-	if (unlikely(page_get_page_cgroup(page))) {
-		unlock_page_cgroup(page);
-		res_counter_uncharge(&mem->res, PAGE_SIZE);
-		css_put(&mem->css);
-		kmem_cache_free(page_cgroup_cache, pc);
-		goto done;
-	}
-	page_assign_page_cgroup(page, pc);
-
 	mz = page_cgroup_zoneinfo(pc);
+
 	spin_lock_irqsave(&mz->lru_lock, flags);
 	__mem_cgroup_add_list(mz, pc);
 	spin_unlock_irqrestore(&mz->lru_lock, flags);
+	unlock_page_cgroup(pc);
 
-	unlock_page_cgroup(page);
 done:
 	return 0;
 out:
 	css_put(&mem->css);
-	kmem_cache_free(page_cgroup_cache, pc);
-err:
 	return -ENOMEM;
 }
 
@@ -668,7 +569,8 @@ int mem_cgroup_charge(struct page *page,
 {
 	if (mem_cgroup_subsys.disabled)
 		return 0;
-
+	if (PageCompound(page))
+		return 0;
 	/*
 	 * If already mapped, we don't have to account.
 	 * If page cache, page->mapping has address_space.
@@ -689,7 +591,8 @@ int mem_cgroup_cache_charge(struct page 
 {
 	if (mem_cgroup_subsys.disabled)
 		return 0;
-
+	if (PageCompound(page))
+		return 0;
 	/*
 	 * Corner case handling. This is called from add_to_page_cache()
 	 * in usual. But some FS (shmem) precharges this page before calling it
@@ -702,15 +605,16 @@ int mem_cgroup_cache_charge(struct page 
 	if (!(gfp_mask & __GFP_WAIT)) {
 		struct page_cgroup *pc;
 
-		lock_page_cgroup(page);
-		pc = page_get_page_cgroup(page);
-		if (pc) {
-			VM_BUG_ON(pc->page != page);
-			VM_BUG_ON(!pc->mem_cgroup);
-			unlock_page_cgroup(page);
+
+		pc = lookup_page_cgroup(page);
+		if (!pc)
+			return 0;
+		lock_page_cgroup(pc);
+		if (PageCgroupUsed(pc)) {
+			unlock_page_cgroup(pc);
 			return 0;
 		}
-		unlock_page_cgroup(page);
+		unlock_page_cgroup(pc);
 	}
 
 	if (unlikely(!mm))
@@ -741,37 +645,39 @@ __mem_cgroup_uncharge_common(struct page
 	/*
 	 * Check if our page_cgroup is valid
 	 */
-	lock_page_cgroup(page);
-	pc = page_get_page_cgroup(page);
-	if (unlikely(!pc))
-		goto unlock;
-
-	VM_BUG_ON(pc->page != page);
+	pc = lookup_page_cgroup(page);
+	if (unlikely(!pc || !PageCgroupUsed(pc)))
+		return;
 
-	if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED)
-	    && ((PageCgroupCache(pc) || page_mapped(page))))
-		goto unlock;
+	lock_page_cgroup(pc);
+	if ((ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED && page_mapped(page))
+	     || !PageCgroupUsed(pc)) {
+		/* This happens at race in zap_pte_range() and do_swap_page()*/
+		unlock_page_cgroup(pc);
+		return;
+	}
+	ClearPageCgroupUsed(pc);
+	mem = pc->mem_cgroup;
 
 	mz = page_cgroup_zoneinfo(pc);
 	spin_lock_irqsave(&mz->lru_lock, flags);
 	__mem_cgroup_remove_list(mz, pc);
 	spin_unlock_irqrestore(&mz->lru_lock, flags);
+	unlock_page_cgroup(pc);
 
-	page_assign_page_cgroup(page, NULL);
-	unlock_page_cgroup(page);
-
-	mem = pc->mem_cgroup;
 	res_counter_uncharge(&mem->res, PAGE_SIZE);
 	css_put(&mem->css);
 
-	kmem_cache_free(page_cgroup_cache, pc);
 	return;
-unlock:
-	unlock_page_cgroup(page);
 }
 
 void mem_cgroup_uncharge_page(struct page *page)
 {
+	/* early check. */
+	if (page_mapped(page))
+		return;
+	if (page->mapping && !PageAnon(page))
+		return;
 	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_MAPPED);
 }
 
@@ -795,9 +701,9 @@ int mem_cgroup_prepare_migration(struct 
 	if (mem_cgroup_subsys.disabled)
 		return 0;
 
-	lock_page_cgroup(page);
-	pc = page_get_page_cgroup(page);
-	if (pc) {
+	pc = lookup_page_cgroup(page);
+	lock_page_cgroup(pc);
+	if (PageCgroupUsed(pc)) {
 		mem = pc->mem_cgroup;
 		css_get(&mem->css);
 		if (PageCgroupCache(pc)) {
@@ -807,7 +713,7 @@ int mem_cgroup_prepare_migration(struct 
 				ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;
 		}
 	}
-	unlock_page_cgroup(page);
+	unlock_page_cgroup(pc);
 	if (mem) {
 		ret = mem_cgroup_charge_common(newpage, NULL, GFP_KERNEL,
 			ctype, mem);
@@ -832,7 +738,7 @@ void mem_cgroup_end_migration(struct pag
 	 */
 	if (!newpage->mapping)
 		__mem_cgroup_uncharge_common(newpage,
-					 MEM_CGROUP_CHARGE_TYPE_FORCE);
+				MEM_CGROUP_CHARGE_TYPE_FORCE);
 	else if (PageAnon(newpage))
 		mem_cgroup_uncharge_page(newpage);
 }
@@ -918,6 +824,8 @@ static void mem_cgroup_force_empty_list(
 	while (!list_empty(list)) {
 		pc = list_entry(list->prev, struct page_cgroup, lru);
 		page = pc->page;
+		if (!PageCgroupUsed(pc))
+			break;
 		get_page(page);
 		spin_unlock_irqrestore(&mz->lru_lock, flags);
 		/*
@@ -932,8 +840,10 @@ static void mem_cgroup_force_empty_list(
 				count = FORCE_UNCHARGE_BATCH;
 				cond_resched();
 			}
-		} else
-			cond_resched();
+		} else {
+			spin_lock_irqsave(&mz->lru_lock, flags);
+			break;
+		}
 		spin_lock_irqsave(&mz->lru_lock, flags);
 	}
 	spin_unlock_irqrestore(&mz->lru_lock, flags);
@@ -965,6 +875,7 @@ static int mem_cgroup_force_empty(struct
 				for_each_lru(l)
 					mem_cgroup_force_empty_list(mem, mz, l);
 			}
+		cond_resched();
 	}
 	ret = 0;
 out:
@@ -1175,8 +1086,8 @@ mem_cgroup_create(struct cgroup_subsys *
 	int node;
 
 	if (unlikely((cont->parent) == NULL)) {
+		page_cgroup_init();
 		mem = &init_mem_cgroup;
-		page_cgroup_cache = KMEM_CACHE(page_cgroup, SLAB_PANIC);
 	} else {
 		mem = mem_cgroup_alloc();
 		if (!mem)
Index: mmotm-2.6.27-rc7+/mm/page_alloc.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/page_alloc.c
+++ mmotm-2.6.27-rc7+/mm/page_alloc.c
@@ -44,7 +44,7 @@
 #include <linux/backing-dev.h>
 #include <linux/fault-inject.h>
 #include <linux/page-isolation.h>
-#include <linux/memcontrol.h>
+#include <linux/page_cgroup.h>
 #include <linux/debugobjects.h>
 
 #include <asm/tlbflush.h>
@@ -223,17 +223,12 @@ static inline int bad_range(struct zone 
 
 static void bad_page(struct page *page)
 {
-	void *pc = page_get_page_cgroup(page);
-
 	printk(KERN_EMERG "Bad page state in process '%s'\n" KERN_EMERG
 		"page:%p flags:0x%0*lx mapping:%p mapcount:%d count:%d\n",
 		current->comm, page, (int)(2*sizeof(unsigned long)),
 		(unsigned long)page->flags, page->mapping,
 		page_mapcount(page), page_count(page));
-	if (pc) {
-		printk(KERN_EMERG "cgroup:%p\n", pc);
-		page_reset_bad_cgroup(page);
-	}
+
 	printk(KERN_EMERG "Trying to fix it up, but a reboot is needed\n"
 		KERN_EMERG "Backtrace:\n");
 	dump_stack();
@@ -472,7 +467,6 @@ static inline void free_pages_check(stru
 	free_page_mlock(page);
 	if (unlikely(page_mapcount(page) |
 		(page->mapping != NULL)  |
-		(page_get_page_cgroup(page) != NULL) |
 		(page_count(page) != 0)  |
 		(page->flags & PAGE_FLAGS_CHECK_AT_FREE)))
 		bad_page(page);
@@ -609,7 +603,6 @@ static void prep_new_page(struct page *p
 {
 	if (unlikely(page_mapcount(page) |
 		(page->mapping != NULL)  |
-		(page_get_page_cgroup(page) != NULL) |
 		(page_count(page) != 0)  |
 		(page->flags & PAGE_FLAGS_CHECK_AT_PREP)))
 		bad_page(page);
@@ -3495,6 +3488,7 @@ static void __paginginit free_area_init_
 	pgdat->nr_zones = 0;
 	init_waitqueue_head(&pgdat->kswapd_wait);
 	pgdat->kswapd_max_order = 0;
+	pgdat_page_cgroup_init(pgdat);
 	
 	for (j = 0; j < MAX_NR_ZONES; j++) {
 		struct zone *zone = pgdat->node_zones + j;
Index: mmotm-2.6.27-rc7+/include/linux/mmzone.h
===================================================================
--- mmotm-2.6.27-rc7+.orig/include/linux/mmzone.h
+++ mmotm-2.6.27-rc7+/include/linux/mmzone.h
@@ -602,8 +602,11 @@ typedef struct pglist_data {
 	struct zone node_zones[MAX_NR_ZONES];
 	struct zonelist node_zonelists[MAX_ZONELISTS];
 	int nr_zones;
-#ifdef CONFIG_FLAT_NODE_MEM_MAP
+#ifdef CONFIG_FLAT_NODE_MEM_MAP	/* means !SPARSEMEM */
 	struct page *node_mem_map;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+	struct page_cgroup *node_page_cgroup;
+#endif
 #endif
 	struct bootmem_data *bdata;
 #ifdef CONFIG_MEMORY_HOTPLUG
@@ -932,6 +935,7 @@ static inline unsigned long early_pfn_to
 #endif
 
 struct page;
+struct page_cgroup;
 struct mem_section {
 	/*
 	 * This is, logically, a pointer to an array of struct
@@ -949,6 +953,14 @@ struct mem_section {
 
 	/* See declaration of similar field in struct zone */
 	unsigned long *pageblock_flags;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+	/*
+	 * If !SPARSEMEM, pgdat doesn't have page_cgroup pointer. We use
+	 * section. (see memcontrol.h/page_cgroup.h about this.)
+	 */
+	struct page_cgroup *page_cgroup;
+	unsigned long pad;
+#endif
 };
 
 #ifdef CONFIG_SPARSEMEM_EXTREME
Index: mmotm-2.6.27-rc7+/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.27-rc7+.orig/include/linux/memcontrol.h
+++ mmotm-2.6.27-rc7+/include/linux/memcontrol.h
@@ -29,7 +29,6 @@ struct mm_struct;
 
 #define page_reset_bad_cgroup(page)	((page)->page_cgroup = 0)
 
-extern struct page_cgroup *page_get_page_cgroup(struct page *page);
 extern int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask);
 extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
@@ -72,16 +71,8 @@ extern void mem_cgroup_record_reclaim_pr
 extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
 					int priority, enum lru_list lru);
 
-#else /* CONFIG_CGROUP_MEM_RES_CTLR */
-static inline void page_reset_bad_cgroup(struct page *page)
-{
-}
-
-static inline struct page_cgroup *page_get_page_cgroup(struct page *page)
-{
-	return NULL;
-}
 
+#else /* CONFIG_CGROUP_MEM_RES_CTLR */
 static inline int mem_cgroup_charge(struct page *page,
 					struct mm_struct *mm, gfp_t gfp_mask)
 {


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/6] memcg update v6 (for review and discuss)
  2008-10-01  7:52 [PATCH 0/6] memcg update v6 (for review and discuss) KAMEZAWA Hiroyuki
                   ` (5 preceding siblings ...)
  2008-10-01  8:01 ` [PATCH 6/6] memcg: lazy lru addition KAMEZAWA Hiroyuki
@ 2008-10-02  9:02 ` KAMEZAWA Hiroyuki
  2008-10-06 17:26   ` Balbir Singh
  6 siblings, 1 reply; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-10-02  9:02 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, LKML, balbir, nishimura

On Wed, 1 Oct 2008 16:52:33 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> This series is update from v5.
> 
> easy 4 patches are already posted as ready-to-go-series.
> 
> This is need-more-discuss set.
> 
> Includes following 6 patches. (reduced from v5).
> The whole series are reordered.
> 
> [1/6] make page_cgroup->flags to be atomic.
> [2/6] allocate all page_cgroup at boot.
> [3/6] rewrite charge path by charge/commit/cancel
> [4/6] new force_empty and move_account
> [5/6] lazy lru free
> [6/6] lazy lru add.
> 
> Patch [3/6] and [4/6] are totally rewritten.
> Races in Patch [6/6] is fixed....I think.
> 
> Patch [1-4] seems to be big but there is no complicated ops.
> Patch [5-6] is more racy. Check-by-regression-test is necessary.
> (Of course, I does some.)
> 
> If ready-to-go-series goes, next is patch 1 and 2.
> 

No terrible bugs until now on my test.

My current idea for next week is following.
(I may have to wait until the end of next merge window. If so, 
 I'll wait and maintain this set.)

 - post ready-to-go set again.
 - post 1/6 and 2/6 as may-ready-to-go set. I don't change the order of these.
 - reflect comments for 3/6. 
   patch 3/6 adds new functions. So, please tell me if you have better idea
   about new functions.
 - check logic for 4/6.
 - 5/6 and 6/6 may need some more comments in codes.
 - no new additional ones.


Thanks,
-Kame







^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/6] memcg: charge-commit-cancel protocol
  2008-10-01  7:57 ` [PATCH 3/6] memcg: charge-commit-cancel protocol KAMEZAWA Hiroyuki
  2008-10-01  8:33   ` Daisuke Nishimura
  2008-10-01 10:04   ` kamezawa.hiroyu
@ 2008-10-03 10:05   ` Daisuke Nishimura
  2008-10-03 15:15   ` kamezawa.hiroyu
  2008-10-08  9:05   ` [RFC] memcg: handle migration by charge-commit-cancel (was " KAMEZAWA Hiroyuki
  4 siblings, 0 replies; 26+ messages in thread
From: Daisuke Nishimura @ 2008-10-03 10:05 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: nishimura, linux-mm, LKML, balbir

On Wed, 1 Oct 2008 16:57:34 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> There is a small race in do_swap_page(). When the swapped-in page is charged,
> the mapcount can be greater than 0. But, at the same time, some process (which
> shares it) calls unmap, making the mapcount go 1->0, and the page is uncharged.
> 
>       CPUA 			CPUB
>        mapcount == 1.
>    (1) charge if mapcount==0     zap_pte_range()
>                                 (2) mapcount 1 => 0.
> 			        (3) uncharge(). (success)
>    (4) set page's rmap()
>        mapcount 0=>1
> 
> Then, this swap page's account is leaked.
> 
> For fixing this, I added a new interface.
>   - precharge
>    account to res_counter by PAGE_SIZE and try to free pages if necessary.
>   - commit	
>    register page_cgroup and add to LRU if necessary.
>   - cancel
>    uncharge PAGE_SIZE because of do_swap_page failure.
> 
> 
>      CPUA              
>   (1) charge (always)
>   (2) set page's rmap (mapcount > 0)
>   (3) commit charge was necessary or not after set_pte().
> 
> This protocol uses PCG_USED bit on page_cgroup for avoiding over accounting.
> Usual mem_cgroup_charge_common() does precharge -> commit at a time.
> 
> And this patch also adds the following functions to clarify all charges.
> 
>   - mem_cgroup_newpage_charge() ....replacement for mem_cgroup_charge()
> 	called against newly allocated anon pages.
> 
>   - mem_cgroup_charge_migrate_fixup()
>         called only from remove_migration_ptes().
> 	we'll have to rewrite this later.(this patch just keeps old behavior)
> 
> Good for clarifying "what we do"
> 
> Then, we have 4 following charge points.
>   - newpage
>   - swapin
>   - add-to-cache.
>   - migration.
> 
> precharge/commit/cancel can be used for other places,
>  - shmem, (and other places need precharge.)
>  - move_account(force_empty) etc...
> we'll revisit later.
> 
> Changelog v5 -> v6:
>  - added newpage_charge() and migrate_fixup().
>  - renamed  functions for swap-in from "swap" to "swapin"
>  - add more precise description.
> 

I don't have any objection to this direction now, but I have one question.

Does mem_cgroup_charge_migrate_fixup need to charge a newpage,
while mem_cgroup_prepare_migration has charged it already?

I agree adding I/F would be good for future, but I think
mem_cgroup_charge_migrate_fixup can be a no-op function for now.


Thanks,
Daisuke Nishimura.

> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
>  include/linux/memcontrol.h |   35 +++++++++-
>  mm/memcontrol.c            |  151 +++++++++++++++++++++++++++++++++++----------
>  mm/memory.c                |   12 ++-
>  mm/migrate.c               |    2 
>  mm/swapfile.c              |    6 +
>  5 files changed, 165 insertions(+), 41 deletions(-)
> 
> Index: mmotm-2.6.27-rc7+/include/linux/memcontrol.h
> ===================================================================
> --- mmotm-2.6.27-rc7+.orig/include/linux/memcontrol.h
> +++ mmotm-2.6.27-rc7+/include/linux/memcontrol.h
> @@ -29,8 +29,17 @@ struct mm_struct;
>  
>  #define page_reset_bad_cgroup(page)	((page)->page_cgroup = 0)
>  
> -extern int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
> +extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
>  				gfp_t gfp_mask);
> +extern int mem_cgroup_charge_migrate_fixup(struct page *page,
> +				struct mm_struct *mm, gfp_t gfp_mask);
> +/* for swap handling */
> +extern int mem_cgroup_try_charge(struct mm_struct *mm,
> +		gfp_t gfp_mask, struct mem_cgroup **ptr);
> +extern void mem_cgroup_commit_charge_swapin(struct page *page,
> +					struct mem_cgroup *ptr);
> +extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr);
> +
>  extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
>  					gfp_t gfp_mask);
>  extern void mem_cgroup_move_lists(struct page *page, enum lru_list lru);
> @@ -73,7 +82,9 @@ extern long mem_cgroup_calc_reclaim(stru
>  
>  
>  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> -static inline int mem_cgroup_charge(struct page *page,
> +struct mem_cgroup;
> +
> +static inline int mem_cgroup_newpage_charge(struct page *page,
>  					struct mm_struct *mm, gfp_t gfp_mask)
>  {
>  	return 0;
> @@ -85,6 +96,26 @@ static inline int mem_cgroup_cache_charg
>  	return 0;
>  }
>  
> +static inline int mem_cgroup_charge_migrate_fixup(struct page *page,
> +					struct mm_struct *mm, gfp_t gfp_mask)
> +{
> +	return 0;
> +}
> +
> +static int mem_cgroup_try_charge(struct mm_struct *mm,
> +				gfp_t gfp_mask, struct mem_cgroup **ptr)
> +{
> +	return 0;
> +}
> +
> +static void mem_cgroup_commit_charge_swapin(struct page *page,
> +					  struct mem_cgroup *ptr)
> +{
> +}
> +static void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *ptr)
> +{
> +}
> +
>  static inline void mem_cgroup_uncharge_page(struct page *page)
>  {
>  }
> Index: mmotm-2.6.27-rc7+/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.27-rc7+.orig/mm/memcontrol.c
> +++ mmotm-2.6.27-rc7+/mm/memcontrol.c
> @@ -467,35 +467,31 @@ unsigned long mem_cgroup_isolate_pages(u
>  	return nr_taken;
>  }
>  
> -/*
> - * Charge the memory controller for page usage.
> - * Return
> - * 0 if the charge was successful
> - * < 0 if the cgroup is over its limit
> +
> +/**
> + * mem_cgroup_try_charge - get charge of PAGE_SIZE.
> + * @mm: an mm_struct which is charged against. (when *memcg is NULL)
> + * @gfp_mask: gfp_mask for reclaim.
> + * @memcg: a pointer to memory cgroup which is charged against.
> + *
> + * charge aginst memory cgroup pointed by *memcg. if *memcg == NULL, estimated
> + * memory cgroup from @mm is got and stored in *memcg.
> + *
> + * Retruns 0 if success. -ENOMEM at failure.
>   */
> -static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
> -				gfp_t gfp_mask, enum charge_type ctype,
> -				struct mem_cgroup *memcg)
> +
> +int mem_cgroup_try_charge(struct mm_struct *mm,
> +			gfp_t gfp_mask, struct mem_cgroup **memcg)
>  {
>  	struct mem_cgroup *mem;
> -	struct page_cgroup *pc;
> -	unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> -	struct mem_cgroup_per_zone *mz;
> -	unsigned long flags;
> -
> -	pc = lookup_page_cgroup(page);
> -	/* can happen at boot */
> -	if (unlikely(!pc))
> -		return 0;
> -	prefetchw(pc);
> +	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
>  	/*
>  	 * We always charge the cgroup the mm_struct belongs to.
>  	 * The mm_struct's mem_cgroup changes on task migration if the
>  	 * thread group leader migrates. It's possible that mm is not
>  	 * set, if so charge the init_mm (happens for pagecache usage).
>  	 */
> -
> -	if (likely(!memcg)) {
> +	if (likely(!*memcg)) {
>  		rcu_read_lock();
>  		mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
>  		if (unlikely(!mem)) {
> @@ -506,15 +502,17 @@ static int mem_cgroup_charge_common(stru
>  		 * For every charge from the cgroup, increment reference count
>  		 */
>  		css_get(&mem->css);
> +		*memcg = mem;
>  		rcu_read_unlock();
>  	} else {
> -		mem = memcg;
> -		css_get(&memcg->css);
> +		mem = *memcg;
> +		css_get(&mem->css);
>  	}
>  
> +
>  	while (unlikely(res_counter_charge(&mem->res, PAGE_SIZE))) {
>  		if (!(gfp_mask & __GFP_WAIT))
> -			goto out;
> +			goto nomem;
>  
>  		if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
>  			continue;
> @@ -531,18 +529,33 @@ static int mem_cgroup_charge_common(stru
>  
>  		if (!nr_retries--) {
>  			mem_cgroup_out_of_memory(mem, gfp_mask);
> -			goto out;
> +			goto nomem;
>  		}
>  	}
> +	return 0;
> +nomem:
> +	css_put(&mem->css);
> +	return -ENOMEM;
> +}
>  
> +/*
> + * commit a charge got by mem_cgroup_try_charge() and makes page_cgroup to be
> + * USED state. If already USED, uncharge and return.
> + */
> +
> +static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
> +				     struct page_cgroup *pc,
> +				     enum charge_type ctype)
> +{
> +	struct mem_cgroup_per_zone *mz;
> +	unsigned long flags;
>  
>  	lock_page_cgroup(pc);
>  	if (unlikely(PageCgroupUsed(pc))) {
>  		unlock_page_cgroup(pc);
>  		res_counter_uncharge(&mem->res, PAGE_SIZE);
>  		css_put(&mem->css);
> -
> -		goto done;
> +		return;
>  	}
>  	pc->mem_cgroup = mem;
>  	/*
> @@ -557,15 +570,39 @@ static int mem_cgroup_charge_common(stru
>  	__mem_cgroup_add_list(mz, pc);
>  	spin_unlock_irqrestore(&mz->lru_lock, flags);
>  	unlock_page_cgroup(pc);
> +}
>  
> -done:
> +/*
> + * Charge the memory controller for page usage.
> + * Return
> + * 0 if the charge was successful
> + * < 0 if the cgroup is over its limit
> + */
> +static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
> +				gfp_t gfp_mask, enum charge_type ctype,
> +				struct mem_cgroup *memcg)
> +{
> +	struct mem_cgroup *mem;
> +	struct page_cgroup *pc;
> +	int ret;
> +
> +	pc = lookup_page_cgroup(page);
> +	/* can happen at boot */
> +	if (unlikely(!pc))
> +		return 0;
> +	prefetchw(pc);
> +
> +	mem = memcg;
> +	ret = mem_cgroup_try_charge(mm, gfp_mask, &mem);
> +	if (ret)
> +		return ret;
> +
> +	__mem_cgroup_commit_charge(mem, pc, ctype);
>  	return 0;
> -out:
> -	css_put(&mem->css);
> -	return -ENOMEM;
>  }
>  
> -int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
> +int mem_cgroup_newpage_charge(struct page *page,
> +			      struct mm_struct *mm, gfp_t gfp_mask)
>  {
>  	if (mem_cgroup_subsys.disabled)
>  		return 0;
> @@ -586,6 +623,34 @@ int mem_cgroup_charge(struct page *page,
>  				MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
>  }
>  
> +/*
> + * same as mem_cgroup_newpage_charge(), now.
> + * But what we assume is different from newpage, and this is special case.
> + * treat this in special function. easy for maintainance.
> + */
> +
> +int mem_cgroup_charge_migrate_fixup(struct page *page,
> +				struct mm_struct *mm, gfp_t gfp_mask)
> +{
> +	if (mem_cgroup_subsys.disabled)
> +		return 0;
> +
> +	if (PageCompound(page))
> +		return 0;
> +
> +	if (page_mapped(page) || (page->mapping && !PageAnon(page)))
> +		return 0;
> +
> +	if (unlikely(!mm))
> +		mm = &init_mm;
> +
> +	return mem_cgroup_charge_common(page, mm, gfp_mask,
> +				MEM_CGROUP_CHARGE_TYPE_MAPPED, NULL);
> +}
> +
> +
> +
> +
>  int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
>  				gfp_t gfp_mask)
>  {
> @@ -628,6 +693,30 @@ int mem_cgroup_cache_charge(struct page 
>  				MEM_CGROUP_CHARGE_TYPE_SHMEM, NULL);
>  }
>  
> +
> +void mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *ptr)
> +{
> +	struct page_cgroup *pc;
> +
> +	if (mem_cgroup_subsys.disabled)
> +		return;
> +	if (!ptr)
> +		return;
> +	pc = lookup_page_cgroup(page);
> +	__mem_cgroup_commit_charge(ptr, pc, MEM_CGROUP_CHARGE_TYPE_MAPPED);
> +}
> +
> +void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *mem)
> +{
> +	if (mem_cgroup_subsys.disabled)
> +		return;
> +	if (!mem)
> +		return;
> +	res_counter_uncharge(&mem->res, PAGE_SIZE);
> +	css_put(&mem->css);
> +}
> +
> +
>  /*
>   * uncharge if !page_mapped(page)
>   */
> Index: mmotm-2.6.27-rc7+/mm/memory.c
> ===================================================================
> --- mmotm-2.6.27-rc7+.orig/mm/memory.c
> +++ mmotm-2.6.27-rc7+/mm/memory.c
> @@ -1891,7 +1891,7 @@ gotten:
>  	cow_user_page(new_page, old_page, address, vma);
>  	__SetPageUptodate(new_page);
>  
> -	if (mem_cgroup_charge(new_page, mm, GFP_KERNEL))
> +	if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))
>  		goto oom_free_new;
>  
>  	/*
> @@ -2287,6 +2287,7 @@ static int do_swap_page(struct mm_struct
>  	struct page *page;
>  	swp_entry_t entry;
>  	pte_t pte;
> +	struct mem_cgroup *ptr = NULL;
>  	int ret = 0;
>  
>  	if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
> @@ -2325,7 +2326,7 @@ static int do_swap_page(struct mm_struct
>  	lock_page(page);
>  	delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
>  
> -	if (mem_cgroup_charge(page, mm, GFP_KERNEL)) {
> +	if (mem_cgroup_try_charge(mm, GFP_KERNEL, &ptr) == -ENOMEM) {
>  		ret = VM_FAULT_OOM;
>  		unlock_page(page);
>  		goto out;
> @@ -2355,6 +2356,7 @@ static int do_swap_page(struct mm_struct
>  	flush_icache_page(vma, page);
>  	set_pte_at(mm, address, page_table, pte);
>  	page_add_anon_rmap(page, vma, address);
> +	mem_cgroup_commit_charge_swapin(page, ptr);
>  
>  	swap_free(entry);
>  	if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
> @@ -2375,7 +2377,7 @@ unlock:
>  out:
>  	return ret;
>  out_nomap:
> -	mem_cgroup_uncharge_page(page);
> +	mem_cgroup_cancel_charge_swapin(ptr);
>  	pte_unmap_unlock(page_table, ptl);
>  	unlock_page(page);
>  	page_cache_release(page);
> @@ -2405,7 +2407,7 @@ static int do_anonymous_page(struct mm_s
>  		goto oom;
>  	__SetPageUptodate(page);
>  
> -	if (mem_cgroup_charge(page, mm, GFP_KERNEL))
> +	if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))
>  		goto oom_free_page;
>  
>  	entry = mk_pte(page, vma->vm_page_prot);
> @@ -2498,7 +2500,7 @@ static int __do_fault(struct mm_struct *
>  				ret = VM_FAULT_OOM;
>  				goto out;
>  			}
> -			if (mem_cgroup_charge(page, mm, GFP_KERNEL)) {
> +			if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL)) {
>  				ret = VM_FAULT_OOM;
>  				page_cache_release(page);
>  				goto out;
> Index: mmotm-2.6.27-rc7+/mm/migrate.c
> ===================================================================
> --- mmotm-2.6.27-rc7+.orig/mm/migrate.c
> +++ mmotm-2.6.27-rc7+/mm/migrate.c
> @@ -133,7 +133,7 @@ static void remove_migration_pte(struct 
>  	 * be reliable, and this charge can actually fail: oh well, we don't
>  	 * make the situation any worse by proceeding as if it had succeeded.
>  	 */
> -	mem_cgroup_charge(new, mm, GFP_ATOMIC);
> +	mem_cgroup_charge_migrate_fixup(new, mm, GFP_ATOMIC);
>  
>  	get_page(new);
>  	pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
> Index: mmotm-2.6.27-rc7+/mm/swapfile.c
> ===================================================================
> --- mmotm-2.6.27-rc7+.orig/mm/swapfile.c
> +++ mmotm-2.6.27-rc7+/mm/swapfile.c
> @@ -530,17 +530,18 @@ unsigned int count_swap_pages(int type, 
>  static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
>  		unsigned long addr, swp_entry_t entry, struct page *page)
>  {
> +	struct mem_cgroup *ptr;
>  	spinlock_t *ptl;
>  	pte_t *pte;
>  	int ret = 1;
>  
> -	if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL))
> +	if (mem_cgroup_try_charge(vma->vm_mm, GFP_KERNEL, &ptr))
>  		ret = -ENOMEM;
>  
>  	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>  	if (unlikely(!pte_same(*pte, swp_entry_to_pte(entry)))) {
>  		if (ret > 0)
> -			mem_cgroup_uncharge_page(page);
> +			mem_cgroup_cancel_charge_swapin(ptr);
>  		ret = 0;
>  		goto out;
>  	}
> @@ -550,6 +551,7 @@ static int unuse_pte(struct vm_area_stru
>  	set_pte_at(vma->vm_mm, addr, pte,
>  		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
>  	page_add_anon_rmap(page, vma, addr);
> +	mem_cgroup_commit_charge_swapin(page, ptr);
>  	swap_free(entry);
>  	/*
>  	 * Move the page to the active list so it is not
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: Re: [PATCH 3/6] memcg: charge-commit-cancel protocol
  2008-10-01  7:57 ` [PATCH 3/6] memcg: charge-commit-cancel protocol KAMEZAWA Hiroyuki
                     ` (2 preceding siblings ...)
  2008-10-03 10:05   ` Daisuke Nishimura
@ 2008-10-03 15:15   ` kamezawa.hiroyu
  2008-10-03 15:25     ` Daisuke Nishimura
  2008-10-08  9:05   ` [RFC] memcg: handle migration by charge-commit-cancel (was " KAMEZAWA Hiroyuki
  4 siblings, 1 reply; 26+ messages in thread
From: kamezawa.hiroyu @ 2008-10-03 15:15 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: KAMEZAWA Hiroyuki, nishimura, linux-mm, LKML, balbir

----- Original Message -----
>> precharge/commit/cancel can be used for other places,
>>  - shmem, (and other places need precharge.)
>>  - move_account(force_empty) etc...
>> we'll revisit later.
>> 
>> Changelog v5 -> v6:
>>  - added newpage_charge() and migrate_fixup().
>>  - renamed  functions for swap-in from "swap" to "swapin"
>>  - add more precise description.
>> 
>
>I don't have any objection to this direction now, but I have one question.
>
>Does mem_cgroup_charge_migrate_fixup need to charge a newpage,
>while mem_cgroup_prepare_migration has charged it already?
In the migration-failed case, we have to charge the *old page* here.

>
>I agree adding I/F would be good for future, but I think
>mem_cgroup_charge_migrate_fixup can be a no-op function for now.
>
Hmm, handling failure case in explicit way may be better. Ok,
I'll try some.

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/6] memcg: charge-commit-cancel protocol
  2008-10-03 15:15   ` kamezawa.hiroyu
@ 2008-10-03 15:25     ` Daisuke Nishimura
  0 siblings, 0 replies; 26+ messages in thread
From: Daisuke Nishimura @ 2008-10-03 15:25 UTC (permalink / raw)
  To: kamezawa.hiroyu; +Cc: d-nishimura, Daisuke Nishimura, linux-mm, LKML, balbir

On Sat, 4 Oct 2008 00:15:17 +0900 (JST)
kamezawa.hiroyu@jp.fujitsu.com wrote:

> ----- Original Message -----
> >> precharge/commit/cancel can be used for other places,
> >>  - shmem, (and other places need precharge.)
> >>  - move_account(force_empty) etc...
> >> we'll revisit later.
> >> 
> >> Changelog v5 -> v6:
> >>  - added newpage_charge() and migrate_fixup().
> >>  - renamed  functions for swap-in from "swap" to "swapin"
> >>  - add more precise description.
> >> 
> >
> >I don't have any objection to this direction now, but I have one question.
> >
> >Does mem_cgroup_charge_migrate_fixup need to charge a newpage,
> >while mem_cgroup_prepare_migration has charged it already?
> In the migration-failed case, we have to charge the *old page* here.
> 
Ah... you are right.
Sorry for noise.


Daisuke Nishimura.

> >
> >I agree adding I/F would be good for future, but I think
> >mem_cgroup_charge_migrate_fixup can be a no-op function for now.
> >
> Hmm, handling failure case in explicit way may be better. Ok,
> I'll try some.
> 
> Thanks,
> -Kame
> 
> 
> 




^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/6] atomic page_cgroup flags
  2008-10-01  7:55 ` [PATCH 1/6] atomic page_cgroup flags KAMEZAWA Hiroyuki
@ 2008-10-06  7:42   ` Balbir Singh
  0 siblings, 0 replies; 26+ messages in thread
From: Balbir Singh @ 2008-10-06  7:42 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, LKML, nishimura

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2008-10-01 16:55:13]:

> This patch makes page_cgroup->flags to be atomic_ops and define
> functions (and macros) to access it.
> 
> Before trying to modify memory resource controller, this atomic operation
> on flags is necessary. Most of flags in this patch is for LRU and modfied
> under mz->lru_lock but we'll add another flags which is not for LRU soon.
> (lock_page_cgroup() will use LOCK bit on page_cgroup->flags)
> So we use atomic version here.
>

Seems quite straightforward

Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> 

-- 
	Balbir

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/6] memcg: allocate page_cgroup at boot
  2008-10-01  7:56 ` [PATCH 2/6] memcg: allocate page_cgroup at boot KAMEZAWA Hiroyuki
  2008-10-02  8:49   ` [PATCH 2/6] memcg: allocate page_cgroup at boot (hunk fix) KAMEZAWA Hiroyuki
@ 2008-10-06 10:11   ` Balbir Singh
  1 sibling, 0 replies; 26+ messages in thread
From: Balbir Singh @ 2008-10-06 10:11 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, LKML, nishimura

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2008-10-01 16:56:03]:

> Allocate all page_cgroup at boot and remove the page_cgroup pointer
> from struct page. This patch adds an interface as
> 
>  struct page_cgroup *lookup_page_cgroup(struct page*)
> 
> All of FLATMEM/DISCONTIGMEM/SPARSEMEM and MEMORY_HOTPLUG are supported.
> 
> Removing the page_cgroup pointer reduces the amount of memory by
>  - 4 bytes per PAGE_SIZE on 32 bit, and
>  - 8 bytes per PAGE_SIZE on 64 bit,
> even if the memory controller is disabled (but configured).
> 
> On usual 8GB x86-32 server, this saves 8MB of NORMAL_ZONE memory.
> On my x86-64 server with 48GB of memory, this saves 96MB of memory.
> I think this reduction makes sense.
> 
> By pre-allocation, kmalloc/kfree in charge/uncharge are removed. 
> This means
>   - we're not necessary to be afraid of kmalloc faiulre.
>   - we no longer need to be afraid of kmalloc failure.
>   - we can avoid calling kmalloc/kfree.
>   - we can avoid allocating tons of small objects which can be fragmented.
>   - we can know what amount of memory will be used for this extra-lru handling.
> 
> I added printk message as
> 
> 	"allocated %ld bytes of page_cgroup"
>         "please try cgroup_disable=memory option if you don't want"
> 
> maybe informative enough for users.
> 
> Changelog: v5 -> v6.
>  * reflected comments.
>  * coding style fixes.
>  * removed "ctype" from uncharge.
>  * improved comment to show FLAT_NODE_MEM_MAP == !SPARSEMEM
>  * fixed errors in !SPARSEMEM codes
>  * removed unused function in !SPARSEMEM codes.
> (start from v5 because of series..)
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

I like this patch very much

Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com>

-- 
	Balbir

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/6] memcg: allocate page_cgroup at boot (hunk fix)
  2008-10-02  8:49   ` [PATCH 2/6] memcg: allocate page_cgroup at boot (hunk fix) KAMEZAWA Hiroyuki
@ 2008-10-06 16:32     ` Balbir Singh
  0 siblings, 0 replies; 26+ messages in thread
From: Balbir Singh @ 2008-10-06 16:32 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, LKML, nishimura

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2008-10-02 17:49:45]:

> Index: mmotm-2.6.27-rc7+/include/linux/mm_types.h
> ===================================================================
> --- mmotm-2.6.27-rc7+.orig/include/linux/mm_types.h
> +++ mmotm-2.6.27-rc7+/include/linux/mm_types.h
> @@ -94,10 +94,6 @@ struct page {
>  	void *virtual;			/* Kernel virtual address (NULL if
>  					   not kmapped, ie. highmem) */
>  #endif /* WANT_PAGE_VIRTUAL */
> -#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> -	unsigned long page_cgroup;
> -#endif
> -

Just FYI, this hunk fails against mmotm Oct 2nd applied on top of
2.6.27-rc8; the blank line after the #endif is gone.

-- 
	Balbir

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/6] memcg update v6 (for review and discuss)
  2008-10-02  9:02 ` [PATCH 0/6] memcg update v6 (for review and discuss) KAMEZAWA Hiroyuki
@ 2008-10-06 17:26   ` Balbir Singh
  2008-10-07  1:22     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 26+ messages in thread
From: Balbir Singh @ 2008-10-06 17:26 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki, Andrew Morton; +Cc: linux-mm, LKML, nishimura

KAMEZAWA Hiroyuki wrote:
> On Wed, 1 Oct 2008 16:52:33 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
>> This series is update from v5.
>>
>> easy 4 patches are already posted as ready-to-go-series.
>>
>> This is need-more-discuss set.
>>
>> Includes following 6 patches. (reduced from v5).
>> The whole series are reordered.
>>
>> [1/6] make page_cgroup->flags to be atomic.
>> [2/6] allocate all page_cgroup at boot.
>> [3/6] rewrite charge path by charge/commit/cancel
>> [4/6] new force_empty and move_account
>> [5/6] lazy lru free
>> [6/6] lazy lru add.
>>
>> Patch [3/6] and [4/6] are totally rewritten.
>> Races in Patch [6/6] is fixed....I think.
>>
>> Patch [1-4] seems to be big but there is no complicated ops.
>> Patch [5-6] is more racy. Check-by-regression-test is necessary.
>> (Of course, I does some.)
>>
>> If ready-to-go-series goes, next is patch 1 and 2.
>>
> 
> No terrible bugs until now on my test.
> 
> My current idea for next week is following.
> (I may have to wait until the end of next merge window. If so, 
>  I'll wait and maintain this set.)
> 
>  - post ready-to-go set again.
>  - post 1/6 and 2/6 as may-ready-to-go set. I don't change the order of these.
>  - reflect comments for 3/6. 
>    patch 3/6 adds new functions. So, please tell me if you have better idea
>    about new functions.
>  - check logic for 4/6.
>  - 5/6 and 6/6 may need some more comments in codes.
>  - no new additional ones.

Kamezawa-San, Andrew,

I think patches 1 and 2 are ready to go. Andrew, they remove the page_cgroup member
from struct page and will help reduce the overhead for distros that care about
32-bit systems, and also help with performance (in my runs so far).

I would recommend pushing 1 and 2 right away to -mm followed by the other
performance improvements. Comments?


-- 
	Balbir

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/6] memcg update v6 (for review and discuss)
  2008-10-06 17:26   ` Balbir Singh
@ 2008-10-07  1:22     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-10-07  1:22 UTC (permalink / raw)
  To: balbir; +Cc: Andrew Morton, linux-mm, LKML, nishimura

On Mon, 06 Oct 2008 22:56:20 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> KAMEZAWA Hiroyuki wrote:
> > On Wed, 1 Oct 2008 16:52:33 +0900
> > KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > 
> >> This series is update from v5.
> >>
> >> easy 4 patches are already posted as ready-to-go-series.
> >>
> >> This is need-more-discuss set.
> >>
> >> Includes following 6 patches. (reduced from v5).
> >> The whole series are reordered.
> >>
> >> [1/6] make page_cgroup->flags to be atomic.
> >> [2/6] allocate all page_cgroup at boot.
> >> [3/6] rewrite charge path by charge/commit/cancel
> >> [4/6] new force_empty and move_account
> >> [5/6] lazy lru free
> >> [6/6] lazy lru add.
> >>
> >> Patch [3/6] and [4/6] are totally rewritten.
> >> Races in Patch [6/6] is fixed....I think.
> >>
> >> Patch [1-4] seems to be big but there is no complicated ops.
> >> Patch [5-6] is more racy. Check-by-regression-test is necessary.
> >> (Of course, I does some.)
> >>
> >> If ready-to-go-series goes, next is patch 1 and 2.
> >>
> > 
> > No terrible bugs until now on my test.
> > 
> > My current idea for next week is following.
> > (I may have to wait until the end of next merge window. If so, 
> >  I'll wait and maintain this set.)
> > 
> >  - post ready-to-go set again.
> >  - post 1/6 and 2/6 as may-ready-to-go set. I don't change the order of these.
> >  - reflect comments for 3/6. 
> >    patch 3/6 adds new functions. So, please tell me if you have better idea
> >    about new functions.
> >  - check logic for 4/6.
> >  - 5/6 and 6/6 may need some more comments in codes.
> >  - no new additional ones.
> 
> Kamezawa-San, Andrew,
> 
> I think patches 1 and 2 are ready to go. Andrew they remove the cgroup member
> from struct page and will help reduce the overhead for distros that care about
> 32 bit systems and also help with performance (in my runs so far).
> 
> I would recommend pushing 1 and 2 right away to -mm followed by the other
> performance improvements. Comments?
> 
> 

I'll rebase the 4 ready-to-go patches and these patches 1 and 2 onto the latest mmotm
and send them again.

Thank you for review.

Regards,
-Kame



> -- 
> 	Balbir
> 


-- 
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC] memcg: handle migration by charge-commit-cancel (was [PATCH 3/6] memcg: charge-commit-cancel protocol
  2008-10-01  7:57 ` [PATCH 3/6] memcg: charge-commit-cancel protocol KAMEZAWA Hiroyuki
                     ` (3 preceding siblings ...)
  2008-10-03 15:15   ` kamezawa.hiroyu
@ 2008-10-08  9:05   ` KAMEZAWA Hiroyuki
  4 siblings, 0 replies; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-10-08  9:05 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, LKML, balbir, nishimura

On Wed, 1 Oct 2008 16:57:34 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> For fixing this, I added a new interface.
>   - precharge
>    account to res_counter by PAGE_SIZE and try to free pages if necessary.
>   - commit	
>    register page_cgroup and add to LRU if necessary.
>   - cancel
>    uncharge PAGE_SIZE because of do_swap_page failure.
> 
> 
>      CPUA              
>   (1) charge (always)
>   (2) set page's rmap (mapcount > 0)
>   (3) commit charge was necessary or not after set_pte().
> 
> This protocol uses PCG_USED bit on page_cgroup for avoiding over accounting.
> Usual mem_cgroup_charge_common() does precharge -> commit at a time.
> 
> And this patch also adds the following functions to clarify all charges.
> 
>   - mem_cgroup_newpage_charge() ....replacement for mem_cgroup_charge()
> 	called against newly allocated anon pages.
> 
>   - mem_cgroup_charge_migrate_fixup()
>         called only from remove_migration_ptes().
> 	we'll have to rewrite this later.(this patch just keeps old behavior)
> 
> Good for clarifying "what we do"
> 

Not tested yet, but here is an idea for handling page migration in a better way.
What do you think?

Now, management of "charge" under page migration is done in the following
manner. (Assume we migrate page contents from oldpage to newpage.)

 Before
  - "newpage" is charged before migration.
 at success
  - "oldpage" is uncharged somewhere (unmap, radix-tree replace)
 at failure
  - "newpage" is uncharged.
  - "oldpage" is charged if necessary (*1)

But (*1) is not trustworthy.... because of GFP_ATOMIC, and we need a
special handler.

This patch tries to change the behavior as follows, using
pre-charge/commit/cancel ops.

 Before
  - charge PAGE_SIZE (no target page)
 success
  - commit charge against "newpage".
 failure
  - commit charge against "oldpage".
    (PCG_USED bit works effectively to avoid double-counting)
  - if "oldpage" is obsolete, cancel charge of PAGE_SIZE.
  
I think this is much simpler and more trustworthy.

This is based on "allocate all page_cgroup at boot" behavior.
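
For illustration, the caller side of the new protocol is roughly the following.
(Only a sketch; see the unmap_and_move() changes in the patch below for the real
flow, locking and error handling.)

	struct mem_cgroup *mem = NULL;
	int charge, rc;

	/* pre-charge PAGE_SIZE against the cgroup the old page belongs to */
	charge = mem_cgroup_prepare_migration(page, &mem);
	if (charge == -ENOMEM)
		return -ENOMEM;			/* nothing to roll back yet */

	rc = move_to_new_page(newpage, page);	/* the migration itself */

	/*
	 * Commit the pre-charge against whichever page is live afterwards:
	 * newpage at success, oldpage at failure.  The PCG_USED bit avoids
	 * double-counting, and the pre-charge is cancelled when the old
	 * anon page is already unmapped.
	 */
	mem_cgroup_end_migration(mem, page, newpage);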


Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 include/linux/memcontrol.h |   19 ++++-------
 mm/memcontrol.c            |   72 +++++++++++++++++++++++++++++----------------
 mm/migrate.c               |   42 ++++++++------------------
 3 files changed, 68 insertions(+), 65 deletions(-)

Index: mmotm-2.6.27-rc8+/mm/migrate.c
===================================================================
--- mmotm-2.6.27-rc8+.orig/mm/migrate.c
+++ mmotm-2.6.27-rc8+/mm/migrate.c
@@ -121,20 +121,6 @@ static void remove_migration_pte(struct 
 	if (!is_migration_entry(entry) || migration_entry_to_page(entry) != old)
 		goto out;
 
-	/*
-	 * Yes, ignore the return value from a GFP_ATOMIC mem_cgroup_charge.
-	 * Failure is not an option here: we're now expected to remove every
-	 * migration pte, and will cause crashes otherwise.  Normally this
-	 * is not an issue: mem_cgroup_prepare_migration bumped up the old
-	 * page_cgroup count for safety, that's now attached to the new page,
-	 * so this charge should just be another incrementation of the count,
-	 * to keep in balance with rmap.c's mem_cgroup_uncharging.  But if
-	 * there's been a force_empty, those reference counts may no longer
-	 * be reliable, and this charge can actually fail: oh well, we don't
-	 * make the situation any worse by proceeding as if it had succeeded.
-	 */
-	mem_cgroup_charge_migrate_fixup(new, mm, GFP_ATOMIC);
-
 	get_page(new);
 	pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
 	if (is_write_migration_entry(entry))
@@ -382,9 +368,6 @@ static void migrate_page_copy(struct pag
 	anon = PageAnon(page);
 	page->mapping = NULL;
 
-	if (!anon) /* This page was removed from radix-tree. */
-		mem_cgroup_uncharge_cache_page(page);
-
 	/*
 	 * If any waiters have accumulated on the new page then
 	 * wake them up.
@@ -621,6 +604,7 @@ static int unmap_and_move(new_page_t get
 	struct page *newpage = get_new_page(page, private, &result);
 	int rcu_locked = 0;
 	int charge = 0;
+	struct mem_cgroup *mem;
 
 	if (!newpage)
 		return -ENOMEM;
@@ -630,24 +614,26 @@ static int unmap_and_move(new_page_t get
 		goto move_newpage;
 	}
 
-	charge = mem_cgroup_prepare_migration(page, newpage);
-	if (charge == -ENOMEM) {
-		rc = -ENOMEM;
-		goto move_newpage;
-	}
 	/* prepare cgroup just returns 0 or -ENOMEM */
-	BUG_ON(charge);
-
 	rc = -EAGAIN;
+
 	if (!trylock_page(page)) {
 		if (!force)
 			goto move_newpage;
 		lock_page(page);
 	}
 
+	/* precharge against new page */
+	charge = mem_cgroup_prepare_migration(page, &mem);
+	if (charge == -ENOMEM) {
+		rc = -ENOMEM;
+		goto unlock;
+	}
+	BUG_ON(charge);
+
 	if (PageWriteback(page)) {
 		if (!force)
-			goto unlock;
+			goto uncharge;
 		wait_on_page_writeback(page);
 	}
 	/*
@@ -700,7 +686,9 @@ static int unmap_and_move(new_page_t get
 rcu_unlock:
 	if (rcu_locked)
 		rcu_read_unlock();
-
+uncharge:
+	if (!charge)
+		mem_cgroup_end_migration(mem, page, newpage);
 unlock:
 	unlock_page(page);
 
@@ -716,8 +704,6 @@ unlock:
 	}
 
 move_newpage:
-	if (!charge)
-		mem_cgroup_end_migration(newpage);
 
 	/*
 	 * Move the new page to the LRU. If migration was not successful
Index: mmotm-2.6.27-rc8+/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.27-rc8+.orig/include/linux/memcontrol.h
+++ mmotm-2.6.27-rc8+/include/linux/memcontrol.h
@@ -29,8 +29,6 @@ struct mm_struct;
 
 extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask);
-extern int mem_cgroup_charge_migrate_fixup(struct page *page,
-				struct mm_struct *mm, gfp_t gfp_mask);
 /* for swap handling */
 extern int mem_cgroup_try_charge(struct mm_struct *mm,
 		gfp_t gfp_mask, struct mem_cgroup **ptr);
@@ -60,8 +58,9 @@ extern struct mem_cgroup *mem_cgroup_fro
 	((cgroup) == mem_cgroup_from_task((mm)->owner))
 
 extern int
-mem_cgroup_prepare_migration(struct page *page, struct page *newpage);
-extern void mem_cgroup_end_migration(struct page *page);
+mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr);
+extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
+	struct page *oldpage, struct page *newpage);
 
 /*
  * For memory reclaim.
@@ -94,12 +93,6 @@ static inline int mem_cgroup_cache_charg
 	return 0;
 }
 
-static inline int mem_cgroup_charge_migrate_fixup(struct page *page,
-					struct mm_struct *mm, gfp_t gfp_mask)
-{
-	return 0;
-}
-
 static int mem_cgroup_try_charge(struct mm_struct *mm,
 				gfp_t gfp_mask, struct mem_cgroup **ptr)
 {
@@ -143,12 +136,14 @@ static inline int task_in_mem_cgroup(str
 }
 
 static inline int
-mem_cgroup_prepare_migration(struct page *page, struct page *newpage)
+mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr)
 {
 	return 0;
 }
 
-static inline void mem_cgroup_end_migration(struct page *page)
+static inline void mem_cgroup_end_migration(struct mem_cgroup *mem,
+					struct page *oldpage,
+					struct page *newpage)
 {
 }
 
Index: mmotm-2.6.27-rc8+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc8+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc8+/mm/memcontrol.c
@@ -780,11 +780,10 @@ void mem_cgroup_uncharge_cache_page(stru
 /*
  * Before starting migration, account against new page.
  */
-int mem_cgroup_prepare_migration(struct page *page, struct page *newpage)
+int mem_cgroup_prepare_migration(struct page *page, struct mem_cgroup **ptr)
 {
 	struct page_cgroup *pc;
 	struct mem_cgroup *mem = NULL;
-	enum charge_type ctype = MEM_CGROUP_CHARGE_TYPE_MAPPED;
 	int ret = 0;
 
 	if (mem_cgroup_subsys.disabled)
@@ -795,43 +794,66 @@ int mem_cgroup_prepare_migration(struct 
 	if (PageCgroupUsed(pc)) {
 		mem = pc->mem_cgroup;
 		css_get(&mem->css);
-		if (PageCgroupCache(pc)) {
-			if (page_is_file_cache(page))
-				ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
-			else
-				ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;
-		}
 	}
 	unlock_page_cgroup(pc);
 	if (mem) {
-		ret = mem_cgroup_charge_common(newpage, NULL, GFP_KERNEL,
-			ctype, mem);
+		ret = mem_cgroup_try_charge(NULL, GFP_KERNEL, &mem);
 		css_put(&mem->css);
+		*ptr = mem;
 	}
 	return ret;
 }
 
 /* remove redundant charge if migration failed*/
-void mem_cgroup_end_migration(struct page *newpage)
+void mem_cgroup_end_migration(struct mem_cgroup *mem,
+		struct page *oldpage, struct page *newpage)
 {
+	struct page *target;
+	struct page_cgroup *pc;
+	enum charge_type ctype;
+
+	/* at migration success, oldpage->mapping is *always* NULL */
+	if (oldpage->mapping)
+		target = oldpage;
+	else
+		target = newpage;
+
+	if (PageAnon(target))
+		ctype = MEM_CGROUP_CHARGE_TYPE_MAPPED;
+	else if (page_is_file_cache(target))
+		ctype = MEM_CGROUP_CHARGE_TYPE_CACHE;
+	else
+		ctype = MEM_CGROUP_CHARGE_TYPE_SHMEM;
+
 	/*
-	 * At success, page->mapping is not NULL.
-	 * special rollback care is necessary when
-	 * 1. at migration failure. (newpage->mapping is cleared in this case)
-	 * 2. the newpage was moved but not remapped again because the task
-	 *    exits and the newpage is obsolete. In this case, the new page
-	 *    may be a swapcache. So, we just call mem_cgroup_uncharge_page()
-	 *    always for avoiding mess. The  page_cgroup will be removed if
-	 *    unnecessary. File cache pages is still on radix-tree. Don't
-	 *    care it.
+	 * Both of oldpage and newpage are still under lock_page().
+	 * Then, we don't have to care about race in radix-tree.
+	 * But we have to be careful that this page is unmapped or not.
 	 */
-	if (!newpage->mapping)
-		__mem_cgroup_uncharge_common(newpage,
-				MEM_CGROUP_CHARGE_TYPE_FORCE);
-	else if (PageAnon(newpage))
-		mem_cgroup_uncharge_page(newpage);
+	if (ctype == MEM_CGROUP_CHARGE_TYPE_MAPPED) {
+		/*
+		 * 2 caes for !page_mapped().
+		 * 1. remove_migration_pte() cannot remap this again.
+		 *    because the page is zapped.
+		 * 2. remove_migration_pte() successfully mapped it but
+		 *    it's unmapped, now.
+		 */
+		if (!page_mapped(target)) {
+			res_counter_uncharge(&mem->res, PAGE_SIZE);
+			css_put(&mem->css);
+			return;
+		}
+	}
+		
+	pc = lookup_page_cgroup(target);
+	/*
+	 * __mem_cgroup_commit_charge() check PCG_USED bit of page_cgroup.
+	 * So, double-counting is effectively avoided.
+	 */
+	__mem_cgroup_commit_charge(mem, pc, ctype);
 }
 
+
 /*
  * A call to try to shrink memory usage under specified resource controller.
  * This is typically used for page reclaiming for shmem for reducing side












^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/6] memcg: lazy lru freeing
  2008-10-01  8:00 ` [PATCH 5/6] memcg: lazy lru freeing KAMEZAWA Hiroyuki
@ 2008-10-09  5:39   ` Daisuke Nishimura
  2008-10-09  6:26     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 26+ messages in thread
From: Daisuke Nishimura @ 2008-10-09  5:39 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: nishimura, linux-mm, LKML, balbir

On Wed, 1 Oct 2008 17:00:05 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> Free page_cgroup from its LRU in batched manner.
> 
> When uncharge() is called, the page is pushed onto a per-cpu vector and
> removed from the LRU later. This routine resembles the global LRU's pagevec.
> This patch is half of the whole patch and a set with following lazy LRU add
> patch.
> 
> After this, a pc, which is PageCgroupLRU(pc)==true, is on LRU.
> This LRU bit is guarded by lru_lock().
> 
>  PageCgroupUsed(pc) && PageCgroupLRU(pc) means "pc" is used and on LRU.
>  This check makes sense only when both locks, lock_page_cgroup()/lru_lock(),
>  are acquired.
> 
>  PageCgroupUsed(pc) && !PageCgroupLRU(pc) means "pc" is used but not on LRU.
>  !PageCgroupUsed(pc) && PageCgroupLRU(pc) means "pc" is unused but still on
>  LRU. lru walk routine should avoid touching this.
> 
> Changelog (v5) => (v6):
>  - Fixing race and added PCG_LRU bit
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 

(snip)

> +static void
> +__release_page_cgroup(struct memcg_percpu_vec *mpv)
> +{
> +	unsigned long flags;
> +	struct mem_cgroup_per_zone *mz, *prev_mz;
> +	struct page_cgroup *pc;
> +	int i, nr;
> +
> +	local_irq_save(flags);
> +	nr = mpv->nr;
> +	mpv->nr = 0;
> +	prev_mz = NULL;
> +	for (i = nr - 1; i >= 0; i--) {
> +		pc = mpv->vec[i];
> +		mz = page_cgroup_zoneinfo(pc);
> +		if (prev_mz != mz) {
> +			if (prev_mz)
> +				spin_unlock(&prev_mz->lru_lock);
> +			prev_mz = mz;
> +			spin_lock(&mz->lru_lock);
> +		}
> +		/*
> +		 * this "pc" may be charge()->uncharge() while we are waiting
> +		 * for this. But charge() path check LRU bit and remove this
> +		 * from LRU if necessary.
> +		 */
> +		if (!PageCgroupUsed(pc) && PageCgroupLRU(pc)) {
> +			ClearPageCgroupLRU(pc);
> +			__mem_cgroup_remove_list(mz, pc);
> +			css_put(&pc->mem_cgroup->css);
> +		}
> +	}
> +	if (prev_mz)
> +		spin_unlock(&prev_mz->lru_lock);
> +	local_irq_restore(flags);
> +
> +}
> +
I'm wondering if page_cgroup_zoneinfo is safe without lock_page_cgroup
because it dereferences pc->mem_cgroup.
I'm worried that the pc has been moved to another lru by re-charge (and re-uncharge),
and __mem_cgroup_remove_list touches the wrong (old) group.

Hmm, there are many things to be done for re-charge and re-uncharge,
so "if (!PageCgroupUsed(pc) && PageCgroupLRU(pc))" would be enough.
(it can avoid the race with re-charge.)

Another user of page_cgroup_zoneinfo without lock_page_cgroup is
__mem_cgroup_move_lists called by mem_cgroup_isolate_pages,
but mem_cgroup_isolate_pages handles pc which is actually on the mz->lru
so it would be ok.
(I think adding VM_BUG_ON(mz != page_cgroup_zoneinfo(pc)) would make sense,
or adding a new arg *mz to __mem_cgroup_move_lists?)
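
For illustration, the latter would look something like this (just a sketch of a
hypothetical signature change, not something in the posted patches):

	/* caller passes the mz whose lru_lock it already holds */
	static void __mem_cgroup_move_lists(struct mem_cgroup_per_zone *mz,
					    struct page_cgroup *pc,
					    enum lru_list lru)
	{
		/* catch pc being re-charged into another zone/memcg under us */
		VM_BUG_ON(mz != page_cgroup_zoneinfo(pc));
		/* ... existing list-move logic, unchanged ... */
	}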


Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 6/6] memcg: lazy lru addition
  2008-10-01  8:01 ` [PATCH 6/6] memcg: lazy lru addition KAMEZAWA Hiroyuki
@ 2008-10-09  6:21   ` Daisuke Nishimura
  2008-10-09  6:51     ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 26+ messages in thread
From: Daisuke Nishimura @ 2008-10-09  6:21 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: nishimura, linux-mm, LKML, balbir

On Wed, 1 Oct 2008 17:01:19 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> Delay add_to_lru() and do it in a batched manner, like pagevec.
> For doing that, the 2 flags PCG_USED and PCG_LRU are used.
> 
> Because __set_page_cgroup_lru() itself doesn't take lock_page_cgroup(),
> we need a sanity check inside lru_lock().
> 
> And this delaying make css_put()/get() complicated. 
> To make it clear,
>  * css_get() is called from mem_cgroup_add_list().
>  * css_put() is called from mem_cgroup_remove_list().
>  * css_get()->css_put() is called while try_charge()->commit/cancel sequence.
> is newly added.
> 

I like this new policy, but

> @@ -710,17 +774,18 @@ static void __mem_cgroup_commit_charge(s

===
                if (PageCgroupLRU(pc)) {
                        ClearPageCgroupLRU(pc);
                        __mem_cgroup_remove_list(mz, pc);
                        css_put(&pc->mem_cgroup->css);
                }
                spin_unlock_irqrestore(&mz->lru_lock, flags);
        }
===

Is this css_put still needed?

>  	/* Here, PCG_LRU bit is cleared */
>  	pc->mem_cgroup = mem;
>  	/*
> +	 * We have to set pc->mem_cgroup before setting the USED bit to avoid
> +	 * a race with a (delayed) __set_page_cgroup_lru() on another cpu.
> +	 */
> +	smp_wmb();
> +	/*
>  	 * below pcg_default_flags includes PCG_LOCK bit.
>  	 */
>  	pc->flags = pcg_default_flags[ctype];
>  	unlock_page_cgroup(pc);
>  
> -	mz = page_cgroup_zoneinfo(pc);
> -
> -	spin_lock_irqsave(&mz->lru_lock, flags);
> -	__mem_cgroup_add_list(mz, pc, true);
> -	SetPageCgroupLRU(pc);
> -	spin_unlock_irqrestore(&mz->lru_lock, flags);
> +	set_page_cgroup_lru(pc);
> +	css_put(&mem->css);
>  }
>  
>  /**
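
Just to spell out the ordering here: the smp_wmb() above wants a matching read
barrier on the lazy-add side. The reader half below is only my sketch of where
such an smp_rmb() would sit, not code taken from this patch:

	/* writer: __mem_cgroup_commit_charge(), as in the hunk above */
	pc->mem_cgroup = mem;			/* (A) publish the new group */
	smp_wmb();				/* order (A) before (B) */
	pc->flags = pcg_default_flags[ctype];	/* (B) sets PCG_USED (and PCG_LOCK) */

	/* reader: a delayed __set_page_cgroup_lru() on another cpu (sketch) */
	if (!PageCgroupUsed(pc))
		return;				/* (B) not visible yet, nothing to add */
	smp_rmb();				/* pairs with the smp_wmb() above */
	mz = page_cgroup_zoneinfo(pc);		/* now sees the (A) value */

Without a read-side barrier, a weakly ordered CPU could observe the USED bit
and still compute zoneinfo from the old pc->mem_cgroup.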


Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/6] memcg: lazy lru freeing
  2008-10-09  5:39   ` Daisuke Nishimura
@ 2008-10-09  6:26     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-10-09  6:26 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: linux-mm, LKML, balbir

On Thu, 9 Oct 2008 14:39:49 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Wed, 1 Oct 2008 17:00:05 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > Free page_cgroup from its LRU in a batched manner.
> > 
> > When uncharge() is called, the page is pushed onto a per-cpu vector and
> > removed from the LRU later. This routine resembles the global LRU's pagevec.
> > This patch is half of the whole work and forms a set with the following
> > lazy LRU add patch.
> > 
> > After this, a pc with PageCgroupLRU(pc)==true is on the LRU.
> > This LRU bit is guarded by lru_lock().
> > 
> >  PageCgroupUsed(pc) && PageCgroupLRU(pc) means "pc" is used and on the LRU.
> >  This check makes sense only when both locks, lock_page_cgroup()/lru_lock(),
> >  are acquired.
> > 
> >  PageCgroupUsed(pc) && !PageCgroupLRU(pc) means "pc" is used but not on the LRU.
> >  !PageCgroupUsed(pc) && PageCgroupLRU(pc) means "pc" is unused but still on the
> >  LRU. LRU walk routines should avoid touching it.
> > 
> > Changelog (v5) => (v6):
> >  - Fixing race and added PCG_LRU bit
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> 
> (snip)
> 
> > +static void
> > +__release_page_cgroup(struct memcg_percpu_vec *mpv)
> > +{
> > +	unsigned long flags;
> > +	struct mem_cgroup_per_zone *mz, *prev_mz;
> > +	struct page_cgroup *pc;
> > +	int i, nr;
> > +
> > +	local_irq_save(flags);
> > +	nr = mpv->nr;
> > +	mpv->nr = 0;
> > +	prev_mz = NULL;
> > +	for (i = nr - 1; i >= 0; i--) {
> > +		pc = mpv->vec[i];
> > +		mz = page_cgroup_zoneinfo(pc);
> > +		if (prev_mz != mz) {
> > +			if (prev_mz)
> > +				spin_unlock(&prev_mz->lru_lock);
> > +			prev_mz = mz;
> > +			spin_lock(&mz->lru_lock);
> > +		}
> > +		/*
> > +		 * this "pc" may go through charge()->uncharge() while we are
> > +		 * waiting for this. But the charge() path checks the LRU bit
> > +		 * and removes it from the LRU if necessary.
> > +		 */
> > +		if (!PageCgroupUsed(pc) && PageCgroupLRU(pc)) {
> > +			ClearPageCgroupLRU(pc);
> > +			__mem_cgroup_remove_list(mz, pc);
> > +			css_put(&pc->mem_cgroup->css);
> > +		}
> > +	}
> > +	if (prev_mz)
> > +		spin_unlock(&prev_mz->lru_lock);
> > +	local_irq_restore(flags);
> > +
> > +}
> > +
> I'm wondering whether page_cgroup_zoneinfo is safe without lock_page_cgroup,
> because it dereferences pc->mem_cgroup.
> I'm worried that the pc may have been moved to another LRU by a re-charge (and
> re-uncharge), and that __mem_cgroup_remove_list then touches the wrong (old) group.
> 
> Hmm, there are many things to be done for a re-charge and re-uncharge,
> so the "if (!PageCgroupUsed(pc) && PageCgroupLRU(pc))" check would be enough.
> (It can avoid the race with a re-charge.)
> 
It's safe just because I added the following check:

+	/*
+	 * This page_cgroup is not used but may be on LRU.
+	 */
+	if (unlikely(PageCgroupLRU(pc))) {
+		/*
+	 * pc->mem_cgroup has old information. force_empty() guarantees
+	 * that we never see a stale mem_cgroup here.
+		 */
+		mz = page_cgroup_zoneinfo(pc);
+		spin_lock_irqsave(&mz->lru_lock, flags);
+		if (PageCgroupLRU(pc)) {
+			ClearPageCgroupLRU(pc);
+			__mem_cgroup_remove_list(mz, pc);
+			css_put(&pc->mem_cgroup->css);
+		}
+		spin_unlock_irqrestore(&mz->lru_lock, flags);
+	}
+	/* Here, PCG_LRU bit is cleared */

So before the pc is reused, the LRU bit has already been cleared.


> Another user of page_cgroup_zoneinfo without lock_page_cgroup is
> __mem_cgroup_move_lists, called by mem_cgroup_isolate_pages,
> but mem_cgroup_isolate_pages only handles pc's that are actually on the mz->lru,
> so it should be OK.
> (I think adding VM_BUG_ON(mz != page_cgroup_zoneinfo(pc)) would make sense,
> or adding a new *mz argument to __mem_cgroup_move_lists?)
> 
OK, I'll add a VM_BUG_ON().

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 6/6] memcg: lazy lru addition
  2008-10-09  6:21   ` Daisuke Nishimura
@ 2008-10-09  6:51     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 26+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-10-09  6:51 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: linux-mm, LKML, balbir

On Thu, 9 Oct 2008 15:21:32 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Wed, 1 Oct 2008 17:01:19 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > Delay add_to_lru() and do it in a batched manner like pagevec.
> > For doing that, the two flags PCG_USED and PCG_LRU are used.
> > 
> > Because __set_page_cgroup_lru() itself doesn't take lock_page_cgroup(),
> > we need a sanity check inside lru_lock().
> > 
> > And this delaying makes the css_get()/css_put() handling complicated.
> > To make it clear, the following policy is newly added:
> >  * css_get() is called from mem_cgroup_add_list().
> >  * css_put() is called from mem_cgroup_remove_list().
> >  * a css_get()/css_put() pair covers the try_charge()->commit/cancel sequence.
> > 
> 
> I like this new policy, but
> 
> > @@ -710,17 +774,18 @@ static void __mem_cgroup_commit_charge(s
> 
> ===
>                 if (PageCgroupLRU(pc)) {
>                         ClearPageCgroupLRU(pc);
>                         __mem_cgroup_remove_list(mz, pc);
>                         css_put(&pc->mem_cgroup->css);
>                 }
>                 spin_unlock_irqrestore(&mz->lru_lock, flags);
>         }
> ===
> 
> Is this css_put still needed?
> 
Oh, nice catch. It's unnecessary.
I'll fix it in the next version. Thank you for the review.
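
Roughly, that hunk would become something like this (only a sketch, assuming
the remove path itself now drops the css reference as the changelog describes):

		if (PageCgroupLRU(pc)) {
			ClearPageCgroupLRU(pc);
			/* the remove path drops the css reference under the
			 * new get/put policy, so no extra css_put() here.
			 */
			__mem_cgroup_remove_list(mz, pc);
		}
		spin_unlock_irqrestore(&mz->lru_lock, flags);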


I'll post the still-under-discussion set (v7) tomorrow.
It includes:
  - charge/commit/cancel
  - move account & force_empty
  - lazy lru free
  - lazy lru add
It currently works well under my tests.

Next week, I'd like to restart the Mem+Swap series.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2008-10-09  6:51 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-10-01  7:52 [PATCH 0/6] memcg update v6 (for review and discuss) KAMEZAWA Hiroyuki
2008-10-01  7:55 ` [PATCH 1/6] atomic page_cgroup flags KAMEZAWA Hiroyuki
2008-10-06  7:42   ` Balbir Singh
2008-10-01  7:56 ` [PATCH 2/6] memcg: allocate page_cgroup at boot KAMEZAWA Hiroyuki
2008-10-02  8:49   ` [PATCH 2/6] memcg: allocate page_cgroup at boot (hunk fix) KAMEZAWA Hiroyuki
2008-10-06 16:32     ` Balbir Singh
2008-10-06 10:11   ` [PATCH 2/6] memcg: allocate page_cgroup at boot Balbir Singh
2008-10-01  7:57 ` [PATCH 3/6] memcg: charge-commit-cancel protocl KAMEZAWA Hiroyuki
2008-10-01  8:33   ` Daisuke Nishimura
2008-10-01 10:04   ` kamezawa.hiroyu
2008-10-03 10:05   ` Daisuke Nishimura
2008-10-03 15:15   ` kamezawa.hiroyu
2008-10-03 15:25     ` Daisuke Nishimura
2008-10-08  9:05   ` [RFC] memcg: handle migration by charge-commit-cancel (was " KAMEZAWA Hiroyuki
2008-10-01  7:59 ` [PATCH 4/6] memcg: new force_empty and move_account KAMEZAWA Hiroyuki
2008-10-01 16:38   ` Randy Dunlap
2008-10-02  5:13     ` KAMEZAWA Hiroyuki
2008-10-01  8:00 ` [PATCH 5/6] memcg: lazy lru freeing KAMEZAWA Hiroyuki
2008-10-09  5:39   ` Daisuke Nishimura
2008-10-09  6:26     ` KAMEZAWA Hiroyuki
2008-10-01  8:01 ` [PATCH 6/6] memcg: lazy lru addition KAMEZAWA Hiroyuki
2008-10-09  6:21   ` Daisuke Nishimura
2008-10-09  6:51     ` KAMEZAWA Hiroyuki
2008-10-02  9:02 ` [PATCH 0/6] memcg update v6 (for review and discuss) KAMEZAWA Hiroyuki
2008-10-06 17:26   ` Balbir Singh
2008-10-07  1:22     ` KAMEZAWA Hiroyuki
