* [PATCH 0/7] memcg : more fixes and clean up for 2.6.38-rc
From: KAMEZAWA Hiroyuki @ 2011-01-21  6:34 UTC
  To: linux-mm; +Cc: linux-kernel, nishimura, hannes, balbir, akpm


This is a set of patches which I'm now testing; it has passed a small
test, so I'm posting it.

Some are bug fixes and others are clean-ups, but I think these are all for 2.6.38.

Brief description:

[1/7] remove a buggy comment and use a better name in mem_cgroup_move_parent().
      The fixes for mem_cgroup_move_parent() are already in mainline; this is
      an add-on.

[2/7] a bug fix for the new function mem_cgroup_split_huge_fixup(),
      which was recently merged.

[3/7] preparation for the fixes in [4/7] and [5/7]. This is an enhancement
      of an existing function.

[4/7] fix mem_cgroup_charge() for THP. With this, the memory cgroup's charge
      function will handle THP requests in a sane way.

[5/7] fix the khugepaged scan condition for memcg.
      This fixes hangs of processes under a small/busy memory cgroup.

[6/7] rename variables to page_size, nr_pages, and bytes rather than
      ambiguous names.

[7/7] some memcg functions require the caller to initialize a variable
      before the call. That is ugly; fix it.


I think patches 1-5 are the urgent ones, but patch 5 needs a good
review. Without patch 5, a stress test on a small memory cgroup will
not run successfully.

Thanks,
-Kame

* [PATCH 1/7] memcg : comment, style fixes for recent patch of move_parent
From: KAMEZAWA Hiroyuki @ 2011-01-21  6:37 UTC
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, nishimura, hannes, balbir, akpm

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

A fix for commit 987eba66e0e6aa654d60881a14731a353ee0acb4.

A clean-up for mem_cgroup_move_parent():
 - remove an unnecessary initialization of a local variable.
 - rename charge_size -> page_size (see the note below).
 - remove an unnecessary (wrong) comment.
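
For reference, the size computed in the hunk below behaves like this
(a note added here; the concrete numbers assume x86 with 4KB pages and
HPAGE_PMD_ORDER == 9):

	page_size = PAGE_SIZE << compound_order(page);
	/* order 0 (base page)     -> page_size == PAGE_SIZE (4KB)  */
	/* order 9 (THP head page) -> page_size == HPAGE_SIZE (2MB) */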

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

Index: mmotm-0107/mm/memcontrol.c
===================================================================
--- mmotm-0107.orig/mm/memcontrol.c
+++ mmotm-0107/mm/memcontrol.c
@@ -2265,7 +2265,7 @@ static int mem_cgroup_move_parent(struct
 	struct cgroup *cg = child->css.cgroup;
 	struct cgroup *pcg = cg->parent;
 	struct mem_cgroup *parent;
-	int charge = PAGE_SIZE;
+	int page_size;
 	unsigned long flags;
 	int ret;
 
@@ -2278,22 +2278,23 @@ static int mem_cgroup_move_parent(struct
 		goto out;
 	if (isolate_lru_page(page))
 		goto put;
-	/* The page is isolated from LRU and we have no race with splitting */
-	charge = PAGE_SIZE << compound_order(page);
+
+	page_size = PAGE_SIZE << compound_order(page);
 
 	parent = mem_cgroup_from_cont(pcg);
-	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, charge);
+	ret = __mem_cgroup_try_charge(NULL, gfp_mask,
+				&parent, false, page_size);
 	if (ret || !parent)
 		goto put_back;
 
-	if (charge > PAGE_SIZE)
+	if (page_size > PAGE_SIZE)
 		flags = compound_lock_irqsave(page);
 
-	ret = mem_cgroup_move_account(pc, child, parent, true, charge);
+	ret = mem_cgroup_move_account(pc, child, parent, true, page_size);
 	if (ret)
-		mem_cgroup_cancel_charge(parent, charge);
+		mem_cgroup_cancel_charge(parent, page_size);
 put_back:
-	if (charge > PAGE_SIZE)
+	if (page_size > PAGE_SIZE)
 		compound_unlock_irqrestore(page, flags);
 	putback_lru_page(page);
 put:



* [PATCH 2/7] memcg : more fixes and clean up for 2.6.38-rc
From: KAMEZAWA Hiroyuki @ 2011-01-21  6:39 UTC
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, nishimura, hannes, balbir, akpm

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

This is a fix for commit ca3e021417eed30ec2b64ce88eb0acf64aa9bc29.

mem_cgroup_disabled() should be checked at splitting.
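
For context (my assumption, based on how mem_cgroup_disabled() is used
at other memcg entry points): when the kernel boots with
cgroup_disable=memory, memcg state such as the page_cgroup data is
never set up, so the split-time fixup must bail out before touching it.
The usual guard pattern, which the hunk below adds here too:

	if (mem_cgroup_disabled())	/* e.g. booted with cgroup_disable=memory */
		return;			/* no memcg state to fix up */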

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |    2 ++
 1 file changed, 2 insertions(+)

Index: mmotm-0107/mm/memcontrol.c
===================================================================
--- mmotm-0107.orig/mm/memcontrol.c
+++ mmotm-0107/mm/memcontrol.c
@@ -2144,6 +2144,8 @@ void mem_cgroup_split_huge_fixup(struct 
 	struct page_cgroup *tail_pc = lookup_page_cgroup(tail);
 	unsigned long flags;
 
+	if (mem_cgroup_disabled())
+		return;
 	/*
 	 * We have no races with charge/uncharge but will have races with
 	 * page state accounting.



* [PATCH 3/7] memcg : fix mem_cgroup_check_under_limit
From: KAMEZAWA Hiroyuki @ 2011-01-21  6:41 UTC
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, nishimura, hannes, balbir, akpm

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

The current memory cgroup code tends to assume page_size == PAGE_SIZE,
but THP charges HPAGE_SIZE at once.

This is one of the fixes for supporting THP. It modifies
mem_cgroup_check_under_limit() to take page_size into account.
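
To make the difference concrete (the numbers are hypothetical): with
limit = 200MB and usage = 199MB, the old check passes even though a 2MB
THP charge cannot fit; the new check asks for an explicit margin:

	res_counter_check_under_limit(&mem->res)	  /* 199MB < 200MB -> true  */
	res_counter_check_margin(&mem->res) >= page_size  /* 1MB margin < 2MB -> false */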

Complete fixes for do_charge()/memory reclaim will follow this patch.

TODO: with this, the reclaim function can take page_size as an
argument, so there may be something more to improve.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/res_counter.h |   11 +++++++++++
 mm/memcontrol.c             |   27 ++++++++++++++-------------
 2 files changed, 25 insertions(+), 13 deletions(-)

Index: mmotm-0107/include/linux/res_counter.h
===================================================================
--- mmotm-0107.orig/include/linux/res_counter.h
+++ mmotm-0107/include/linux/res_counter.h
@@ -182,6 +182,17 @@ static inline bool res_counter_check_und
 	return ret;
 }
 
+static inline s64 res_counter_check_margin(struct res_counter *cnt)
+{
+	s64 ret;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	ret = cnt->limit - cnt->usage;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return ret;
+}
+
 static inline bool res_counter_check_under_soft_limit(struct res_counter *cnt)
 {
 	bool ret;
Index: mmotm-0107/mm/memcontrol.c
===================================================================
--- mmotm-0107.orig/mm/memcontrol.c
+++ mmotm-0107/mm/memcontrol.c
@@ -1099,14 +1099,14 @@ unsigned long mem_cgroup_isolate_pages(u
 #define mem_cgroup_from_res_counter(counter, member)	\
 	container_of(counter, struct mem_cgroup, member)
 
-static bool mem_cgroup_check_under_limit(struct mem_cgroup *mem)
+static bool mem_cgroup_check_under_limit(struct mem_cgroup *mem, int page_size)
 {
 	if (do_swap_account) {
-		if (res_counter_check_under_limit(&mem->res) &&
-			res_counter_check_under_limit(&mem->memsw))
+		if (res_counter_check_margin(&mem->res) >= page_size &&
+			res_counter_check_margin(&mem->memsw) >= page_size)
 			return true;
 	} else
-		if (res_counter_check_under_limit(&mem->res))
+		if (res_counter_check_margin(&mem->res) >= page_size)
 			return true;
 	return false;
 }
@@ -1367,7 +1367,8 @@ mem_cgroup_select_victim(struct mem_cgro
 static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 						struct zone *zone,
 						gfp_t gfp_mask,
-						unsigned long reclaim_options)
+						unsigned long reclaim_options,
+						int page_size)
 {
 	struct mem_cgroup *victim;
 	int ret, total = 0;
@@ -1434,7 +1435,7 @@ static int mem_cgroup_hierarchical_recla
 		if (check_soft) {
 			if (res_counter_check_under_soft_limit(&root_mem->res))
 				return total;
-		} else if (mem_cgroup_check_under_limit(root_mem))
+		} else if (mem_cgroup_check_under_limit(root_mem, page_size))
 			return 1 + total;
 	}
 	return total;
@@ -1844,7 +1845,7 @@ static int __mem_cgroup_do_charge(struct
 		return CHARGE_WOULDBLOCK;
 
 	ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
-					gfp_mask, flags);
+					gfp_mask, flags, csize);
 	/*
 	 * try_to_free_mem_cgroup_pages() might not give us a full
 	 * picture of reclaim. Some pages are reclaimed and might be
@@ -1852,7 +1853,7 @@ static int __mem_cgroup_do_charge(struct
 	 * Check the limit again to see if the reclaim reduced the
 	 * current usage of the cgroup before giving up
 	 */
-	if (ret || mem_cgroup_check_under_limit(mem_over_limit))
+	if (ret || mem_cgroup_check_under_limit(mem_over_limit, csize))
 		return CHARGE_RETRY;
 
 	/*
@@ -3058,7 +3059,7 @@ static int mem_cgroup_resize_limit(struc
 			break;
 
 		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
-						MEM_CGROUP_RECLAIM_SHRINK);
+					MEM_CGROUP_RECLAIM_SHRINK, PAGE_SIZE);
 		curusage = res_counter_read_u64(&memcg->res, RES_USAGE);
 		/* Usage is reduced ? */
   		if (curusage >= oldusage)
@@ -3117,8 +3118,8 @@ static int mem_cgroup_resize_memsw_limit
 			break;
 
 		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
-						MEM_CGROUP_RECLAIM_NOSWAP |
-						MEM_CGROUP_RECLAIM_SHRINK);
+			MEM_CGROUP_RECLAIM_NOSWAP | MEM_CGROUP_RECLAIM_SHRINK,
+			PAGE_SIZE);
 		curusage = res_counter_read_u64(&memcg->memsw, RES_USAGE);
 		/* Usage is reduced ? */
 		if (curusage >= oldusage)
@@ -3159,8 +3160,8 @@ unsigned long mem_cgroup_soft_limit_recl
 			break;
 
 		reclaimed = mem_cgroup_hierarchical_reclaim(mz->mem, zone,
-						gfp_mask,
-						MEM_CGROUP_RECLAIM_SOFT);
+					gfp_mask,
+					MEM_CGROUP_RECLAIM_SOFT, PAGE_SIZE);
 		nr_reclaimed += reclaimed;
 		spin_lock(&mctz->lock);
 



* [PATCH 4/7] memcg : fix charge function of THP allocation.
From: KAMEZAWA Hiroyuki @ 2011-01-21  6:44 UTC
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, nishimura, hannes, balbir, akpm

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

When THP is used, a hugepage-sized charge can happen, and it's not
handled correctly in mem_cgroup_do_charge(). For example, THP can fall
back to small-page allocation when HUGEPAGE allocation seems difficult
or the system is busy, but the memory cgroup doesn't understand this
and continues to try HUGEPAGE charging. Worst of all, the memory
cgroup believes 'memory reclaim succeeded' whenever
limit - usage > PAGE_SIZE.

Because of this, khugepaged etc. can go into an infinite reclaim loop
if tasks in the memcg are busy.
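
A concrete illustration of the loop (the numbers are hypothetical):
with limit = 200MB and usage = 199.9MB, a THP charge needs HPAGE_SIZE
= 2MB. Reclaim frees a single 4KB page, so limit - usage becomes about
104KB, which is > PAGE_SIZE; the old check declares the reclaim a
success, the 2MB charge fails again, and the cycle repeats.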

After this patch:
 - hugepage allocation will fail if the first trial of page reclaim fails.
 - THP allocation is distinguished from batched allocation.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   51 +++++++++++++++++++++++++++++++++++----------------
 1 file changed, 35 insertions(+), 16 deletions(-)

Index: mmotm-0107/mm/memcontrol.c
===================================================================
--- mmotm-0107.orig/mm/memcontrol.c
+++ mmotm-0107/mm/memcontrol.c
@@ -1812,24 +1812,25 @@ enum {
 	CHARGE_OK,		/* success */
 	CHARGE_RETRY,		/* need to retry but retry is not bad */
 	CHARGE_NOMEM,		/* we can't do more. return -ENOMEM */
+	CHARGE_NEED_BREAK,	/* big size allocation failure */
 	CHARGE_WOULDBLOCK,	/* GFP_WAIT wasn't set and no enough res. */
 	CHARGE_OOM_DIE,		/* the current is killed because of OOM */
 };
 
 static int __mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
-				int csize, bool oom_check)
+			int page_size, bool do_reclaim, bool oom_check)
 {
 	struct mem_cgroup *mem_over_limit;
 	struct res_counter *fail_res;
 	unsigned long flags = 0;
 	int ret;
 
-	ret = res_counter_charge(&mem->res, csize, &fail_res);
+	ret = res_counter_charge(&mem->res, page_size, &fail_res);
 
 	if (likely(!ret)) {
 		if (!do_swap_account)
 			return CHARGE_OK;
-		ret = res_counter_charge(&mem->memsw, csize, &fail_res);
+		ret = res_counter_charge(&mem->memsw, page_size, &fail_res);
 		if (likely(!ret))
 			return CHARGE_OK;
 
@@ -1838,14 +1839,14 @@ static int __mem_cgroup_do_charge(struct
 	} else
 		mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
 
-	if (csize > PAGE_SIZE) /* change csize and retry */
+	if (!do_reclaim)
 		return CHARGE_RETRY;
 
 	if (!(gfp_mask & __GFP_WAIT))
 		return CHARGE_WOULDBLOCK;
 
 	ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
-					gfp_mask, flags, csize);
+					gfp_mask, flags, page_size);
 	/*
 	 * try_to_free_mem_cgroup_pages() might not give us a full
 	 * picture of reclaim. Some pages are reclaimed and might be
@@ -1853,19 +1854,28 @@ static int __mem_cgroup_do_charge(struct
 	 * Check the limit again to see if the reclaim reduced the
 	 * current usage of the cgroup before giving up
 	 */
-	if (ret || mem_cgroup_check_under_limit(mem_over_limit, csize))
+	if (ret || mem_cgroup_check_under_limit(mem_over_limit, page_size))
 		return CHARGE_RETRY;
 
 	/*
+	 * When page_size > PAGE_SIZE, THP calls this function and it's
+	 * ok to tell 'there are not enough pages for hugepage'. THP will
+	 * fallback into PAGE_SIZE allocation. If we do reclaim eagerly,
+	 * page splitting will occur and it seems much worse.
+	 */
+	if (page_size > PAGE_SIZE)
+		return CHARGE_NEED_BREAK;
+
+	/*
 	 * At task move, charge accounts can be doubly counted. So, it's
 	 * better to wait until the end of task_move if something is going on.
 	 */
 	if (mem_cgroup_wait_acct_move(mem_over_limit))
 		return CHARGE_RETRY;
-
 	/* If we don't need to call oom-killer at el, return immediately */
 	if (!oom_check)
 		return CHARGE_NOMEM;
+
 	/* check OOM */
 	if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask))
 		return CHARGE_OOM_DIE;
@@ -1885,7 +1895,7 @@ static int __mem_cgroup_try_charge(struc
 	int nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
 	struct mem_cgroup *mem = NULL;
 	int ret;
-	int csize = max(CHARGE_SIZE, (unsigned long) page_size);
+	bool use_pcp_cache = (page_size == PAGE_SIZE);
 
 	/*
 	 * Unlike gloval-vm's OOM-kill, we're not in memory shortage
@@ -1910,7 +1920,7 @@ again:
 		VM_BUG_ON(css_is_removed(&mem->css));
 		if (mem_cgroup_is_root(mem))
 			goto done;
-		if (page_size == PAGE_SIZE && consume_stock(mem))
+		if (use_pcp_cache && consume_stock(mem))
 			goto done;
 		css_get(&mem->css);
 	} else {
@@ -1933,7 +1943,7 @@ again:
 			rcu_read_unlock();
 			goto done;
 		}
-		if (page_size == PAGE_SIZE && consume_stock(mem)) {
+		if (use_pcp_cache && consume_stock(mem)) {
 			/*
 			 * It seems dagerous to access memcg without css_get().
 			 * But considering how consume_stok works, it's not
@@ -1967,17 +1977,26 @@ again:
 			oom_check = true;
 			nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
 		}
-
-		ret = __mem_cgroup_do_charge(mem, gfp_mask, csize, oom_check);
+		if (use_pcp_cache)
+			ret = __mem_cgroup_do_charge(mem, gfp_mask,
+					CHARGE_SIZE, false, oom_check);
+		else
+			ret = __mem_cgroup_do_charge(mem, gfp_mask,
+					page_size, true, oom_check);
 
 		switch (ret) {
 		case CHARGE_OK:
 			break;
 		case CHARGE_RETRY: /* not in OOM situation but retry */
-			csize = page_size;
+			if (use_pcp_cache)/* need to reclaim pages */
+				use_pcp_cache = false;
 			css_put(&mem->css);
 			mem = NULL;
 			goto again;
+		case CHARGE_NEED_BREAK: /* page_size > PAGE_SIZE */
+			css_put(&mem->css);
+			/* returning failure doesn't mean OOM for hugepages */
+			goto nomem;
 		case CHARGE_WOULDBLOCK: /* !__GFP_WAIT */
 			css_put(&mem->css);
 			goto nomem;
@@ -1994,9 +2013,9 @@ again:
 			goto bypass;
 		}
 	} while (ret != CHARGE_OK);
-
-	if (csize > page_size)
-		refill_stock(mem, csize - page_size);
+	/* This flag is cleared when we fail a CHARGE_SIZE charge. */
+	if (use_pcp_cache)
+		refill_stock(mem, CHARGE_SIZE - page_size);
 	css_put(&mem->css);
 done:
 	*memcg = mem;



* [PATCH 5/7] memcg : fix khugepaged scan of process under busy memcg
From: KAMEZAWA Hiroyuki @ 2011-01-21  6:46 UTC
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, nishimura, hannes, balbir, akpm

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

When using khugepaged with a small memory cgroup, we see khugepaged
cause soft lockups, or processes running under the memcg hang.

This is because khugepaged tries to scan all pmds of a process which
is under a busy/small memory cgroup and tries to allocate a
HUGEPAGE-sized resource.

This work is done under mmap_sem and can cause memory reclaim
repeatedly. This easily raises the cpu usage of khugepaged, and the
latency of the scanned process goes up. Moreover, successfully working
transparent hugepages may be split by this memory reclaim caused by
khugepaged.

This patch adds a hint for khugepaged about whether a process is
under a memory cgroup which has sufficient memory. If the memcg
seems busy, the process is skipped.
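
The 'busy' hint is built on the cgroup's charge-failure counter; a
worked example with hypothetical numbers: suppose recent_failcnt was
stored as 12 on the previous visit to this mm. If RES_FAILCNT now
reads 15, some charge has hit the limit since then, so the memcg is
treated as busy and the mm is skipped; if it still reads 12 or less,
nothing failed in between and scanning proceeds. The stored value is
then set to the current failcnt + 1, because this thread's own probe
charge is expected to fail once (see the mm/memcontrol.c hunk below).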

How to test:
  # mount -t cgroup cgroup /cgroup/memory -o memory
  # mkdir /cgroup/memory/A
  # echo 200M (or some other small value) > /cgroup/memory/A/memory.limit_in_bytes
  # echo 0 > /cgroup/memory/A/tasks
  # make -j 8 kernel

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h |    7 +++++++
 mm/huge_memory.c           |   11 ++++++++++-
 mm/memcontrol.c            |   44 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 61 insertions(+), 1 deletion(-)

Index: mmotm-0107/mm/memcontrol.c
===================================================================
--- mmotm-0107.orig/mm/memcontrol.c
+++ mmotm-0107/mm/memcontrol.c
@@ -255,6 +255,9 @@ struct mem_cgroup {
 	/* For oom notifier event fd */
 	struct list_head oom_notify;
 
+	/* For transparent hugepage daemon */
+	unsigned long long recent_failcnt;
+
 	/*
 	 * Should we move charges of a task when a task is moved into this
 	 * mem_cgroup ? And what type of charges should we move ?
@@ -2190,6 +2193,47 @@ void mem_cgroup_split_huge_fixup(struct 
 	tail_pc->flags = head_pc->flags & ~PCGF_NOCOPY_AT_SPLIT;
 	move_unlock_page_cgroup(head_pc, &flags);
 }
+
+bool mem_cgroup_worth_try_hugepage_scan(struct mm_struct *mm)
+{
+	struct mem_cgroup *mem;
+	bool ret = true;
+	u64 recent_charge_fail;
+
+	if (mem_cgroup_disabled())
+		return true;
+
+	mem = try_get_mem_cgroup_from_mm(mm);
+
+	if (!mem)
+		return true;
+
+	if (mem_cgroup_is_root(mem))
+		goto out;
+
+	if (mem_cgroup_check_under_limit(mem, HPAGE_SIZE))
+		goto out;
+	/*
+	 * When the memory cgroup is nearly full, memory must be reclaimed
+	 * for collapsing. This requirement of an 'extra charge' at
+	 * splitting seems redundant, but it is a safe way for now.
+	 *
+	 * We return true when no one has hit the limit since we last
+	 * visited this mm.
+	 *
+	 * TODO: This check is very naive. Something better should be invented.
+	 */
+	recent_charge_fail = res_counter_read_u64(&mem->res, RES_FAILCNT);
+	if (mem->recent_failcnt
+		&& recent_charge_fail > mem->recent_failcnt) {
+		ret = false;
+	}
+	/* +1 because this thread itself will fail one charge. */
+	mem->recent_failcnt = recent_charge_fail + 1;
+out:
+	css_put(&mem->css);
+	return ret;
+}
+
 #endif
 
 /**
Index: mmotm-0107/mm/huge_memory.c
===================================================================
--- mmotm-0107.orig/mm/huge_memory.c
+++ mmotm-0107/mm/huge_memory.c
@@ -2007,11 +2007,14 @@ static unsigned int khugepaged_scan_mm_s
 	spin_unlock(&khugepaged_mm_lock);
 
 	mm = mm_slot->mm;
+
 	down_read(&mm->mmap_sem);
 	if (unlikely(khugepaged_test_exit(mm)))
 		vma = NULL;
-	else
+	else if (mem_cgroup_worth_try_hugepage_scan(mm))
 		vma = find_vma(mm, khugepaged_scan.address);
+	else
+		vma = NULL;
 
 	progress++;
 	for (; vma; vma = vma->vm_next) {
@@ -2023,6 +2026,12 @@ static unsigned int khugepaged_scan_mm_s
 			break;
 		}
 
+		if (unlikely(!mem_cgroup_worth_try_hugepage_scan(mm))) {
+			progress++;
+			vma = NULL; /* try next mm */
+			break;
+		}
+
 		if ((!(vma->vm_flags & VM_HUGEPAGE) &&
 		     !khugepaged_always()) ||
 		    (vma->vm_flags & VM_NOHUGEPAGE)) {
Index: mmotm-0107/include/linux/memcontrol.h
===================================================================
--- mmotm-0107.orig/include/linux/memcontrol.h
+++ mmotm-0107/include/linux/memcontrol.h
@@ -148,6 +148,7 @@ u64 mem_cgroup_get_limit(struct mem_cgro
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail);
+bool mem_cgroup_worth_try_hugepage_scan(struct mm_struct *mm);
 #endif
 
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
@@ -341,6 +342,12 @@ u64 mem_cgroup_get_limit(struct mem_cgro
 
 static inline mem_cgroup_split_huge_fixup(struct page *head, struct page *tail)
 {
+
+}
+
+static inline bool mem_cgroup_worth_try_hugepage_scan(struct mm_struct *mm)
+{
+	return true;
 }
 
 #endif /* CONFIG_CGROUP_MEM_CONT */



* [PATCH 6/7] memcg : use better variable name
From: KAMEZAWA Hiroyuki @ 2011-01-21  6:49 UTC
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, nishimura, hannes, balbir, akpm

Rename 'charge_size' et al. to 'page_size'. Then,

  nr_pages = page_size >> PAGE_SHIFT

seems natural.
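
For example (with the common x86 values, PAGE_SHIFT = 12): a 4KB base
page gives nr_pages = 4096 >> 12 = 1, and a 2MB THP gives
nr_pages = 2097152 >> 12 = 512.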

This patch renames

 charge_size -> page_size
 count -> nr_pages

etc.



Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

Index: mmotm-0107/mm/memcontrol.c
===================================================================
--- mmotm-0107.orig/mm/memcontrol.c
+++ mmotm-0107/mm/memcontrol.c
@@ -1705,7 +1705,7 @@ static void drain_local_stock(struct wor
  * Cache charges(val) which is from res_counter, to local per_cpu area.
  * This will be consumed by consume_stock() function, later.
  */
-static void refill_stock(struct mem_cgroup *mem, int val)
+static void refill_stock(struct mem_cgroup *mem, int page_size)
 {
 	struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock);
 
@@ -1713,7 +1713,7 @@ static void refill_stock(struct mem_cgro
 		drain_stock(stock);
 		stock->cached = mem;
 	}
-	stock->charge += val;
+	stock->charge += page_size;
 	put_cpu_var(memcg_stock);
 }
 
@@ -2037,12 +2037,12 @@ bypass:
  * gotten by try_charge().
  */
 static void __mem_cgroup_cancel_charge(struct mem_cgroup *mem,
-							unsigned long count)
+					unsigned long nr_pages)
 {
 	if (!mem_cgroup_is_root(mem)) {
-		res_counter_uncharge(&mem->res, PAGE_SIZE * count);
+		res_counter_uncharge(&mem->res, PAGE_SIZE * nr_pages);
 		if (do_swap_account)
-			res_counter_uncharge(&mem->memsw, PAGE_SIZE * count);
+			res_counter_uncharge(&mem->memsw, PAGE_SIZE * nr_pages);
 	}
 }
 
@@ -2255,9 +2255,9 @@ out:
 
 static void __mem_cgroup_move_account(struct page_cgroup *pc,
 	struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge,
-	int charge_size)
+	int page_size)
 {
-	int nr_pages = charge_size >> PAGE_SHIFT;
+	int nr_pages = page_size >> PAGE_SHIFT;
 
 	VM_BUG_ON(from == to);
 	VM_BUG_ON(PageLRU(pc->page));
@@ -2275,7 +2275,7 @@ static void __mem_cgroup_move_account(st
 	mem_cgroup_charge_statistics(from, PageCgroupCache(pc), -nr_pages);
 	if (uncharge)
 		/* This is not "cancel", but cancel_charge does all we need. */
-		mem_cgroup_cancel_charge(from, charge_size);
+		mem_cgroup_cancel_charge(from, page_size);
 
 	/* caller should have done css_get */
 	pc->mem_cgroup = to;
@@ -2295,18 +2295,18 @@ static void __mem_cgroup_move_account(st
  */
 static int mem_cgroup_move_account(struct page_cgroup *pc,
 		struct mem_cgroup *from, struct mem_cgroup *to,
-		bool uncharge, int charge_size)
+		bool uncharge, int page_size)
 {
 	int ret = -EINVAL;
 	unsigned long flags;
 
-	if ((charge_size > PAGE_SIZE) && !PageTransHuge(pc->page))
+	if ((page_size > PAGE_SIZE) && !PageTransHuge(pc->page))
 		return -EBUSY;
 
 	lock_page_cgroup(pc);
 	if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
 		move_lock_page_cgroup(pc, &flags);
-		__mem_cgroup_move_account(pc, from, to, uncharge, charge_size);
+		__mem_cgroup_move_account(pc, from, to, uncharge, page_size);
 		move_unlock_page_cgroup(pc, &flags);
 		ret = 0;
 	}
@@ -2645,7 +2645,7 @@ direct_uncharge:
 static struct mem_cgroup *
 __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype)
 {
-	int count;
+	int nr_pages;
 	struct page_cgroup *pc;
 	struct mem_cgroup *mem = NULL;
 	int page_size = PAGE_SIZE;
@@ -2661,7 +2661,7 @@ __mem_cgroup_uncharge_common(struct page
 		VM_BUG_ON(!PageTransHuge(page));
 	}
 
-	count = page_size >> PAGE_SHIFT;
+	nr_pages = page_size >> PAGE_SHIFT;
 	/*
 	 * Check if our page_cgroup is valid
 	 */
@@ -2694,7 +2694,7 @@ __mem_cgroup_uncharge_common(struct page
 		break;
 	}
 
-	mem_cgroup_charge_statistics(mem, PageCgroupCache(pc), -count);
+	mem_cgroup_charge_statistics(mem, PageCgroupCache(pc), -nr_pages);
 
 	ClearPageCgroupUsed(pc);
 	/*



* [PATCH 7/7] memcg : remove ugly variable initialization by callers
From: KAMEZAWA Hiroyuki @ 2011-01-21  6:50 UTC
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, nishimura, hannes, balbir, akpm

This is a promised one.
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

This patch removes initialization in the callers of memory cgroup
functions. Some memory cgroup code uses the following style to carry
the result of a start function to the end function, to avoid races.

   mem_cgroup_start_A(&(*ptr))
   /* Something very complicated can happen here. */
   mem_cgroup_end_A(*ptr)

In some calls, *ptr must be initialized to NULL by the caller, but
that's ugly. This patch makes the _start function initialize *ptr
itself.
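
A minimal before/after sketch of the caller pattern (the
mem_cgroup_start_A()/mem_cgroup_end_A() names are placeholders, as
above):

	/* before: every caller must remember the initializer */
	struct mem_cgroup *mem = NULL;
	mem_cgroup_start_A(&mem);
	/* ... something very complicated can happen here ... */
	mem_cgroup_end_A(mem);

	/* after: the _start function sets *ptr = NULL itself */
	struct mem_cgroup *mem;
	mem_cgroup_start_A(&mem);
	/* ... something very complicated can happen here ... */
	mem_cgroup_end_A(mem);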

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   10 ++++++++--
 mm/memory.c     |    2 +-
 mm/migrate.c    |    2 +-
 mm/swapfile.c   |    2 +-
 4 files changed, 11 insertions(+), 5 deletions(-)

Index: mmotm-0107/mm/memcontrol.c
===================================================================
--- mmotm-0107.orig/mm/memcontrol.c
+++ mmotm-0107/mm/memcontrol.c
@@ -2469,7 +2469,7 @@ int mem_cgroup_cache_charge(struct page 
 
 	/* shmem */
 	if (PageSwapCache(page)) {
-		struct mem_cgroup *mem = NULL;
+		struct mem_cgroup *mem;
 
 		ret = mem_cgroup_try_charge_swapin(mm, page, gfp_mask, &mem);
 		if (!ret)
@@ -2495,6 +2495,9 @@ int mem_cgroup_try_charge_swapin(struct 
 	struct mem_cgroup *mem;
 	int ret;
 
+	/* *ptr is used for checking caller needs to call commit */
+	*ptr = NULL;
+
 	if (mem_cgroup_disabled())
 		return 0;
 
@@ -2910,6 +2913,9 @@ int mem_cgroup_prepare_migration(struct 
 	enum charge_type ctype;
 	int ret = 0;
 
+	/* *ptr is used by caller to check end_migration() should be called.*/
+	*ptr = NULL;
+
 	VM_BUG_ON(PageTransHuge(page));
 	if (mem_cgroup_disabled())
 		return 0;
@@ -3052,7 +3058,7 @@ int mem_cgroup_shmem_charge_fallback(str
 			    struct mm_struct *mm,
 			    gfp_t gfp_mask)
 {
-	struct mem_cgroup *mem = NULL;
+	struct mem_cgroup *mem;
 	int ret;
 
 	if (mem_cgroup_disabled())
Index: mmotm-0107/mm/migrate.c
===================================================================
--- mmotm-0107.orig/mm/migrate.c
+++ mmotm-0107/mm/migrate.c
@@ -624,7 +624,7 @@ static int unmap_and_move(new_page_t get
 	int remap_swapcache = 1;
 	int rcu_locked = 0;
 	int charge = 0;
-	struct mem_cgroup *mem = NULL;
+	struct mem_cgroup *mem;
 	struct anon_vma *anon_vma = NULL;
 
 	if (!newpage)
Index: mmotm-0107/mm/memory.c
===================================================================
--- mmotm-0107.orig/mm/memory.c
+++ mmotm-0107/mm/memory.c
@@ -2729,7 +2729,7 @@ static int do_swap_page(struct mm_struct
 	swp_entry_t entry;
 	pte_t pte;
 	int locked;
-	struct mem_cgroup *ptr = NULL;
+	struct mem_cgroup *ptr;
 	int exclusive = 0;
 	int ret = 0;
 
Index: mmotm-0107/mm/swapfile.c
===================================================================
--- mmotm-0107.orig/mm/swapfile.c
+++ mmotm-0107/mm/swapfile.c
@@ -880,7 +880,7 @@ unsigned int count_swap_pages(int type, 
 static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, swp_entry_t entry, struct page *page)
 {
-	struct mem_cgroup *ptr = NULL;
+	struct mem_cgroup *ptr;
 	spinlock_t *ptl;
 	pte_t *pte;
 	int ret = 1;



* Re: [PATCH 1/7] memcg : comment, style fixes for recent patch of move_parent
  2011-01-21  6:37   ` KAMEZAWA Hiroyuki
@ 2011-01-21  7:16     ` Daisuke Nishimura
  -1 siblings, 0 replies; 74+ messages in thread
From: Daisuke Nishimura @ 2011-01-21  7:16 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, hannes, balbir, akpm, Daisuke Nishimura

On Fri, 21 Jan 2011 15:37:26 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> A fix for 987eba66e0e6aa654d60881a14731a353ee0acb4
> 
> A clean up for mem_cgroup_move_parent(). 
>  - remove unnecessary initialization of local variable.
>  - rename charge_size -> page_size
>  - remove unnecessary (wrong) comment.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/7] memcg : more fixes and clean up for 2.6.28-rc
  2011-01-21  6:39   ` KAMEZAWA Hiroyuki
@ 2011-01-21  7:17     ` Daisuke Nishimura
  -1 siblings, 0 replies; 74+ messages in thread
From: Daisuke Nishimura @ 2011-01-21  7:17 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, hannes, balbir, akpm, Daisuke Nishimura

On Fri, 21 Jan 2011 15:39:28 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> This is a fix for
> ca3e021417eed30ec2b64ce88eb0acf64aa9bc29
> 
> mem_cgroup_disabled() should be checked at splitting.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

You need to change the subject :)
But anyway,

Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 3/7] memcg : fix mem_cgroup_check_under_limit
  2011-01-21  6:41   ` KAMEZAWA Hiroyuki
@ 2011-01-21  7:45     ` Daisuke Nishimura
  -1 siblings, 0 replies; 74+ messages in thread
From: Daisuke Nishimura @ 2011-01-21  7:45 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, hannes, balbir, akpm, Daisuke Nishimura

On Fri, 21 Jan 2011 15:41:41 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Current memory cgroup code tends to assume page_size == PAGE_SIZE,
> but THP charges HPAGE_SIZE.
> 
> This is one of the fixes for supporting THP. It modifies
> mem_cgroup_check_under_limit to take page_size into account.
> 
> The full fixes for do_charge()/memory reclaim will follow this patch.
> 
> TODO: with this, the reclaim function can take page_size as an argument,
> so there may be something that should be improved.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 4/7] memcg : fix charge function of THP allocation.
  2011-01-21  6:44   ` KAMEZAWA Hiroyuki
@ 2011-01-21  8:48     ` Daisuke Nishimura
  -1 siblings, 0 replies; 74+ messages in thread
From: Daisuke Nishimura @ 2011-01-21  8:48 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, hannes, balbir, akpm, Daisuke Nishimura

On Fri, 21 Jan 2011 15:44:30 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> When THP is used, hugepage-sized charges can happen. They are not handled
> correctly in mem_cgroup_do_charge(). For example, THP can fall back
> to small-page allocation when a HUGEPAGE allocation seems difficult
> or the system is busy, but the memory cgroup doesn't understand this and
> continues to try HUGEPAGE charging. Worst of all, the memory cgroup
> believes 'memory reclaim succeeded' whenever limit - usage > PAGE_SIZE.
> 
> Because of this, khugepaged etc. can go into an infinite reclaim loop
> if tasks in the memcg are busy.
> 
> After this patch:
>  - Hugepage allocation will fail if the first trial of page reclaim fails.
>  - THP allocation is distinguished from batched allocation.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  mm/memcontrol.c |   51 +++++++++++++++++++++++++++++++++++----------------
>  1 file changed, 35 insertions(+), 16 deletions(-)
> 
> Index: mmotm-0107/mm/memcontrol.c
> ===================================================================
> --- mmotm-0107.orig/mm/memcontrol.c
> +++ mmotm-0107/mm/memcontrol.c
> @@ -1812,24 +1812,25 @@ enum {
>  	CHARGE_OK,		/* success */
>  	CHARGE_RETRY,		/* need to retry but retry is not bad */
>  	CHARGE_NOMEM,		/* we can't do more. return -ENOMEM */
> +	CHARGE_NEED_BREAK,	/* big size allocation failure */
>  	CHARGE_WOULDBLOCK,	/* GFP_WAIT wasn't set and no enough res. */
>  	CHARGE_OOM_DIE,		/* the current is killed because of OOM */
>  };
>  
>  static int __mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
> -				int csize, bool oom_check)
> +			int page_size, bool do_reclaim, bool oom_check)

I'm sorry, I can't understand why we need 'do_reclaim'. See below.

>  {
>  	struct mem_cgroup *mem_over_limit;
>  	struct res_counter *fail_res;
>  	unsigned long flags = 0;
>  	int ret;
>  
> -	ret = res_counter_charge(&mem->res, csize, &fail_res);
> +	ret = res_counter_charge(&mem->res, page_size, &fail_res);
>  
>  	if (likely(!ret)) {
>  		if (!do_swap_account)
>  			return CHARGE_OK;
> -		ret = res_counter_charge(&mem->memsw, csize, &fail_res);
> +		ret = res_counter_charge(&mem->memsw, page_size, &fail_res);
>  		if (likely(!ret))
>  			return CHARGE_OK;
>  
> @@ -1838,14 +1839,14 @@ static int __mem_cgroup_do_charge(struct
>  	} else
>  		mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
>  
> -	if (csize > PAGE_SIZE) /* change csize and retry */
> +	if (!do_reclaim)
>  		return CHARGE_RETRY;
>  

From the very beginning, do we need this "CHARGE_RETRY"?

>  	if (!(gfp_mask & __GFP_WAIT))
>  		return CHARGE_WOULDBLOCK;
>  
>  	ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
> -					gfp_mask, flags, csize);
> +					gfp_mask, flags, page_size);
>  	/*
>  	 * try_to_free_mem_cgroup_pages() might not give us a full
>  	 * picture of reclaim. Some pages are reclaimed and might be
> @@ -1853,19 +1854,28 @@ static int __mem_cgroup_do_charge(struct
>  	 * Check the limit again to see if the reclaim reduced the
>  	 * current usage of the cgroup before giving up
>  	 */
> -	if (ret || mem_cgroup_check_under_limit(mem_over_limit, csize))
> +	if (ret || mem_cgroup_check_under_limit(mem_over_limit, page_size))
>  		return CHARGE_RETRY;
>  
>  	/*
> +	 * When page_size > PAGE_SIZE, THP calls this function and it's
> +	 * ok to tell 'there are not enough pages for hugepage'. THP will
> +	 * fallback into PAGE_SIZE allocation. If we do reclaim eagerly,
> +	 * fall back to PAGE_SIZE allocation. If we do reclaim eagerly,
> +	 */
> +	if (page_size > PAGE_SIZE)
> +		return CHARGE_NEED_BREAK;
> +
> +	/*
>  	 * At task move, charge accounts can be doubly counted. So, it's
>  	 * better to wait until the end of task_move if something is going on.
>  	 */
>  	if (mem_cgroup_wait_acct_move(mem_over_limit))
>  		return CHARGE_RETRY;
> -
>  	/* If we don't need to call oom-killer at el, return immediately */
>  	if (!oom_check)
>  		return CHARGE_NOMEM;
> +
>  	/* check OOM */
>  	if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask))
>  		return CHARGE_OOM_DIE;
> @@ -1885,7 +1895,7 @@ static int __mem_cgroup_try_charge(struc
>  	int nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
>  	struct mem_cgroup *mem = NULL;
>  	int ret;
> -	int csize = max(CHARGE_SIZE, (unsigned long) page_size);
> +	bool use_pcp_cache = (page_size == PAGE_SIZE);
>  
>  	/*
>  	 * Unlike gloval-vm's OOM-kill, we're not in memory shortage
> @@ -1910,7 +1920,7 @@ again:
>  		VM_BUG_ON(css_is_removed(&mem->css));
>  		if (mem_cgroup_is_root(mem))
>  			goto done;
> -		if (page_size == PAGE_SIZE && consume_stock(mem))
> +		if (use_pcp_cache && consume_stock(mem))
>  			goto done;
>  		css_get(&mem->css);
>  	} else {
> @@ -1933,7 +1943,7 @@ again:
>  			rcu_read_unlock();
>  			goto done;
>  		}
> -		if (page_size == PAGE_SIZE && consume_stock(mem)) {
> +		if (use_pcp_cache && consume_stock(mem)) {
>  			/*
>  			 * It seems dagerous to access memcg without css_get().
>  			 * But considering how consume_stok works, it's not
> @@ -1967,17 +1977,26 @@ again:
>  			oom_check = true;
>  			nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
>  		}
> -
> -		ret = __mem_cgroup_do_charge(mem, gfp_mask, csize, oom_check);
> +		if (use_pcp_cache)
> +			ret = __mem_cgroup_do_charge(mem, gfp_mask,
> +					CHARGE_SIZE, false, oom_check);
> +		else
> +			ret = __mem_cgroup_do_charge(mem, gfp_mask,
> +					page_size, true, oom_check);
>  

Hmm, this confuses me. I thought 'use_pcp_cache' would be used to decide
whether we should do consume_stock() or not, so why do we change the charge
size and reclaim behavior depending on it? I think the code itself is right,
but the name 'use_pcp_cache' confused me.
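
To show what I mean with one explicit size variable, a rough sketch reusing
the names from your patch (just my guess at an alternative, untested):

   bool use_pcp_cache = (page_size == PAGE_SIZE);
   unsigned long csize = use_pcp_cache ? CHARGE_SIZE : page_size;

   ret = __mem_cgroup_do_charge(mem, gfp_mask, csize,
                                csize == page_size /* do_reclaim */,
                                oom_check);
   /* on CHARGE_RETRY: use_pcp_cache = false and csize = page_size */

Then 'use_pcp_cache' would only answer "may we use the per-cpu stock?".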


>  		switch (ret) {
>  		case CHARGE_OK:
>  			break;
>  		case CHARGE_RETRY: /* not in OOM situation but retry */
> -			csize = page_size;
> +			if (use_pcp_cache)/* need to reclaim pages */
> +				use_pcp_cache = false;
>  			css_put(&mem->css);
>  			mem = NULL;
>  			goto again;
> +		case CHARGE_NEED_BREAK: /* page_size > PAGE_SIZE */
> +			css_put(&mem->css);
> +			/* returning failure doesn't mean OOM for hugepages */
> +			goto nomem;

I like this change.

>  		case CHARGE_WOULDBLOCK: /* !__GFP_WAIT */
>  			css_put(&mem->css);
>  			goto nomem;
> @@ -1994,9 +2013,9 @@ again:
>  			goto bypass;
>  		}
>  	} while (ret != CHARGE_OK);
> -
> -	if (csize > page_size)
> -		refill_stock(mem, csize - page_size);
> +	/* This flag is cleared when we fail CHARGE_SIZE charge. */
> +	if (use_pcp_cache)
> +		refill_stock(mem, CHARGE_SIZE - page_size);

Ditto. Can't we keep 'csize' and the old code here?

>  	css_put(&mem->css);
>  done:
>  	*memcg = mem;
> 


Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 7/7] memcg : remove ugly vairable initialization by callers
  2011-01-21  6:50   ` KAMEZAWA Hiroyuki
@ 2011-01-21  9:17     ` Daisuke Nishimura
  -1 siblings, 0 replies; 74+ messages in thread
From: Daisuke Nishimura @ 2011-01-21  9:17 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, hannes, balbir, akpm, Daisuke Nishimura

On Fri, 21 Jan 2011 15:50:51 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> This is a promised one.
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> This patch is for removing initialization in the callers of memory cgroup
> functions. Some memory cgroup functions use the following style to bring
> the result of a start function to the matching end function, to avoid races.
> 
>    mem_cgroup_start_A(&(*ptr))
>    /* Something very complicated can happen here. */
>    mem_cgroup_end_A(*ptr)
> 
> In some calls, *ptr must be initialized to NULL by the caller, which
> is ugly. This patch changes the code so that *ptr is initialized by
> the _start function instead.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 4/7] memcg : fix charge function of THP allocation.
  2011-01-21  8:48     ` Daisuke Nishimura
@ 2011-01-24  0:14       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 74+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-24  0:14 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: linux-mm, linux-kernel, hannes, balbir, akpm

On Fri, 21 Jan 2011 17:48:18 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Fri, 21 Jan 2011 15:44:30 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > When THP is used, hugepage-sized charges can happen. They are not handled
> > correctly in mem_cgroup_do_charge(). For example, THP can fall back
> > to small-page allocation when a HUGEPAGE allocation seems difficult
> > or the system is busy, but the memory cgroup doesn't understand this and
> > continues to try HUGEPAGE charging. Worst of all, the memory cgroup
> > believes 'memory reclaim succeeded' whenever limit - usage > PAGE_SIZE.
> > 
> > Because of this, khugepaged etc. can go into an infinite reclaim loop
> > if tasks in the memcg are busy.
> > 
> > After this patch:
> >  - Hugepage allocation will fail if the first trial of page reclaim fails.
> >  - THP allocation is distinguished from batched allocation.
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  mm/memcontrol.c |   51 +++++++++++++++++++++++++++++++++++----------------
> >  1 file changed, 35 insertions(+), 16 deletions(-)
> > 
> > Index: mmotm-0107/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-0107.orig/mm/memcontrol.c
> > +++ mmotm-0107/mm/memcontrol.c
> > @@ -1812,24 +1812,25 @@ enum {
> >  	CHARGE_OK,		/* success */
> >  	CHARGE_RETRY,		/* need to retry but retry is not bad */
> >  	CHARGE_NOMEM,		/* we can't do more. return -ENOMEM */
> > +	CHARGE_NEED_BREAK,	/* big size allocation failure */
> >  	CHARGE_WOULDBLOCK,	/* GFP_WAIT wasn't set and no enough res. */
> >  	CHARGE_OOM_DIE,		/* the current is killed because of OOM */
> >  };
> >  
> >  static int __mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
> > -				int csize, bool oom_check)
> > +			int page_size, bool do_reclaim, bool oom_check)
> 
> I'm sorry, I can't understand why we need 'do_reclaim'. See below.
> 
> >  {
> >  	struct mem_cgroup *mem_over_limit;
> >  	struct res_counter *fail_res;
> >  	unsigned long flags = 0;
> >  	int ret;
> >  
> > -	ret = res_counter_charge(&mem->res, csize, &fail_res);
> > +	ret = res_counter_charge(&mem->res, page_size, &fail_res);
> >  
> >  	if (likely(!ret)) {
> >  		if (!do_swap_account)
> >  			return CHARGE_OK;
> > -		ret = res_counter_charge(&mem->memsw, csize, &fail_res);
> > +		ret = res_counter_charge(&mem->memsw, page_size, &fail_res);
> >  		if (likely(!ret))
> >  			return CHARGE_OK;
> >  
> > @@ -1838,14 +1839,14 @@ static int __mem_cgroup_do_charge(struct
> >  	} else
> >  		mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
> >  
> > -	if (csize > PAGE_SIZE) /* change csize and retry */
> > +	if (!do_reclaim)
> >  		return CHARGE_RETRY;
> >  
> 
> From the very beginning, do we need this "CHARGE_RETRY" ?
> 

Reduce charge_size here automatically and go back to the start of this function?
I think returning here is better.
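
To be concrete, the alternative I understand would look roughly like this
(a sketch only, not the patch):

   /* inside __mem_cgroup_do_charge() */
   retry:
           ret = res_counter_charge(&mem->res, csize, &fail_res);
           if (ret && csize > page_size) {
                   csize = page_size;      /* drop the batch size ... */
                   goto retry;             /* ... and loop in place */
           }

The patch instead returns CHARGE_RETRY and lets the caller restart, which
keeps this function simple.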


> >  	if (!(gfp_mask & __GFP_WAIT))
> >  		return CHARGE_WOULDBLOCK;
> >  
> >  	ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
> > -					gfp_mask, flags, csize);
> > +					gfp_mask, flags, page_size);
> >  	/*
> >  	 * try_to_free_mem_cgroup_pages() might not give us a full
> >  	 * picture of reclaim. Some pages are reclaimed and might be
> > @@ -1853,19 +1854,28 @@ static int __mem_cgroup_do_charge(struct
> >  	 * Check the limit again to see if the reclaim reduced the
> >  	 * current usage of the cgroup before giving up
> >  	 */
> > -	if (ret || mem_cgroup_check_under_limit(mem_over_limit, csize))
> > +	if (ret || mem_cgroup_check_under_limit(mem_over_limit, page_size))
> >  		return CHARGE_RETRY;
> >  
> >  	/*
> > +	 * When page_size > PAGE_SIZE, THP calls this function and it's
> > +	 * ok to tell 'there are not enough pages for hugepage'. THP will
> > +	 * fall back to PAGE_SIZE allocation. If we do reclaim eagerly,
> > +	 * page splitting will occur and it seems much worse.
> > +	 */
> > +	if (page_size > PAGE_SIZE)
> > +		return CHARGE_NEED_BREAK;
> > +
> > +	/*
> >  	 * At task move, charge accounts can be doubly counted. So, it's
> >  	 * better to wait until the end of task_move if something is going on.
> >  	 */
> >  	if (mem_cgroup_wait_acct_move(mem_over_limit))
> >  		return CHARGE_RETRY;
> > -
> >  	/* If we don't need to call oom-killer at el, return immediately */
> >  	if (!oom_check)
> >  		return CHARGE_NOMEM;
> > +
> >  	/* check OOM */
> >  	if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask))
> >  		return CHARGE_OOM_DIE;
> > @@ -1885,7 +1895,7 @@ static int __mem_cgroup_try_charge(struc
> >  	int nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
> >  	struct mem_cgroup *mem = NULL;
> >  	int ret;
> > -	int csize = max(CHARGE_SIZE, (unsigned long) page_size);
> > +	bool use_pcp_cache = (page_size == PAGE_SIZE);
> >  
> >  	/*
> >  	 * Unlike gloval-vm's OOM-kill, we're not in memory shortage
> > @@ -1910,7 +1920,7 @@ again:
> >  		VM_BUG_ON(css_is_removed(&mem->css));
> >  		if (mem_cgroup_is_root(mem))
> >  			goto done;
> > -		if (page_size == PAGE_SIZE && consume_stock(mem))
> > +		if (use_pcp_cache && consume_stock(mem))
> >  			goto done;
> >  		css_get(&mem->css);
> >  	} else {
> > @@ -1933,7 +1943,7 @@ again:
> >  			rcu_read_unlock();
> >  			goto done;
> >  		}
> > -		if (page_size == PAGE_SIZE && consume_stock(mem)) {
> > +		if (use_pcp_cache && consume_stock(mem)) {
> >  			/*
> >  			 * It seems dagerous to access memcg without css_get().
> >  			 * But considering how consume_stok works, it's not
> > @@ -1967,17 +1977,26 @@ again:
> >  			oom_check = true;
> >  			nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
> >  		}
> > -
> > -		ret = __mem_cgroup_do_charge(mem, gfp_mask, csize, oom_check);
> > +		if (use_pcp_cache)
> > +			ret = __mem_cgroup_do_charge(mem, gfp_mask,
> > +					CHARGE_SIZE, false, oom_check);
> > +		else
> > +			ret = __mem_cgroup_do_charge(mem, gfp_mask,
> > +					page_size, true, oom_check);
> >  
> 
> Hmm, this confuses me. I thought 'use_pcp_cache' would be used to decide
> whether we should do consume_stock() or not, so why do we change the charge
> size and reclaim behavior depending on it? I think the code itself is right,
> but the name 'use_pcp_cache' confused me.
> 

Is it a problem with the function name?
'do_batched_charge' or something like that?

I'd like to use a single 'xxxx_size' variable rather than two xxxx_size variables.



> 
> >  		switch (ret) {
> >  		case CHARGE_OK:
> >  			break;
> >  		case CHARGE_RETRY: /* not in OOM situation but retry */
> > -			csize = page_size;
> > +			if (use_pcp_cache)/* need to reclaim pages */
> > +				use_pcp_cache = false;
> >  			css_put(&mem->css);
> >  			mem = NULL;
> >  			goto again;
> > +		case CHARGE_NEED_BREAK: /* page_size > PAGE_SIZE */
> > +			css_put(&mem->css);
> > +			/* returning failure doesn't mean OOM for hugepages */
> > +			goto nomem;
> 
> I like this change.
> 
> >  		case CHARGE_WOULDBLOCK: /* !__GFP_WAIT */
> >  			css_put(&mem->css);
> >  			goto nomem;
> > @@ -1994,9 +2013,9 @@ again:
> >  			goto bypass;
> >  		}
> >  	} while (ret != CHARGE_OK);
> > -
> > -	if (csize > page_size)
> > -		refill_stock(mem, csize - page_size);
> > +	/* This flag is cleared when we fail CHARGE_SIZE charge. */
> > +	if (use_pcp_cache)
> > +		refill_stock(mem, CHARGE_SIZE - page_size);
> 
> Ditto. Can't we keep 'csize' and the old code here?
> 

I removed csize. Two 'size' variables are confusing.


Thanks.
-Kame


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 0/7] memcg : more fixes and clean up for 2.6.28-rc
  2011-01-21  6:34 ` KAMEZAWA Hiroyuki
@ 2011-01-24  0:29   ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 74+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-24  0:29 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, nishimura, hannes, balbir, akpm

On Fri, 21 Jan 2011 15:34:31 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> 
> This is a set of patches which I'm now testing, and it seems it passed
> small test. So I post this.
> 
> Some are bug fixes and other are clean ups but I think these are for 2.6.38.
> 
> Brief decription
> 
> [1/7] remove buggy comment and use better name for mem_cgroup_move_parent()
>       The fixes for mem_cgroup_move_parent() is already in mainline, this is
>       an add-on.
> 
> [2/7] a bug fix for a new function mem_cgroup_split_huge_fixup(),
>       which was recently merged.
> 
> [3/7] prepare for fixes in [4/7],[5/7]. This is an enhancement of function
>       which is used now.
> 
> [4/7] fix mem_cgroup_charge() for THP. By this, memory cgroup's charge function
>       will handle THP request in sane way.
> 
> [5/7] fix khugepaged scan condition for memcg.
>       This is a fix for hang of processes under small/buzy memory cgroup.
> 
> [6/7] rename vairable names to be page_size, nr_pages, bytes rather than
>       ambiguous names.
> 
> [7/7] some memcg function requires the caller to initialize variable
>       before call. It's ugly and fix it.
> 
> 
> I think patch 1,2,3,4,5 is urgent ones. But I think patch "5" needs some
> good review. But without "5", stress-test on small memory cgroup will not
> run succesfully.
> 

I'll rebase this set onto http://marc.info/?l=linux-mm&m=129559263207634&w=2
and post v2.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 3/7] memcg : fix mem_cgroup_check_under_limit
  2011-01-24 10:04     ` Johannes Weiner
@ 2011-01-24 10:03       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 74+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-24 10:03 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-mm, linux-kernel, nishimura, balbir, akpm

On Mon, 24 Jan 2011 11:04:34 +0100
Johannes Weiner <hannes@cmpxchg.org> wrote:
  
> >  	ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
> > -					gfp_mask, flags);
> > +					gfp_mask, flags, csize);
> >  	/*
> >  	 * try_to_free_mem_cgroup_pages() might not give us a full
> >  	 * picture of reclaim. Some pages are reclaimed and might be
> > @@ -1852,7 +1853,7 @@ static int __mem_cgroup_do_charge(struct
> >  	 * Check the limit again to see if the reclaim reduced the
> >  	 * current usage of the cgroup before giving up
> >  	 */
> > -	if (ret || mem_cgroup_check_under_limit(mem_over_limit))
> > +	if (ret || mem_cgroup_check_under_limit(mem_over_limit, csize))
> >  		return CHARGE_RETRY;
> 
> This is the only site that is really involved with THP. 

yes.

> But you need to touch every site because you change mem_cgroup_check_under_limit()
> instead of adding a new function.
> 
Yes.

> I would suggest just adding another function for checking available
> space explicitly and only changing this single call site to use it.
> 
> Just ignore the return value of mem_cgroup_hierarchical_reclaim() and
> check for enough space unconditionally.
> 
> Everybody else is happy with PAGE_SIZE pages.
> 
Hmm, OK, let's keep the changes small and see how often hugepage allocation fails.

Thanks,
-Kame




^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 3/7] memcg : fix mem_cgroup_check_under_limit
  2011-01-21  6:41   ` KAMEZAWA Hiroyuki
@ 2011-01-24 10:04     ` Johannes Weiner
  -1 siblings, 0 replies; 74+ messages in thread
From: Johannes Weiner @ 2011-01-24 10:04 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, nishimura, balbir, akpm

On Fri, Jan 21, 2011 at 03:41:41PM +0900, KAMEZAWA Hiroyuki wrote:
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Current memory cgroup code tends to assume page_size == PAGE_SIZE,
> but THP charges HPAGE_SIZE.
> 
> This is one of the fixes for supporting THP. It modifies
> mem_cgroup_check_under_limit to take page_size into account.
> 
> The full fixes for do_charge()/memory reclaim will follow this patch.
> 
> TODO: with this, the reclaim function can take page_size as an argument,
> so there may be something that should be improved.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/res_counter.h |   11 +++++++++++
>  mm/memcontrol.c             |   27 ++++++++++++++-------------
>  2 files changed, 25 insertions(+), 13 deletions(-)
> 
> Index: mmotm-0107/include/linux/res_counter.h
> ===================================================================
> --- mmotm-0107.orig/include/linux/res_counter.h
> +++ mmotm-0107/include/linux/res_counter.h
> @@ -182,6 +182,17 @@ static inline bool res_counter_check_und
>  	return ret;
>  }
>  
> +static inline s64 res_counter_check_margin(struct res_counter *cnt)
> +{
> +	s64 ret;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&cnt->lock, flags);
> +	ret = cnt->limit - cnt->usage;
> +	spin_unlock_irqrestore(&cnt->lock, flags);
> +	return ret;
> +}
> +
>  static inline bool res_counter_check_under_soft_limit(struct res_counter *cnt)
>  {
>  	bool ret;
> Index: mmotm-0107/mm/memcontrol.c
> ===================================================================
> --- mmotm-0107.orig/mm/memcontrol.c
> +++ mmotm-0107/mm/memcontrol.c
> @@ -1099,14 +1099,14 @@ unsigned long mem_cgroup_isolate_pages(u
>  #define mem_cgroup_from_res_counter(counter, member)	\
>  	container_of(counter, struct mem_cgroup, member)
>  
> -static bool mem_cgroup_check_under_limit(struct mem_cgroup *mem)
> +static bool mem_cgroup_check_under_limit(struct mem_cgroup *mem, int page_size)
>  {
>  	if (do_swap_account) {
> -		if (res_counter_check_under_limit(&mem->res) &&
> -			res_counter_check_under_limit(&mem->memsw))
> +		if (res_counter_check_margin(&mem->res) >= page_size &&
> +			res_counter_check_margin(&mem->memsw) >= page_size)
>  			return true;
>  	} else
> -		if (res_counter_check_under_limit(&mem->res))
> +		if (res_counter_check_margin(&mem->res) >= page_size)
>  			return true;
>  	return false;
>  }
> @@ -1367,7 +1367,8 @@ mem_cgroup_select_victim(struct mem_cgro
>  static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
>  						struct zone *zone,
>  						gfp_t gfp_mask,
> -						unsigned long reclaim_options)
> +						unsigned long reclaim_options,
> +						int page_size)
>  {
>  	struct mem_cgroup *victim;
>  	int ret, total = 0;
> @@ -1434,7 +1435,7 @@ static int mem_cgroup_hierarchical_recla
>  		if (check_soft) {
>  			if (res_counter_check_under_soft_limit(&root_mem->res))
>  				return total;
> -		} else if (mem_cgroup_check_under_limit(root_mem))
> +		} else if (mem_cgroup_check_under_limit(root_mem, page_size))
>  			return 1 + total;
>  	}
>  	return total;
> @@ -1844,7 +1845,7 @@ static int __mem_cgroup_do_charge(struct
>  		return CHARGE_WOULDBLOCK;
>  
>  	ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
> -					gfp_mask, flags);
> +					gfp_mask, flags, csize);
>  	/*
>  	 * try_to_free_mem_cgroup_pages() might not give us a full
>  	 * picture of reclaim. Some pages are reclaimed and might be
> @@ -1852,7 +1853,7 @@ static int __mem_cgroup_do_charge(struct
>  	 * Check the limit again to see if the reclaim reduced the
>  	 * current usage of the cgroup before giving up
>  	 */
> -	if (ret || mem_cgroup_check_under_limit(mem_over_limit))
> +	if (ret || mem_cgroup_check_under_limit(mem_over_limit, csize))
>  		return CHARGE_RETRY;

This is the only site that is really involved with THP.  But you need
to touch every site because you change mem_cgroup_check_under_limit()
instead of adding a new function.

I would suggest just adding another function for checking available
space explicitly and only changing this single call site to use it.

Just ignore the return value of mem_cgroup_hierarchical_reclaim() and
check for enough space unconditionally.

Everybody else is happy with PAGE_SIZE pages.
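
Something like this minimal sketch is what I have in mind (the helper name
is made up, untested):

   static bool mem_cgroup_check_margin(struct mem_cgroup *mem, int bytes)
   {
           if (res_counter_check_margin(&mem->res) < bytes)
                   return false;
           if (do_swap_account &&
               res_counter_check_margin(&mem->memsw) < bytes)
                   return false;
           return true;
   }

   /* in __mem_cgroup_do_charge(), ignore reclaim's return value: */
   mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL, gfp_mask, flags);
   if (mem_cgroup_check_margin(mem_over_limit, csize))
           return CHARGE_RETRY;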

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/7] memcg : comment, style fixes for recent patch of move_parent
  2011-01-21  6:37   ` KAMEZAWA Hiroyuki
@ 2011-01-24 10:14     ` Johannes Weiner
  0 siblings, 0 replies; 74+ messages in thread
From: Johannes Weiner @ 2011-01-24 10:14 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, nishimura, balbir, akpm

On Fri, Jan 21, 2011 at 03:37:26PM +0900, KAMEZAWA Hiroyuki wrote:
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> A fix for 987eba66e0e6aa654d60881a14731a353ee0acb4
> 
> A clean up for mem_cgroup_move_parent(). 
>  - remove unnecessary initialization of local variable.
>  - rename charge_size -> page_size
>  - remove unnecessary (wrong) comment.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  mm/memcontrol.c |   17 +++++++++--------
>  1 file changed, 9 insertions(+), 8 deletions(-)
> 
> Index: mmotm-0107/mm/memcontrol.c
> ===================================================================
> --- mmotm-0107.orig/mm/memcontrol.c
> +++ mmotm-0107/mm/memcontrol.c
> @@ -2265,7 +2265,7 @@ static int mem_cgroup_move_parent(struct
>  	struct cgroup *cg = child->css.cgroup;
>  	struct cgroup *pcg = cg->parent;
>  	struct mem_cgroup *parent;
> -	int charge = PAGE_SIZE;
> +	int page_size;
>  	unsigned long flags;
>  	int ret;
>  
> @@ -2278,22 +2278,23 @@ static int mem_cgroup_move_parent(struct
>  		goto out;
>  	if (isolate_lru_page(page))
>  		goto put;
> -	/* The page is isolated from LRU and we have no race with splitting */
> -	charge = PAGE_SIZE << compound_order(page);
> +
> +	page_size = PAGE_SIZE << compound_order(page);

Okay, so you remove the wrong comment, but that does not make the code
right.  What protects compound_order from reading garbage because the
page is currently splitting?

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 2/7] memcg : more fixes and clean up for 2.6.28-rc
  2011-01-21  6:39   ` KAMEZAWA Hiroyuki
@ 2011-01-24 10:14     ` Johannes Weiner
  0 siblings, 0 replies; 74+ messages in thread
From: Johannes Weiner @ 2011-01-24 10:14 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, nishimura, balbir, akpm

On Fri, Jan 21, 2011 at 03:39:28PM +0900, KAMEZAWA Hiroyuki wrote:
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> This is a fix for
> ca3e021417eed30ec2b64ce88eb0acf64aa9bc29
> 
> mem_cgroup_disabled() should be checked at splitting.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/7] memcg : comment, style fixes for recent patch of move_parent
  2011-01-24 10:14     ` Johannes Weiner
@ 2011-01-24 10:15       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 74+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-24 10:15 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-mm, linux-kernel, nishimura, balbir, akpm

On Mon, 24 Jan 2011 11:14:02 +0100
Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Fri, Jan 21, 2011 at 03:37:26PM +0900, KAMEZAWA Hiroyuki wrote:
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > A fix for 987eba66e0e6aa654d60881a14731a353ee0acb4
> > 
> > A clean up for mem_cgroup_move_parent(). 
> >  - remove unnecessary initialization of local variable.
> >  - rename charge_size -> page_size
> >  - remove unnecessary (wrong) comment.
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  mm/memcontrol.c |   17 +++++++++--------
> >  1 file changed, 9 insertions(+), 8 deletions(-)
> > 
> > Index: mmotm-0107/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-0107.orig/mm/memcontrol.c
> > +++ mmotm-0107/mm/memcontrol.c
> > @@ -2265,7 +2265,7 @@ static int mem_cgroup_move_parent(struct
> >  	struct cgroup *cg = child->css.cgroup;
> >  	struct cgroup *pcg = cg->parent;
> >  	struct mem_cgroup *parent;
> > -	int charge = PAGE_SIZE;
> > +	int page_size;
> >  	unsigned long flags;
> >  	int ret;
> >  
> > @@ -2278,22 +2278,23 @@ static int mem_cgroup_move_parent(struct
> >  		goto out;
> >  	if (isolate_lru_page(page))
> >  		goto put;
> > -	/* The page is isolated from LRU and we have no race with splitting */
> > -	charge = PAGE_SIZE << compound_order(page);
> > +
> > +	page_size = PAGE_SIZE << compound_order(page);
> 
> Okay, so you remove the wrong comment, but that does not make the code
> right.  What protects compound_order from reading garbage because the
> page is currently splitting?
> 

==
static int mem_cgroup_move_account(struct page_cgroup *pc,
                struct mem_cgroup *from, struct mem_cgroup *to,
                bool uncharge, int charge_size)
{
        int ret = -EINVAL;
        unsigned long flags;

        if ((charge_size > PAGE_SIZE) && !PageTransHuge(pc->page))
                return -EBUSY;
==

This is called under compound_lock(). So if someone breaks the THP, we
get -EBUSY and retry.


Thanks,
-Kame





^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 7/7] memcg : remove ugly variable initialization by callers
  2011-01-21  6:50   ` KAMEZAWA Hiroyuki
@ 2011-01-24 10:19     ` Johannes Weiner
  0 siblings, 0 replies; 74+ messages in thread
From: Johannes Weiner @ 2011-01-24 10:19 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, nishimura, balbir, akpm

On Fri, Jan 21, 2011 at 03:50:51PM +0900, KAMEZAWA Hiroyuki wrote:
> This is a promised one.
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> This patch removes initialization done by callers of memory cgroup
> functions. Some memory cgroup code uses the following style to carry
> the result of a start function to the end function, to avoid races.
> 
>    mem_cgroup_start_A(&(*ptr))
>    /* Something very complicated can happen here. */
>    mem_cgroup_end_A(*ptr)
> 
> In some calls, *ptr must be initialized to NULL by the caller, which
> is ugly. This patch makes the _start function initialize *ptr instead.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Nitpick: I would remove the comments above the *ptr = NULL lines;
there should be no assumptions about the consequences in the caller
(the next patch will change the caller, and then the comments are
nothing but confusing).  It's just a plain initialization of a return
value.
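
For illustration, with the generic names from the changelog above, the
pattern after this patch is roughly (a sketch, not code from the patch):

==
void mem_cgroup_start_A(struct mem_cgroup **ptr)
{
        *ptr = NULL;    /* plain initialization of the return value */
        /* ... store the memcg in *ptr when one is found ... */
}

        /* callers no longer initialize ptr themselves */
        mem_cgroup_start_A(&ptr);
        /* Something very complicated can happen here. */
        mem_cgroup_end_A(ptr);
==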

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/7] memcg : comment, style fixes for recent patch of move_parent
  2011-01-24 10:15       ` KAMEZAWA Hiroyuki
@ 2011-01-24 10:45         ` Johannes Weiner
  0 siblings, 0 replies; 74+ messages in thread
From: Johannes Weiner @ 2011-01-24 10:45 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, nishimura, balbir, akpm

On Mon, Jan 24, 2011 at 07:15:35PM +0900, KAMEZAWA Hiroyuki wrote:
> On Mon, 24 Jan 2011 11:14:02 +0100
> Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > On Fri, Jan 21, 2011 at 03:37:26PM +0900, KAMEZAWA Hiroyuki wrote:
> > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > 
> > > A fix for 987eba66e0e6aa654d60881a14731a353ee0acb4
> > > 
> > > A clean up for mem_cgroup_move_parent(). 
> > >  - remove unnecessary initialization of local variable.
> > >  - rename charge_size -> page_size
> > >  - remove unnecessary (wrong) comment.
> > > 
> > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > ---
> > >  mm/memcontrol.c |   17 +++++++++--------
> > >  1 file changed, 9 insertions(+), 8 deletions(-)
> > > 
> > > Index: mmotm-0107/mm/memcontrol.c
> > > ===================================================================
> > > --- mmotm-0107.orig/mm/memcontrol.c
> > > +++ mmotm-0107/mm/memcontrol.c
> > > @@ -2265,7 +2265,7 @@ static int mem_cgroup_move_parent(struct
> > >  	struct cgroup *cg = child->css.cgroup;
> > >  	struct cgroup *pcg = cg->parent;
> > >  	struct mem_cgroup *parent;
> > > -	int charge = PAGE_SIZE;
> > > +	int page_size;
> > >  	unsigned long flags;
> > >  	int ret;
> > >  
> > > @@ -2278,22 +2278,23 @@ static int mem_cgroup_move_parent(struct
> > >  		goto out;
> > >  	if (isolate_lru_page(page))
> > >  		goto put;
> > > -	/* The page is isolated from LRU and we have no race with splitting */
> > > -	charge = PAGE_SIZE << compound_order(page);
> > > +
> > > +	page_size = PAGE_SIZE << compound_order(page);
> > 
> > Okay, so you remove the wrong comment, but that does not make the code
> > right.  What protects compound_order from reading garbage because the
> > page is currently splitting?
> > 
> 
> ==
> static int mem_cgroup_move_account(struct page_cgroup *pc,
>                 struct mem_cgroup *from, struct mem_cgroup *to,
>                 bool uncharge, int charge_size)
> {
>         int ret = -EINVAL;
>         unsigned long flags;
> 
>         if ((charge_size > PAGE_SIZE) && !PageTransHuge(pc->page))
>                 return -EBUSY;
> ==
> 
> This is called under compound_lock(). So if someone breaks the THP, we
> get -EBUSY and retry.

This charge_size contains exactly the garbage you just read from an
unprotected compound_order().  It could be anything if the page is
split concurrently.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/7] memcg : comment, style fixes for recent patch of move_parent
  2011-01-24 10:45         ` Johannes Weiner
@ 2011-01-24 11:14           ` Hiroyuki Kamezawa
  0 siblings, 0 replies; 74+ messages in thread
From: Hiroyuki Kamezawa @ 2011-01-24 11:14 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, linux-mm, linux-kernel, nishimura, balbir, akpm

2011/1/24 Johannes Weiner <hannes@cmpxchg.org>:
> On Mon, Jan 24, 2011 at 07:15:35PM +0900, KAMEZAWA Hiroyuki wrote:
>> On Mon, 24 Jan 2011 11:14:02 +0100
>> Johannes Weiner <hannes@cmpxchg.org> wrote:
>>
>> > On Fri, Jan 21, 2011 at 03:37:26PM +0900, KAMEZAWA Hiroyuki wrote:
>> > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> > >
>> > > A fix for 987eba66e0e6aa654d60881a14731a353ee0acb4
>> > >
>> > > A clean up for mem_cgroup_move_parent().
>> > >  - remove unnecessary initialization of local variable.
>> > >  - rename charge_size -> page_size
>> > >  - remove unnecessary (wrong) comment.
>> > >
>> > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> > > ---
>> > >  mm/memcontrol.c |   17 +++++++++--------
>> > >  1 file changed, 9 insertions(+), 8 deletions(-)
>> > >
>> > > Index: mmotm-0107/mm/memcontrol.c
>> > > ===================================================================
>> > > --- mmotm-0107.orig/mm/memcontrol.c
>> > > +++ mmotm-0107/mm/memcontrol.c
>> > > @@ -2265,7 +2265,7 @@ static int mem_cgroup_move_parent(struct
>> > >   struct cgroup *cg = child->css.cgroup;
>> > >   struct cgroup *pcg = cg->parent;
>> > >   struct mem_cgroup *parent;
>> > > - int charge = PAGE_SIZE;
>> > > + int page_size;
>> > >   unsigned long flags;
>> > >   int ret;
>> > >
>> > > @@ -2278,22 +2278,23 @@ static int mem_cgroup_move_parent(struct
>> > >           goto out;
>> > >   if (isolate_lru_page(page))
>> > >           goto put;
>> > > - /* The page is isolated from LRU and we have no race with splitting */
>> > > - charge = PAGE_SIZE << compound_order(page);
>> > > +
>> > > + page_size = PAGE_SIZE << compound_order(page);
>> >
>> > Okay, so you remove the wrong comment, but that does not make the code
>> > right.  What protects compound_order from reading garbage because the
>> > page is currently splitting?
>> >
>>
>> ==
>> static int mem_cgroup_move_account(struct page_cgroup *pc,
>>                 struct mem_cgroup *from, struct mem_cgroup *to,
>>                 bool uncharge, int charge_size)
>> {
>>         int ret = -EINVAL;
>>         unsigned long flags;
>>
>>         if ((charge_size > PAGE_SIZE) && !PageTransHuge(pc->page))
>>                 return -EBUSY;
>> ==
>>
>> This is called under compound_lock(). So if someone breaks the THP, we
>> get -EBUSY and retry.
>
> This charge_size contains exactly the garbage you just read from an
> unprotected compound_order().  It could be anything if the page is
> split concurrently.

Then my recent fix to LRU accounting, which uses compound_order(), is racy, too?

I'll replace compound_order() with
  if (PageTransHuge(page))
      size = HPAGE_SIZE;

Does this work?
If there is no way to acquire the size of a page without a lock, I need to add one.
Any ideas?

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/7] memcg : comment, style fixes for recent patch of move_parent
  2011-01-24 11:14           ` Hiroyuki Kamezawa
@ 2011-01-24 11:34             ` Johannes Weiner
  0 siblings, 0 replies; 74+ messages in thread
From: Johannes Weiner @ 2011-01-24 11:34 UTC (permalink / raw)
  To: Hiroyuki Kamezawa
  Cc: KAMEZAWA Hiroyuki, linux-mm, linux-kernel, nishimura, balbir, akpm

On Mon, Jan 24, 2011 at 08:14:22PM +0900, Hiroyuki Kamezawa wrote:
> 2011/1/24 Johannes Weiner <hannes@cmpxchg.org>:
> > On Mon, Jan 24, 2011 at 07:15:35PM +0900, KAMEZAWA Hiroyuki wrote:
> >> On Mon, 24 Jan 2011 11:14:02 +0100
> >> Johannes Weiner <hannes@cmpxchg.org> wrote:
> >>
> >> > On Fri, Jan 21, 2011 at 03:37:26PM +0900, KAMEZAWA Hiroyuki wrote:
> >> > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >> > >
> >> > > A fix for 987eba66e0e6aa654d60881a14731a353ee0acb4
> >> > >
> >> > > A clean up for mem_cgroup_move_parent().
> >> > >  - remove unnecessary initialization of local variable.
> >> > >  - rename charge_size -> page_size
> >> > >  - remove unnecessary (wrong) comment.
> >> > >
> >> > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >> > > ---
> >> > >  mm/memcontrol.c |   17 +++++++++--------
> >> > >  1 file changed, 9 insertions(+), 8 deletions(-)
> >> > >
> >> > > Index: mmotm-0107/mm/memcontrol.c
> >> > > ===================================================================
> >> > > --- mmotm-0107.orig/mm/memcontrol.c
> >> > > +++ mmotm-0107/mm/memcontrol.c
> >> > > @@ -2265,7 +2265,7 @@ static int mem_cgroup_move_parent(struct
> >> > >   struct cgroup *cg = child->css.cgroup;
> >> > >   struct cgroup *pcg = cg->parent;
> >> > >   struct mem_cgroup *parent;
> >> > > - int charge = PAGE_SIZE;
> >> > > + int page_size;
> >> > >   unsigned long flags;
> >> > >   int ret;
> >> > >
> >> > > @@ -2278,22 +2278,23 @@ static int mem_cgroup_move_parent(struct
> >> > >           goto out;
> >> > >   if (isolate_lru_page(page))
> >> > >           goto put;
> >> > > - /* The page is isolated from LRU and we have no race with splitting */
> >> > > - charge = PAGE_SIZE << compound_order(page);
> >> > > +
> >> > > + page_size = PAGE_SIZE << compound_order(page);
> >> >
> >> > Okay, so you remove the wrong comment, but that does not make the code
> >> > right.  What protects compound_order from reading garbage because the
> >> > page is currently splitting?
> >> >
> >>
> >> ==
> >> static int mem_cgroup_move_account(struct page_cgroup *pc,
> >>                 struct mem_cgroup *from, struct mem_cgroup *to,
> >>                 bool uncharge, int charge_size)
> >> {
> >>         int ret = -EINVAL;
> >>         unsigned long flags;
> >>
> >>         if ((charge_size > PAGE_SIZE) && !PageTransHuge(pc->page))
> >>                 return -EBUSY;
> >> ==
> >>
> >> This is called under compound_lock(). So if someone breaks the THP, we
> >> get -EBUSY and retry.
> >
> > This charge_size contains exactly the garbage you just read from an
> > unprotected compound_order().  It could be anything if the page is
> > split concurrently.
> 
> Then my recent fix to LRU accounting, which uses compound_order(), is racy, too?

In lru add/delete/move/rotate?  No, that should be safe because we
have the lru lock there and __split_huge_page_refcount() takes the
lock as well.

> I'll replace compound_order() with
>   if (PageTransHuge(page))
>       size = HPAGE_SIZE;
> 
> Does this work?

Yes, I think this should work.  This gives a sane size for try_charge
and we still catch a split under the compound_lock later in
move_account as you described above.
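
Spelled out, the scheme we are converging on is roughly (a sketch, not
the final patch):

==
        /* in mem_cgroup_move_parent(), without the compound lock: */
        if (PageTransHuge(page))
                page_size = HPAGE_SIZE;
        else
                page_size = PAGE_SIZE;

        /* later, under compound_lock() in mem_cgroup_move_account(): */
        if ((charge_size > PAGE_SIZE) && !PageTransHuge(pc->page))
                return -EBUSY;  /* the page was split; caller retries */
==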

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 4/7] memcg : fix charge function of THP allocation.
  2011-01-21  6:44   ` KAMEZAWA Hiroyuki
@ 2011-01-27 10:34     ` Johannes Weiner
  0 siblings, 0 replies; 74+ messages in thread
From: Johannes Weiner @ 2011-01-27 10:34 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, nishimura, balbir, akpm

On Fri, Jan 21, 2011 at 03:44:30PM +0900, KAMEZAWA Hiroyuki wrote:
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> When THP is used, a hugepage-sized charge can happen. It's not handled
> correctly in mem_cgroup_do_charge(). For example, THP can fall back
> to small page allocation when HUGEPAGE allocation seems difficult
> or busy, but memory cgroup doesn't understand that and continues to
> try HUGEPAGE charging. And the worst thing is that memory cgroup
> believes 'memory reclaim succeeded' if limit - usage > PAGE_SIZE.
> 
> Because of this, khugepaged etc. can go into an infinite reclaim loop
> if tasks in the memcg are busy.
> 
> After this patch:
>  - Hugepage allocation will fail if the 1st trial of page reclaim fails.
>  - THP allocation is distinguished from batched allocation.

This does too many things at once.  Can you split this into more
patches where each one has a single objective?  Thanks.

	Hannes

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [patch] memcg: prevent endless loop with huge pages and near-limit group
  2011-01-27 10:34     ` Johannes Weiner
@ 2011-01-27 10:40       ` Johannes Weiner
  0 siblings, 0 replies; 74+ messages in thread
From: Johannes Weiner @ 2011-01-27 10:40 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, nishimura, balbir, akpm

This is a patch I sent to Andrea ages ago in response to a RHEL
bugzilla.  Not sure why it did not reach mainline...  But it fixes one
issue you described in 4/7, namely looping around a limit that is not
exceeded but no longer has room for a huge page.

---
From: Johannes Weiner <hannes@cmpxchg.org>
Subject: [patch] memcg: prevent endless loop with huge pages and near-limit group

If reclaim after a failed charging was unsuccessful, the limits are
checked again, just in case they settled by means of other tasks.

This is all fine as long as every charge is of size PAGE_SIZE, because
in that case, being below the limit means having at least PAGE_SIZE
bytes available.

But with transparent huge pages, we may end up in an endless loop
where charging and reclaim fail, but we keep going because the limits
are not yet exceeded, although not allowing for a huge page.

Fix this up by explicitly checking for enough room, not just whether
we are within limits.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/res_counter.h |   12 ++++++++++++
 mm/memcontrol.c             |   20 +++++++++++---------
 2 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index fcb9884..03212e4 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -182,6 +182,18 @@ static inline bool res_counter_check_under_limit(struct res_counter *cnt)
 	return ret;
 }
 
+static inline bool res_counter_check_room(struct res_counter *cnt,
+					  unsigned long room)
+{
+	bool ret;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	ret = cnt->limit - cnt->usage >= room;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return ret;
+}
+
 static inline bool res_counter_check_under_soft_limit(struct res_counter *cnt)
 {
 	bool ret;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d572102..8fa4be3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1111,6 +1111,15 @@ static bool mem_cgroup_check_under_limit(struct mem_cgroup *mem)
 	return false;
 }
 
+static bool mem_cgroup_check_room(struct mem_cgroup *mem, unsigned long room)
+{
+	if (!res_counter_check_room(&mem->res, room))
+		return false;
+	if (!do_swap_account)
+		return true;
+	return res_counter_check_room(&mem->memsw, room);
+}
+
 static unsigned int get_swappiness(struct mem_cgroup *memcg)
 {
 	struct cgroup *cgrp = memcg->css.cgroup;
@@ -1844,16 +1853,9 @@ static int __mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
 	if (!(gfp_mask & __GFP_WAIT))
 		return CHARGE_WOULDBLOCK;
 
-	ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
+	mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
 					gfp_mask, flags);
-	/*
-	 * try_to_free_mem_cgroup_pages() might not give us a full
-	 * picture of reclaim. Some pages are reclaimed and might be
-	 * moved to swap cache or just unmapped from the cgroup.
-	 * Check the limit again to see if the reclaim reduced the
-	 * current usage of the cgroup before giving up
-	 */
-	if (ret || mem_cgroup_check_under_limit(mem_over_limit))
+	if (mem_cgroup_check_room(mem_over_limit, csize))
 		return CHARGE_RETRY;
 
 	/*
-- 
1.7.3.5

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [patch 2/3] memcg: prevent endless loop on huge page charge
  2011-01-27 10:34     ` Johannes Weiner
@ 2011-01-27 13:46       ` Johannes Weiner
  0 siblings, 0 replies; 74+ messages in thread
From: Johannes Weiner @ 2011-01-27 13:46 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, nishimura, balbir, akpm

The charging code can encounter a charge size that is bigger than a
regular page in two situations: one is a batched charge to fill the
per-cpu stocks, the other is a huge page charge.

This code is distributed over two functions, however, and only the
outer one is aware of huge pages.  In case the charging fails, the
inner function will tell the outer function to retry if the charge
size is bigger than a regular page--assuming batched charging is the
only case.  And the outer function will retry forever charging a huge
page.

This patch makes sure the inner function can distinguish between batch
charging and a single huge page charge.  It will only signal another
attempt if batch charging failed, and go into regular reclaim when it
is called on behalf of a huge page.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8fa4be3..17c4e36 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1847,7 +1847,7 @@ static int __mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
 	} else
 		mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
 
-	if (csize > PAGE_SIZE) /* change csize and retry */
+	if (csize == CHARGE_SIZE) /* retry without batching */
 		return CHARGE_RETRY;
 
 	if (!(gfp_mask & __GFP_WAIT))
-- 
1.7.3.5


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [patch 3/3] memcg: never OOM when charging huge pages
  2011-01-27 10:34     ` Johannes Weiner
@ 2011-01-27 13:47       ` Johannes Weiner
  0 siblings, 0 replies; 74+ messages in thread
From: Johannes Weiner @ 2011-01-27 13:47 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, nishimura, balbir, akpm

Huge page coverage should obviously have lower priority than the
continued execution of a process.

Never kill a process when charging a huge page to it fails.  Instead,
give up after the first failed reclaim attempt and fall back to
regular pages.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 17c4e36..2945649 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1890,6 +1890,13 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 	int csize = max(CHARGE_SIZE, (unsigned long) page_size);
 
 	/*
+	 * Do not OOM on huge pages.  Fall back to regular pages after
+	 * the first failed reclaim attempt.
+	 */
+	if (page_size > PAGE_SIZE)
+		oom = false;
+
+	/*
 	 * Unlike gloval-vm's OOM-kill, we're not in memory shortage
 	 * in system level. So, allow to go ahead dying process in addition to
 	 * MEMDIE process.
-- 
1.7.3.5


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [patch 2/3] memcg: prevent endless loop on huge page charge
  2011-01-27 13:46       ` Johannes Weiner
@ 2011-01-27 14:00         ` Gleb Natapov
  0 siblings, 0 replies; 74+ messages in thread
From: Gleb Natapov @ 2011-01-27 14:00 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, linux-mm, linux-kernel, nishimura, balbir, akpm

On Thu, Jan 27, 2011 at 02:46:45PM +0100, Johannes Weiner wrote:
> The charging code can encounter a charge size that is bigger than a
> regular page in two situations: one is a batched charge to fill the
> per-cpu stocks, the other is a huge page charge.
> 
> This code is distributed over two functions, however, and only the
> outer one is aware of huge pages.  In case the charging fails, the
> inner function will tell the outer function to retry if the charge
> size is bigger than a regular page--assuming batched charging is the
> only case.  And the outer function will retry forever charging a huge
> page.
> 
> This patch makes sure the inner function can distinguish between batch
> charging and a single huge page charge.  It will only signal another
> attempt if batch charging failed, and go into regular reclaim when it
> is called on behalf of a huge page.
> 
Yeah, that is exactly the case I am debugging right now. I came up with a
different solution: pass page_size to __mem_cgroup_do_charge() and
compare csize with page_size (not CHARGE_SIZE). Not sure which solution
is more correct.
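
Roughly, that alternative would be (a sketch with everything but the
changed check elided; I have not posted a patch for it):

==
        /* __mem_cgroup_do_charge() gains a page_size argument; then: */
        if (csize > page_size)  /* batched charge failed; retry unbatched */
                return CHARGE_RETRY;
==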

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/memcontrol.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8fa4be3..17c4e36 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1847,7 +1847,7 @@ static int __mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
>  	} else
>  		mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
>  
> -	if (csize > PAGE_SIZE) /* change csize and retry */
> +	if (csize == CHARGE_SIZE) /* retry without batching */
>  		return CHARGE_RETRY;
>  
>  	if (!(gfp_mask & __GFP_WAIT))
> -- 
> 1.7.3.5
> 

--
			Gleb.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch 2/3] memcg: prevent endless loop on huge page charge
  2011-01-27 14:00         ` Gleb Natapov
@ 2011-01-27 14:14           ` Johannes Weiner
  0 siblings, 0 replies; 74+ messages in thread
From: Johannes Weiner @ 2011-01-27 14:14 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: KAMEZAWA Hiroyuki, linux-mm, linux-kernel, nishimura, balbir, akpm

Hi Gleb,

On Thu, Jan 27, 2011 at 04:00:24PM +0200, Gleb Natapov wrote:
> On Thu, Jan 27, 2011 at 02:46:45PM +0100, Johannes Weiner wrote:
> > The charging code can encounter a charge size that is bigger than a
> > regular page in two situations: one is a batched charge to fill the
> > per-cpu stocks, the other is a huge page charge.
> > 
> > This code is distributed over two functions, however, and only the
> > outer one is aware of huge pages.  In case the charging fails, the
> > inner function will tell the outer function to retry if the charge
> > size is bigger than regular pages--assuming batched charging is the
> > only case.  And the outer function will retry forever charging a huge
> > page.
> > 
> > This patch makes sure the inner function can distinguish between batch
> > charging and a single huge page charge.  It will only signal another
> > attempt if batch charging failed, and go into regular reclaim when it
> > is called on behalf of a huge page.
> > 
> Yeah, that is exactly the case I am debugging right now. I came up with
> a different solution: pass page_size to __mem_cgroup_do_charge() and
> compare csize with page_size (not CHARGE_SIZE). Not sure which solution
> is more correct.

I guess it makes no difference, but using CHARGE_SIZE gets away
without adding another parameter to __mem_cgroup_do_charge().
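
Just to spell out the two variants next to each other (both untested):

	if (csize == CHARGE_SIZE)	/* this patch: recognize the batch size */
		return CHARGE_RETRY;

	if (csize > page_size)		/* your variant: needs page_size passed down */
		return CHARGE_RETRY;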

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 4/7] memcg : fix charge function of THP allocation.
  2011-01-27 10:34     ` Johannes Weiner
@ 2011-01-27 14:18       ` Johannes Weiner
  0 siblings, 0 replies; 74+ messages in thread
From: Johannes Weiner @ 2011-01-27 14:18 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, nishimura, balbir, akpm

On Thu, Jan 27, 2011 at 11:34:38AM +0100, Johannes Weiner wrote:
> On Fri, Jan 21, 2011 at 03:44:30PM +0900, KAMEZAWA Hiroyuki wrote:
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > When THP is used, hugepage-sized charges can happen. This is not handled
> > correctly in mem_cgroup_do_charge(). For example, THP can fall back
> > to small page allocation when HUGEPAGE allocation seems difficult
> > or busy, but memory cgroup doesn't understand this and continues to
> > try HUGEPAGE charging. And the worst thing is that memory cgroup
> > believes 'memory reclaim succeeded' if limit - usage > PAGE_SIZE.
> > 
> > Because of this, khugepaged etc. can go into an infinite reclaim loop
> > if tasks in the memcg are busy.
> > 
> > After this patch
> >  - Hugepage allocation will fail if the 1st trial of page reclaim fails.
> >  - THP allocation is distinguished from batched allocation.
> 
> This does too many things at once.  Can you split this into more
> patches where each one has a single objective?  Thanks.

So I sent three patches that, I think, fix the same issues this patch
fixes, only they are much simpler.

The more I look at this code, though, the less confident I am in it...
Can you guys give it a good look?

Thanks,
	Hannes

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 4/7] memcg : fix charge function of THP allocation.
  2011-01-27 10:34     ` Johannes Weiner
@ 2011-01-27 23:38       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 74+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-27 23:38 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-mm, linux-kernel, nishimura, balbir, akpm

On Thu, 27 Jan 2011 11:34:38 +0100
Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Fri, Jan 21, 2011 at 03:44:30PM +0900, KAMEZAWA Hiroyuki wrote:
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > When THP is used, hugepage-sized charges can happen. This is not handled
> > correctly in mem_cgroup_do_charge(). For example, THP can fall back
> > to small page allocation when HUGEPAGE allocation seems difficult
> > or busy, but memory cgroup doesn't understand this and continues to
> > try HUGEPAGE charging. And the worst thing is that memory cgroup
> > believes 'memory reclaim succeeded' if limit - usage > PAGE_SIZE.
> > 
> > Because of this, khugepaged etc. can go into an infinite reclaim loop
> > if tasks in the memcg are busy.
> > 
> > After this patch
> >  - Hugepage allocation will fail if the 1st trial of page reclaim fails.
> >  - THP allocation is distinguished from batched allocation.
> 
> This does too many things at once.  Can you split this into more
> patches where each one has a single objective?  Thanks.
> 

Sure, will do.

I'm now testing new ones.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch] memcg: prevent endless loop with huge pages and near-limit group
  2011-01-27 10:40       ` Johannes Weiner
@ 2011-01-27 23:40         ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 74+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-27 23:40 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-mm, linux-kernel, nishimura, balbir, akpm

On Thu, 27 Jan 2011 11:40:14 +0100
Johannes Weiner <hannes@cmpxchg.org> wrote:

> This is a patch I sent to Andrea ages ago in response to a RHEL
> bugzilla.  Not sure why it did not reach mainline...  But it fixes one
> issue you described in 4/7, namely looping around a not-yet-exceeded
> limit with a huge page that won't fit anymore.
> 
> ---
> From: Johannes Weiner <hannes@cmpxchg.org>
> Subject: [patch] memcg: prevent endless loop with huge pages and near-limit group
> 
> If reclaim after a failed charging was unsuccessful, the limits are
> checked again, just in case they settled by means of other tasks.
> 
> This is all fine as long as every charge is of size PAGE_SIZE, because
> in that case, being below the limit means having at least PAGE_SIZE
> bytes available.
> 
> But with transparent huge pages, we may end up in an endless loop
> where charging and reclaim fail, but we keep going because the limits
> are not yet exceeded, although not allowing for a huge page.
> 
> Fix this up by explicitly checking for enough room, not just whether
> we are within limits.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>


Okay, this seems to have the same concept as mine: with, say, a 4MiB
limit and 4MiB - 4KiB of usage, check_under_limit() still succeeds even
though a 2MiB huge page can never fit.
-Kame

> ---
>  include/linux/res_counter.h |   12 ++++++++++++
>  mm/memcontrol.c             |   20 +++++++++++---------
>  2 files changed, 23 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> index fcb9884..03212e4 100644
> --- a/include/linux/res_counter.h
> +++ b/include/linux/res_counter.h
> @@ -182,6 +182,18 @@ static inline bool res_counter_check_under_limit(struct res_counter *cnt)
>  	return ret;
>  }
>  
> +static inline bool res_counter_check_room(struct res_counter *cnt,
> +					  unsigned long room)
> +{
> +	bool ret;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&cnt->lock, flags);
> +	ret = cnt->limit - cnt->usage >= room;
> +	spin_unlock_irqrestore(&cnt->lock, flags);
> +	return ret;
> +}
> +
>  static inline bool res_counter_check_under_soft_limit(struct res_counter *cnt)
>  {
>  	bool ret;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d572102..8fa4be3 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1111,6 +1111,15 @@ static bool mem_cgroup_check_under_limit(struct mem_cgroup *mem)
>  	return false;
>  }
>  
> +static bool mem_cgroup_check_room(struct mem_cgroup *mem, unsigned long room)
> +{
> +	if (!res_counter_check_room(&mem->res, room))
> +		return false;
> +	if (!do_swap_account)
> +		return true;
> +	return res_counter_check_room(&mem->memsw, room);
> +}
> +
>  static unsigned int get_swappiness(struct mem_cgroup *memcg)
>  {
>  	struct cgroup *cgrp = memcg->css.cgroup;
> @@ -1844,16 +1853,9 @@ static int __mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
>  	if (!(gfp_mask & __GFP_WAIT))
>  		return CHARGE_WOULDBLOCK;
>  
> -	ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
> +	mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
>  					gfp_mask, flags);
> -	/*
> -	 * try_to_free_mem_cgroup_pages() might not give us a full
> -	 * picture of reclaim. Some pages are reclaimed and might be
> -	 * moved to swap cache or just unmapped from the cgroup.
> -	 * Check the limit again to see if the reclaim reduced the
> -	 * current usage of the cgroup before giving up
> -	 */
> -	if (ret || mem_cgroup_check_under_limit(mem_over_limit))
> +	if (mem_cgroup_check_room(mem_over_limit, csize))
>  		return CHARGE_RETRY;
>  
>  	/*
> -- 
> 1.7.3.5
> 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch 2/3] memcg: prevent endless loop on huge page charge
  2011-01-27 14:14           ` Johannes Weiner
@ 2011-01-27 23:41             ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 74+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-27 23:41 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Gleb Natapov, linux-mm, linux-kernel, nishimura, balbir, akpm

On Thu, 27 Jan 2011 15:14:51 +0100
Johannes Weiner <hannes@cmpxchg.org> wrote:

> Hi Gleb,
> 
> On Thu, Jan 27, 2011 at 04:00:24PM +0200, Gleb Natapov wrote:
> > On Thu, Jan 27, 2011 at 02:46:45PM +0100, Johannes Weiner wrote:
> > > The charging code can encounter a charge size that is bigger than a
> > > regular page in two situations: one is a batched charge to fill the
> > > per-cpu stocks, the other is a huge page charge.
> > > 
> > > This code is distributed over two functions, however, and only the
> > > outer one is aware of huge pages.  In case the charging fails, the
> > > inner function will tell the outer function to retry if the charge
> > > size is bigger than regular pages--assuming batched charging is the
> > > only case.  And the outer function will retry forever charging a huge
> > > page.
> > > 
> > > This patch makes sure the inner function can distinguish between batch
> > > charging and a single huge page charge.  It will only signal another
> > > attempt if batch charging failed, and go into regular reclaim when it
> > > is called on behalf of a huge page.
> > > 
> > Yeah, that is exactly the case I am debugging right now. I came up with
> > a different solution: pass page_size to __mem_cgroup_do_charge() and
> > compare csize with page_size (not CHARGE_SIZE). Not sure which solution
> > is more correct.
> 
> I guess it makes no difference, but using CHARGE_SIZE gets away
> without adding another parameter to __mem_cgroup_do_charge().
> 

My new one is similar to this ;)

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch 3/3] memcg: never OOM when charging huge pages
  2011-01-27 13:47       ` Johannes Weiner
@ 2011-01-27 23:44         ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 74+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-27 23:44 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-mm, linux-kernel, nishimura, balbir, akpm

On Thu, 27 Jan 2011 14:47:03 +0100
Johannes Weiner <hannes@cmpxchg.org> wrote:

> Huge page coverage should obviously have less priority than the
> continued execution of a process.
> 
> Never kill a process when charging it a huge page fails.  Instead,
> give up after the first failed reclaim attempt and fall back to
> regular pages.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Ah, I missed this. I'll merge this patch into my series.
Thank you for pointing it out.

But what I have to point out here is that the khugepaged lock-up issue
needs another fix for khugepaged itself: it should skip hopeless memcgs.
I think I can send patches today.
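
Roughly what I have in mind -- an untested sketch, where the hook point
in the khugepaged scan loop is illustrative and mem_cgroup_check_room()
is the helper from your earlier patch:

	/*
	 * Don't collapse pages for an mm whose memcg has no room for a
	 * huge page; reclaiming on its behalf can loop forever.
	 */
	if (!mem_cgroup_check_room(memcg, HPAGE_PMD_SIZE))
		continue;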

Thanks,
-Kame


> ---
>  mm/memcontrol.c |    7 +++++++
>  1 files changed, 7 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 17c4e36..2945649 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1890,6 +1890,13 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>  	int csize = max(CHARGE_SIZE, (unsigned long) page_size);
>  
>  	/*
> +	 * Do not OOM on huge pages.  Fall back to regular pages after
> +	 * the first failed reclaim attempt.
> +	 */
> +	if (page_size > PAGE_SIZE)
> +		oom = false;
> +
> +	/*
>  	 * Unlike gloval-vm's OOM-kill, we're not in memory shortage
>  	 * in system level. So, allow to go ahead dying process in addition to
>  	 * MEMDIE process.
> -- 
> 1.7.3.5
> 


^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch 3/3] memcg: never OOM when charging huge pages
  2011-01-27 13:47       ` Johannes Weiner
@ 2011-01-27 23:45         ` Daisuke Nishimura
  0 siblings, 0 replies; 74+ messages in thread
From: Daisuke Nishimura @ 2011-01-27 23:45 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, linux-mm, linux-kernel, balbir, akpm,
	Daisuke Nishimura

On Thu, 27 Jan 2011 14:47:03 +0100
Johannes Weiner <hannes@cmpxchg.org> wrote:

> Huge page coverage should obviously have less priority than the
> continued execution of a process.
> 
> Never kill a process when charging it a huge page fails.  Instead,
> give up after the first failed reclaim attempt and fall back to
> regular pages.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/memcontrol.c |    7 +++++++
>  1 files changed, 7 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 17c4e36..2945649 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1890,6 +1890,13 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>  	int csize = max(CHARGE_SIZE, (unsigned long) page_size);
>  
>  	/*
> +	 * Do not OOM on huge pages.  Fall back to regular pages after
> +	 * the first failed reclaim attempt.
> +	 */
> +	if (page_size > PAGE_SIZE)
> +		oom = false;
> +
> +	/*
>  	 * Unlike gloval-vm's OOM-kill, we're not in memory shortage
>  	 * in system level. So, allow to go ahead dying process in addition to
>  	 * MEMDIE process.
> -- 
> 1.7.3.5
> 
__mem_cgroup_try_charge() already has an "oom" switch, so I would prefer
making the callers use that switch properly themselves.
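
Roughly what I mean at the call sites -- an untested sketch, with the
argument list of __mem_cgroup_try_charge() abbreviated:

	/* the caller decides: never OOM-kill for a huge page charge */
	bool oom = (page_size == PAGE_SIZE);

	ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, oom, page_size);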

Thanks,
Daisuke Nishimura.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [patch 3/3] memcg: never OOM when charging huge pages
  2011-01-27 23:45         ` Daisuke Nishimura
@ 2011-01-27 23:49           ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 74+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-27 23:49 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: Johannes Weiner, linux-mm, linux-kernel, balbir, akpm

On Fri, 28 Jan 2011 08:45:40 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Thu, 27 Jan 2011 14:47:03 +0100
> Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > Huge page coverage should obviously have less priority than the
> > continued execution of a process.
> > 
> > Never kill a process when charging it a huge page fails.  Instead,
> > give up after the first failed reclaim attempt and fall back to
> > regular pages.
> > 
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> >  mm/memcontrol.c |    7 +++++++
> >  1 files changed, 7 insertions(+), 0 deletions(-)
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 17c4e36..2945649 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1890,6 +1890,13 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> >  	int csize = max(CHARGE_SIZE, (unsigned long) page_size);
> >  
> >  	/*
> > +	 * Do not OOM on huge pages.  Fall back to regular pages after
> > +	 * the first failed reclaim attempt.
> > +	 */
> > +	if (page_size > PAGE_SIZE)
> > +		oom = false;
> > +
> > +	/*
> >  	 * Unlike gloval-vm's OOM-kill, we're not in memory shortage
> >  	 * in system level. So, allow to go ahead dying process in addition to
> >  	 * MEMDIE process.
> > -- 
> > 1.7.3.5
> > 
> __mem_cgroup_try_charge() already has an "oom" switch, so I would prefer
> making the callers use that switch properly themselves.
> 

Okay, will do.

-Kame


^ permalink raw reply	[flat|nested] 74+ messages in thread

end of thread

Thread overview: 74+ messages
2011-01-21  6:34 [PATCH 0/7] memcg : more fixes and clean up for 2.6.28-rc KAMEZAWA Hiroyuki
2011-01-21  6:37 ` [PATCH 1/7] memcg : comment, style fixes for recent patch of move_parent KAMEZAWA Hiroyuki
2011-01-21  7:16   ` Daisuke Nishimura
2011-01-24 10:14   ` Johannes Weiner
2011-01-24 10:15     ` KAMEZAWA Hiroyuki
2011-01-24 10:45       ` Johannes Weiner
2011-01-24 11:14         ` Hiroyuki Kamezawa
2011-01-24 11:34           ` Johannes Weiner
2011-01-21  6:39 ` [PATCH 2/7] memcg : more fixes and clean up for 2.6.28-rc KAMEZAWA Hiroyuki
2011-01-21  7:17   ` Daisuke Nishimura
2011-01-24 10:14   ` Johannes Weiner
2011-01-21  6:41 ` [PATCH 3/7] memcg : fix mem_cgroup_check_under_limit KAMEZAWA Hiroyuki
2011-01-21  7:45   ` Daisuke Nishimura
2011-01-24 10:04   ` Johannes Weiner
2011-01-24 10:03     ` KAMEZAWA Hiroyuki
2011-01-21  6:44 ` [PATCH 4/7] memcg : fix charge function of THP allocation KAMEZAWA Hiroyuki
2011-01-21  8:48   ` Daisuke Nishimura
2011-01-24  0:14     ` KAMEZAWA Hiroyuki
2011-01-27 10:34   ` Johannes Weiner
2011-01-27 10:40     ` [patch] memcg: prevent endless loop with huge pages and near-limit group Johannes Weiner
2011-01-27 23:40       ` KAMEZAWA Hiroyuki
2011-01-27 13:46     ` [patch 2/3] memcg: prevent endless loop on huge page charge Johannes Weiner
2011-01-27 14:00       ` Gleb Natapov
2011-01-27 14:14         ` Johannes Weiner
2011-01-27 23:41           ` KAMEZAWA Hiroyuki
2011-01-27 13:47     ` [patch 3/3] memcg: never OOM when charging huge pages Johannes Weiner
2011-01-27 23:44       ` KAMEZAWA Hiroyuki
2011-01-27 23:45       ` Daisuke Nishimura
2011-01-27 23:49         ` KAMEZAWA Hiroyuki
2011-01-27 14:18     ` [PATCH 4/7] memcg : fix charge function of THP allocation Johannes Weiner
2011-01-27 23:38     ` KAMEZAWA Hiroyuki
2011-01-21  6:46 ` [PATCH 5/7] memcg : fix khugepaged scan of process under buzy memcg KAMEZAWA Hiroyuki
2011-01-21  6:49 ` [PATCH 6/7] memcg : use better variable name KAMEZAWA Hiroyuki
2011-01-21  6:50 ` [PATCH 7/7] memcg : remove ugly vairable initialization by callers KAMEZAWA Hiroyuki
2011-01-21  9:17   ` Daisuke Nishimura
2011-01-24 10:19   ` Johannes Weiner
2011-01-24  0:29 ` [PATCH 0/7] memcg : more fixes and clean up for 2.6.28-rc KAMEZAWA Hiroyuki
