* [BUGFIX][PATCH 0/4] Fixes for memcg with THP
@ 2011-01-28  3:22 ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 66+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-28  3:22 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, hannes, nishimura, balbir


On a recent -mm kernel, when I run make -j 8 under a 200M memcg limit, as below:
==
# mount -t cgroup none /cgroup/memory -o memory
# mkdir /cgroup/memory/A
# echo 200M > /cgroup/memory/A/memory.limit_in_bytes
# echo $$ > /cgroup/memory/A/tasks
# make -j 8 kernel
==

I see hangs with khugepaged. That's because memcg's memory reclaim
routine doesn't handle HUGE_PAGE requests in a proper way, and
khugepaged doesn't know about memcg.

This patch set fixes the above hang. Patches 1-3 seem obvious and
follow the same concept as the patches in RHEL.

Patch 4 may need discussion. I think this version is much simpler
because I dropped almost all cosmetic changes. Please review.

This patch set is against mmotm-0125.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 66+ messages in thread


* [BUGFIX][PATCH 1/4] memcg: fix limit estimation at reclaim for hugepage
  2011-01-28  3:22 ` KAMEZAWA Hiroyuki
@ 2011-01-28  3:24   ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 66+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-28  3:24 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, hannes, nishimura, balbir

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

The current memory cgroup code tends to assume page_size == PAGE_SIZE,
and its handling of THP is not sufficient yet.

This is one of the fixes for supporting THP. It adds
mem_cgroup_check_margin() and checks whether the required amount of
free resource is available after memory reclaim. With this, a THP
page allocation can tell whether it really succeeded and avoid an
infinite loop and hangup.

Complete fixes for do_charge()/memory reclaim will follow this patch.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/res_counter.h |   11 +++++++++++
 mm/memcontrol.c             |   25 ++++++++++++++++++++++++-
 2 files changed, 35 insertions(+), 1 deletion(-)

Index: mmotm-0125/include/linux/res_counter.h
===================================================================
--- mmotm-0125.orig/include/linux/res_counter.h
+++ mmotm-0125/include/linux/res_counter.h
@@ -182,6 +182,17 @@ static inline bool res_counter_check_und
 	return ret;
 }
 
+static inline s64 res_counter_check_margin(struct res_counter *cnt)
+{
+	s64 ret;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	ret = cnt->limit - cnt->usage;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return ret;
+}
+
 static inline bool res_counter_check_under_soft_limit(struct res_counter *cnt)
 {
 	bool ret;
Index: mmotm-0125/mm/memcontrol.c
===================================================================
--- mmotm-0125.orig/mm/memcontrol.c
+++ mmotm-0125/mm/memcontrol.c
@@ -1111,6 +1111,22 @@ static bool mem_cgroup_check_under_limit
 	return false;
 }
 
+static s64  mem_cgroup_check_margin(struct mem_cgroup *mem)
+{
+	s64 mem_margin;
+
+	if (do_swap_account) {
+		s64 memsw_margin;
+
+		mem_margin = res_counter_check_margin(&mem->res);
+		memsw_margin = res_counter_check_margin(&mem->memsw);
+		if (mem_margin > memsw_margin)
+			mem_margin = memsw_margin;
+	} else
+		mem_margin = res_counter_check_margin(&mem->res);
+	return mem_margin;
+}
+
 static unsigned int get_swappiness(struct mem_cgroup *memcg)
 {
 	struct cgroup *cgrp = memcg->css.cgroup;
@@ -1853,7 +1869,14 @@ static int __mem_cgroup_do_charge(struct
 	 * Check the limit again to see if the reclaim reduced the
 	 * current usage of the cgroup before giving up
 	 */
-	if (ret || mem_cgroup_check_under_limit(mem_over_limit))
+	if (mem_cgroup_check_margin(mem_over_limit) >= csize)
+		return CHARGE_RETRY;
+
+	/*
+ 	 * If the charge size is a PAGE_SIZE, it's not hopeless while
+ 	 * we can reclaim a page.
+ 	 */
+	if (csize == PAGE_SIZE && ret)
 		return CHARGE_RETRY;
 
 	/*


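The retry decision this patch introduces can be modeled in plain user-space C. This is an illustrative sketch, not kernel code: `res_counter_model`, `check_margin()`, and `should_retry_charge()` are stand-in names, and the real res_counter_check_margin() takes cnt->lock.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for struct res_counter: usage and limit in bytes. */
struct res_counter_model {
	int64_t usage;
	int64_t limit;
};

/* Free room under the limit; the kernel version holds cnt->lock. */
static int64_t check_margin(const struct res_counter_model *cnt)
{
	return cnt->limit - cnt->usage;
}

/*
 * Modeled on the patched __mem_cgroup_do_charge(): retry the charge
 * only if the whole request now fits in the margin, or if a single
 * page was requested and reclaim made any progress at all.  A large
 * (THP) request that still does not fit must not loop forever.
 */
static int should_retry_charge(const struct res_counter_model *cnt,
			       int64_t csize, int64_t page_size,
			       int reclaim_progress)
{
	if (check_margin(cnt) >= csize)
		return 1;
	if (csize == page_size && reclaim_progress)
		return 1;
	return 0;
}
```

Under this model, a 2MB request against a 10MB margin retries, while an 11MB request does not, even if reclaim freed a few pages.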


* [BUGFIX][PATCH 2/4] memcg: fix charge path for THP and allow early retirement
  2011-01-28  3:22 ` KAMEZAWA Hiroyuki
@ 2011-01-28  3:26   ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 66+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-28  3:26 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, hannes, nishimura, balbir

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

When THP is used, hugepage-sized charges can happen. They are not
handled correctly in mem_cgroup_do_charge(). For example, THP can
fall back to small page allocation when a HUGEPAGE allocation seems
difficult or busy, but the memory cgroup doesn't understand this and
continues to try HUGEPAGE charging. Worst of all, the memory cgroup
believes 'memory reclaim succeeded' if limit - usage > PAGE_SIZE.

Because of this, khugepaged etc. can go into an infinite reclaim
loop if tasks in the memcg are busy.

After this patch:
 - Hugepage allocation will fail if the 1st trial of page reclaim fails.

Changelog:
 - made the changes small; removed the renaming code.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   28 ++++++++++++++++++++++++----
 1 file changed, 24 insertions(+), 4 deletions(-)

Index: mmotm-0125/mm/memcontrol.c
===================================================================
--- mmotm-0125.orig/mm/memcontrol.c
+++ mmotm-0125/mm/memcontrol.c
@@ -1827,10 +1827,14 @@ enum {
 	CHARGE_OK,		/* success */
 	CHARGE_RETRY,		/* need to retry but retry is not bad */
 	CHARGE_NOMEM,		/* we can't do more. return -ENOMEM */
+	CHARGE_NEED_BREAK,	/* big size allocation failure */
 	CHARGE_WOULDBLOCK,	/* GFP_WAIT wasn't set and no enough res. */
 	CHARGE_OOM_DIE,		/* the current is killed because of OOM */
 };
 
+/*
+ * Now we have 3 charge sizes: PAGE_SIZE, HPAGE_SIZE, and the batched
+ * allocation size.
+ */
 static int __mem_cgroup_do_charge(struct mem_cgroup *mem, gfp_t gfp_mask,
 				int csize, bool oom_check)
 {
@@ -1854,9 +1858,6 @@ static int __mem_cgroup_do_charge(struct
 	} else
 		mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
 
-	if (csize > PAGE_SIZE) /* change csize and retry */
-		return CHARGE_RETRY;
-
 	if (!(gfp_mask & __GFP_WAIT))
 		return CHARGE_WOULDBLOCK;
 
@@ -1880,6 +1881,13 @@ static int __mem_cgroup_do_charge(struct
 		return CHARGE_RETRY;
 
 	/*
+	 * If the request size is larger than PAGE_SIZE, this is not OOM
+	 * and the caller will retry with a smaller size.
+	 */
+	if (csize != PAGE_SIZE)
+		return CHARGE_NEED_BREAK;
+
+	/*
 	 * At task move, charge accounts can be doubly counted. So, it's
 	 * better to wait until the end of task_move if something is going on.
 	 */
@@ -1997,10 +2005,22 @@ again:
 		case CHARGE_OK:
 			break;
 		case CHARGE_RETRY: /* not in OOM situation but retry */
-			csize = page_size;
 			css_put(&mem->css);
 			mem = NULL;
 			goto again;
+		case CHARGE_NEED_BREAK: /* page_size > PAGE_SIZE */
+			css_put(&mem->css);
+			/*
+			 * We'll come here in 2 cases: batched charge and
+			 * hugepage alloc. A batched charge can retry
+			 * with a smaller page size; a hugepage charge
+			 * should return NOMEM. This doesn't mean OOM.
+			 */
+			if (page_size > PAGE_SIZE)
+				goto nomem;
+			csize = page_size;
+			mem = NULL;
+			goto again;
 		case CHARGE_WOULDBLOCK: /* !__GFP_WAIT */
 			css_put(&mem->css);
 			goto nomem;


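The caller-side handling of the new CHARGE_NEED_BREAK result can be sketched in user-space C. Names follow the patch, but this is a model under the assumption that PAGE_SIZE is 4096, not the real memcg code:

```c
#include <assert.h>

enum charge_result {
	CHARGE_OK,		/* success */
	CHARGE_RETRY,		/* need to retry, but retry is not bad */
	CHARGE_NOMEM,		/* we can't do more: return -ENOMEM */
	CHARGE_NEED_BREAK,	/* big-size allocation failure */
};

#define MODEL_PAGE_SIZE 4096L

/*
 * Model of the CHARGE_NEED_BREAK branch in the retry loop: a batched
 * charge (page_size == PAGE_SIZE, csize larger) falls back to charging
 * one page at a time; a hugepage request (page_size > PAGE_SIZE) gives
 * up with NOMEM, which the caller treats as fallback, not OOM.
 */
static enum charge_result handle_need_break(long *csize, long page_size)
{
	if (page_size > MODEL_PAGE_SIZE)
		return CHARGE_NOMEM;	/* hugepage: caller falls back */
	*csize = page_size;		/* batched: retry at PAGE_SIZE */
	return CHARGE_RETRY;
}
```

So a 32-page batched charge shrinks to a single-page retry, while a 2MB hugepage charge fails outright without invoking the OOM killer.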


* [BUGFIX][PATCH 3/4] memcg: fix oom flag at THP charge
  2011-01-28  3:22 ` KAMEZAWA Hiroyuki
@ 2011-01-28  3:27   ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 66+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-28  3:27 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, hannes, nishimura, balbir


Thanks to Johannes and Daisuke for the suggestion.
=
Hugepage allocation shouldn't trigger oom.
Allocation failure is not fatal.

Original-patch-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Index: mmotm-0125/mm/memcontrol.c
===================================================================
--- mmotm-0125.orig/mm/memcontrol.c
+++ mmotm-0125/mm/memcontrol.c
@@ -2369,11 +2369,14 @@ static int mem_cgroup_charge_common(stru
 	struct page_cgroup *pc;
 	int ret;
 	int page_size = PAGE_SIZE;
+	bool oom;
 
 	if (PageTransHuge(page)) {
 		page_size <<= compound_order(page);
 		VM_BUG_ON(!PageTransHuge(page));
-	}
+		oom = false;
+	} else
+		oom = true;
 
 	pc = lookup_page_cgroup(page);
 	/* can happen at boot */
@@ -2381,7 +2384,7 @@ static int mem_cgroup_charge_common(stru
 		return 0;
 	prefetchw(pc);
 
-	ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true, page_size);
+	ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, oom, page_size);
 	if (ret || !mem)
 		return ret;
 


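The size and oom-flag choice above can be shown with a tiny user-space model. This is a sketch: PAGE_SIZE is assumed to be 4096 here, and compound_order 9 describes a 2MB THP on x86-64.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_SIZE 4096L

/* Charge size for a (possibly compound) page, as in the patch:
 * page_size <<= compound_order(page). */
static long charge_page_size(int compound_order)
{
	return MODEL_PAGE_SIZE << compound_order;
}

/* A THP charge must not trigger the OOM killer: the caller can fall
 * back to small pages, so an allocation failure is not fatal. */
static bool charge_may_oom(bool page_is_trans_huge)
{
	return !page_is_trans_huge;
}
```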


* [BUGFIX][PATCH 4/4] memcg: fix khugepaged should skip busy memcg
  2011-01-28  3:22 ` KAMEZAWA Hiroyuki
@ 2011-01-28  3:28   ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 66+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-28  3:28 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, hannes, nishimura, balbir


From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

When using khugepaged with a small memory cgroup, we see khugepaged
cause soft lockups, or processes running under the memcg hang.

That's because khugepaged tries to scan all pmds of a process
which is under a busy/small memory cgroup and tries to allocate
HUGEPAGE-sized resources.

This work is done under mmap_sem and can cause repeated memory
reclaim. This easily raises the cpu usage of khugepaged, and the
latency of the scanned process goes up. Moreover, it seems that
successfully working TransHuge pages may be split by this memory
reclaim caused by khugepaged.

This patch adds a hint for khugepaged on whether a process is under
a memory cgroup which has sufficient memory. If the memcg seems
busy, the process is skipped.

How to test:
  # mount -t cgroup cgroup /cgroup/memory -o memory
  # mkdir /cgroup/memory/A
  # echo 200M (or some similarly small value) > /cgroup/memory/A/memory.limit_in_bytes
  # echo 0 > /cgroup/memory/A/tasks
  # make -j 8 kernel

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h |    7 +++++
 mm/huge_memory.c           |   10 +++++++-
 mm/memcontrol.c            |   53 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 69 insertions(+), 1 deletion(-)

Index: mmotm-0125/mm/memcontrol.c
===================================================================
--- mmotm-0125.orig/mm/memcontrol.c
+++ mmotm-0125/mm/memcontrol.c
@@ -255,6 +255,9 @@ struct mem_cgroup {
 	/* For oom notifier event fd */
 	struct list_head oom_notify;
 
+	/* For transparent hugepage daemon */
+	unsigned long long recent_failcnt;
+
 	/*
 	 * Should we move charges of a task when a task is moved into this
 	 * mem_cgroup ? And what type of charges should we move ?
@@ -2214,6 +2217,56 @@ void mem_cgroup_split_huge_fixup(struct 
 	tail_pc->flags = head_pc->flags & ~PCGF_NOCOPY_AT_SPLIT;
 	move_unlock_page_cgroup(head_pc, &flags);
 }
+
+bool mem_cgroup_worth_try_hugepage_scan(struct mm_struct *mm)
+{
+	struct mem_cgroup *mem;
+	bool ret = true;
+	u64 recent_charge_fail;
+
+	if (mem_cgroup_disabled())
+		return true;
+
+	mem = try_get_mem_cgroup_from_mm(mm);
+
+	if (!mem)
+		return true;
+
+	if (mem_cgroup_is_root(mem))
+		goto out;
+
+	/*
+	 * At collapsing, khugepaged charges HPAGE_SIZE. When it unmaps
+	 * used ptes, the charge will be decreased.
+	 *
+	 * This requirement of an 'extra charge' at collapsing seems
+	 * redundant, but it's a safe way for now. When replacing a chunk
+	 * of pages with a hugepage, khugepaged skips pte_none() entries,
+	 * which are not charged. But since we must charge under spinlocks
+	 * such as pte_lock, we need the precharge. Check status before
+	 * doing heavy jobs and give khugepaged a chance to retire early.
+	 */
+	if (mem_cgroup_check_margin(mem) < HPAGE_SIZE)
+		ret = false;
+
+	 /*
+	  * This is an easy check. If someone other than khugepaged does
+	  * hit limit, khugepaged should avoid more pressure.
+	  */
+	recent_charge_fail = res_counter_read_u64(&mem->res, RES_FAILCNT);
+	if (ret
+	    && mem->recent_failcnt
+	    && recent_charge_fail > mem->recent_failcnt) {
+		ret = false;
+	}
+	/* +1 because this thread itself may fail one charge. */
+	if (recent_charge_fail)
+		mem->recent_failcnt = recent_charge_fail + 1;
+out:
+	css_put(&mem->css);
+	return ret;
+}
+
 #endif
 
 /**
Index: mmotm-0125/mm/huge_memory.c
===================================================================
--- mmotm-0125.orig/mm/huge_memory.c
+++ mmotm-0125/mm/huge_memory.c
@@ -2011,8 +2011,10 @@ static unsigned int khugepaged_scan_mm_s
 	down_read(&mm->mmap_sem);
 	if (unlikely(khugepaged_test_exit(mm)))
 		vma = NULL;
-	else
+	else if (mem_cgroup_worth_try_hugepage_scan(mm))
 		vma = find_vma(mm, khugepaged_scan.address);
+	else
+		vma = NULL;
 
 	progress++;
 	for (; vma; vma = vma->vm_next) {
@@ -2024,6 +2026,12 @@ static unsigned int khugepaged_scan_mm_s
 			break;
 		}
 
+		if (unlikely(!mem_cgroup_worth_try_hugepage_scan(mm))) {
+			progress++;
+			vma = NULL; /* try next mm */
+			break;
+		}
+
 		if ((!(vma->vm_flags & VM_HUGEPAGE) &&
 		     !khugepaged_always()) ||
 		    (vma->vm_flags & VM_NOHUGEPAGE)) {
Index: mmotm-0125/include/linux/memcontrol.h
===================================================================
--- mmotm-0125.orig/include/linux/memcontrol.h
+++ mmotm-0125/include/linux/memcontrol.h
@@ -148,6 +148,7 @@ u64 mem_cgroup_get_limit(struct mem_cgro
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail);
+bool mem_cgroup_worth_try_hugepage_scan(struct mm_struct *mm);
 #endif
 
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
@@ -342,6 +343,12 @@ u64 mem_cgroup_get_limit(struct mem_cgro
 static inline void mem_cgroup_split_huge_fixup(struct page *head,
 						struct page *tail)
 {
+
+}
+
+static inline bool mem_cgroup_worth_try_hugepage_scan(struct mm_struct *mm)
+{
+	return true;
 }
 
 #endif /* CONFIG_CGROUP_MEM_CONT */


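The skip heuristic described in the changelog (skip a process when its memcg has less than a hugepage of free margin, or when the charge failcnt has grown since the last scan) can be modeled in user-space C. `struct memcg_model` and the 2MB MODEL_HPAGE_SIZE are assumptions of this sketch, not the kernel structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_HPAGE_SIZE (2LL << 20)	/* assume 2MB hugepages */

struct memcg_model {
	int64_t margin;			/* free bytes under the limit */
	uint64_t failcnt;		/* charge failures so far */
	uint64_t recent_failcnt;	/* snapshot from the previous scan */
};

/*
 * Worth scanning only if a whole hugepage charge could fit and nobody
 * else has been hitting the limit since we last looked.  The +1 on
 * the snapshot accounts for the charge this scan itself may fail.
 */
static bool worth_hugepage_scan(struct memcg_model *mem)
{
	bool ret = true;

	if (mem->margin < MODEL_HPAGE_SIZE)
		ret = false;
	if (ret && mem->recent_failcnt &&
	    mem->failcnt > mem->recent_failcnt)
		ret = false;
	if (mem->failcnt)
		mem->recent_failcnt = mem->failcnt + 1;
	return ret;
}
```

A memcg with 4MB of headroom and a stable failcnt is scanned; one with 1MB of headroom, or with a failcnt that grew since the last visit, is skipped until the pressure subsides.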


* Re: [BUGFIX][PATCH 1/4] memcg: fix limit estimation at reclaim for hugepage
  2011-01-28  3:24   ` KAMEZAWA Hiroyuki
@ 2011-01-28  4:40     ` Daisuke Nishimura
  -1 siblings, 0 replies; 66+ messages in thread
From: Daisuke Nishimura @ 2011-01-28  4:40 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, hannes, balbir, Daisuke Nishimura

On Fri, 28 Jan 2011 12:24:49 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Current memory cgroup's code tends to assume page_size == PAGE_SIZE
> and arrangement for THP is not enough yet.
> 
> This is one of fixes for supporing THP. This adds
> mem_cgroup_check_margin() and checks whether there are required amount of
> free resource after memory reclaim. By this, THP page allocation
> can know whether it really succeeded or not and avoid infinite-loop
> and hangup.
> 
> Total fixes for do_charge()/reclaim memory will follow this patch.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

This patch looks good to me, but I have some nitpicks.

> ---
>  include/linux/res_counter.h |   11 +++++++++++
>  mm/memcontrol.c             |   25 ++++++++++++++++++++++++-
>  2 files changed, 35 insertions(+), 1 deletion(-)
> 
> Index: mmotm-0125/include/linux/res_counter.h
> ===================================================================
> --- mmotm-0125.orig/include/linux/res_counter.h
> +++ mmotm-0125/include/linux/res_counter.h
> @@ -182,6 +182,17 @@ static inline bool res_counter_check_und
>  	return ret;
>  }
>  
> +static inline s64 res_counter_check_margin(struct res_counter *cnt)
> +{
> +	s64 ret;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&cnt->lock, flags);
> +	ret = cnt->limit - cnt->usage;
> +	spin_unlock_irqrestore(&cnt->lock, flags);
> +	return ret;
> +}
> +
>  static inline bool res_counter_check_under_soft_limit(struct res_counter *cnt)
>  {
>  	bool ret;
> Index: mmotm-0125/mm/memcontrol.c
> ===================================================================
> --- mmotm-0125.orig/mm/memcontrol.c
> +++ mmotm-0125/mm/memcontrol.c
> @@ -1111,6 +1111,22 @@ static bool mem_cgroup_check_under_limit
>  	return false;
>  }
>  
> +static s64  mem_cgroup_check_margin(struct mem_cgroup *mem)
> +{
> +	s64 mem_margin;
> +
> +	if (do_swap_account) {
> +		s64 memsw_margin;
> +
> +		mem_margin = res_counter_check_margin(&mem->res);
> +		memsw_margin = res_counter_check_margin(&mem->memsw);
> +		if (mem_margin > memsw_margin)
> +			mem_margin = memsw_margin;
> +	} else
> +		mem_margin = res_counter_check_margin(&mem->res);
> +	return mem_margin;
> +}
> +
How about

	mem_margin = res_counter_check_margin(&mem->res);
	if (do_swap_account)
		memsw_margin = res_counter_check_margin(&mem->memsw);
	else
		memsw_margin = RESOURCE_MAX;

	return min(mem_margin, memsw_margin);

?
I think using min() makes it clearer what this function does.

>  static unsigned int get_swappiness(struct mem_cgroup *memcg)
>  {
>  	struct cgroup *cgrp = memcg->css.cgroup;
> @@ -1853,7 +1869,14 @@ static int __mem_cgroup_do_charge(struct
>  	 * Check the limit again to see if the reclaim reduced the
>  	 * current usage of the cgroup before giving up
>  	 */
> -	if (ret || mem_cgroup_check_under_limit(mem_over_limit))
> +	if (mem_cgroup_check_margin(mem_over_limit) >= csize)
> +		return CHARGE_RETRY;
> +
> +	/*
> + 	 * If the charge size is a PAGE_SIZE, it's not hopeless while
> + 	 * we can reclaim a page.
> + 	 */
> +	if (csize == PAGE_SIZE && ret)
>  		return CHARGE_RETRY;
>  
>  	/*
> 
checkpatch complains about some whitespace warnings.


Thanks,
Daisuke Nishimura.
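Nishimura's min()-based suggestion can be sketched as a stand-alone C model. This is a hypothetical illustration, not kernel code: the res_counter here is a plain struct without the kernel's spinlock, and RESOURCE_MAX is a stand-in for the kernel constant.

```c
#include <assert.h>
#include <stdint.h>

#define RESOURCE_MAX INT64_MAX  /* stand-in for the kernel's RESOURCE_MAX */

/* Simplified res_counter: no locking, just the two fields the margin
 * computation reads. */
struct res_counter {
	int64_t usage;
	int64_t limit;
};

static int64_t res_counter_check_margin(const struct res_counter *cnt)
{
	return cnt->limit - cnt->usage;
}

/* The suggested shape of mem_cgroup_check_margin(): the usable margin is
 * the smaller of the memory margin and, if swap accounting is enabled,
 * the mem+swap margin. */
static int64_t check_margin(const struct res_counter *res,
			    const struct res_counter *memsw,
			    int do_swap_account)
{
	int64_t mem_margin = res_counter_check_margin(res);
	int64_t memsw_margin = do_swap_account ?
		res_counter_check_margin(memsw) : RESOURCE_MAX;

	return mem_margin < memsw_margin ? mem_margin : memsw_margin;
}
```

Using min() this way makes the "smaller of the two limits wins" rule explicit, which is the point of the suggestion.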

* Re: [BUGFIX][PATCH 1/4] memcg: fix limit estimation at reclaim for hugepage
  2011-01-28  4:40     ` Daisuke Nishimura
@ 2011-01-28  4:49       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 66+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-28  4:49 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: linux-mm, linux-kernel, hannes, balbir

On Fri, 28 Jan 2011 13:40:19 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Fri, 28 Jan 2011 12:24:49 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > The current memory cgroup code tends to assume page_size == PAGE_SIZE,
> > and its handling of THP is not sufficient yet.
> > 
> > This is one of the fixes for supporting THP. It adds
> > mem_cgroup_check_margin(), which checks whether the required amount of
> > free resources is available after memory reclaim. With this, a THP page
> > allocation can know whether it really succeeded and avoid an infinite
> > loop and hangup.
> > 
> > The complete fixes for do_charge()/memory reclaim will follow this patch.
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> This patch looks good to me, but I have some nitpicks.
> 
> > ---
> >  include/linux/res_counter.h |   11 +++++++++++
> >  mm/memcontrol.c             |   25 ++++++++++++++++++++++++-
> >  2 files changed, 35 insertions(+), 1 deletion(-)
> > 
> > Index: mmotm-0125/include/linux/res_counter.h
> > ===================================================================
> > --- mmotm-0125.orig/include/linux/res_counter.h
> > +++ mmotm-0125/include/linux/res_counter.h
> > @@ -182,6 +182,17 @@ static inline bool res_counter_check_und
> >  	return ret;
> >  }
> >  
> > +static inline s64 res_counter_check_margin(struct res_counter *cnt)
> > +{
> > +	s64 ret;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&cnt->lock, flags);
> > +	ret = cnt->limit - cnt->usage;
> > +	spin_unlock_irqrestore(&cnt->lock, flags);
> > +	return ret;
> > +}
> > +
> >  static inline bool res_counter_check_under_soft_limit(struct res_counter *cnt)
> >  {
> >  	bool ret;
> > Index: mmotm-0125/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-0125.orig/mm/memcontrol.c
> > +++ mmotm-0125/mm/memcontrol.c
> > @@ -1111,6 +1111,22 @@ static bool mem_cgroup_check_under_limit
> >  	return false;
> >  }
> >  
> > +static s64  mem_cgroup_check_margin(struct mem_cgroup *mem)
> > +{
> > +	s64 mem_margin;
> > +
> > +	if (do_swap_account) {
> > +		s64 memsw_margin;
> > +
> > +		mem_margin = res_counter_check_margin(&mem->res);
> > +		memsw_margin = res_counter_check_margin(&mem->memsw);
> > +		if (mem_margin > memsw_margin)
> > +			mem_margin = memsw_margin;
> > +	} else
> > +		mem_margin = res_counter_check_margin(&mem->res);
> > +	return mem_margin;
> > +}
> > +
> How about
> 
> 	mem_margin = res_counter_check_margin(&mem->res);
> 	if (do_swap_account)
> 		memsw_margin = res_counter_check_margin(&mem->memsw);
> 	else
> 		memsw_margin = RESOURCE_MAX;
> 
> 	return min(mem_margin, memsw_margin);
> 
> ?
> I think using min() makes it clearer what this function does.
> 

Ok.

> >  static unsigned int get_swappiness(struct mem_cgroup *memcg)
> >  {
> >  	struct cgroup *cgrp = memcg->css.cgroup;
> > @@ -1853,7 +1869,14 @@ static int __mem_cgroup_do_charge(struct
> >  	 * Check the limit again to see if the reclaim reduced the
> >  	 * current usage of the cgroup before giving up
> >  	 */
> > -	if (ret || mem_cgroup_check_under_limit(mem_over_limit))
> > +	if (mem_cgroup_check_margin(mem_over_limit) >= csize)
> > +		return CHARGE_RETRY;
> > +
> > +	/*
> > + 	 * If the charge size is a PAGE_SIZE, it's not hopeless while
> > + 	 * we can reclaim a page.
> > + 	 */
> > +	if (csize == PAGE_SIZE && ret)
> >  		return CHARGE_RETRY;
> >  
> >  	/*
> > 
> checkpatch complains about some whitespace warnings.
> 
will fix soon.

-Kame



* Re: [BUGFIX][PATCH 1/4] memcg: fix limit estimation at reclaim for hugepage
  2011-01-28  4:40     ` Daisuke Nishimura
@ 2011-01-28  4:58       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 66+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-28  4:58 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: linux-mm, linux-kernel, hannes, balbir

How about this?
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

The current memory cgroup code tends to assume page_size == PAGE_SIZE,
and its handling of THP is not sufficient yet.

This is one of the fixes for supporting THP. It adds
mem_cgroup_check_margin(), which checks whether the required amount of
free resources is available after memory reclaim. With this, a THP page
allocation can know whether it really succeeded and avoid an infinite
loop and hangup.

The complete fixes for do_charge()/memory reclaim will follow this patch.

Changelog v1->v2:
 - style fix.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/res_counter.h |   11 +++++++++++
 mm/memcontrol.c             |   22 +++++++++++++++++++++-
 2 files changed, 32 insertions(+), 1 deletion(-)

Index: mmotm-0125/include/linux/res_counter.h
===================================================================
--- mmotm-0125.orig/include/linux/res_counter.h
+++ mmotm-0125/include/linux/res_counter.h
@@ -182,6 +182,17 @@ static inline bool res_counter_check_und
 	return ret;
 }
 
+static inline s64 res_counter_check_margin(struct res_counter *cnt)
+{
+	s64 ret;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	ret = cnt->limit - cnt->usage;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return ret;
+}
+
 static inline bool res_counter_check_under_soft_limit(struct res_counter *cnt)
 {
 	bool ret;
Index: mmotm-0125/mm/memcontrol.c
===================================================================
--- mmotm-0125.orig/mm/memcontrol.c
+++ mmotm-0125/mm/memcontrol.c
@@ -1111,6 +1111,19 @@ static bool mem_cgroup_check_under_limit
 	return false;
 }
 
+static s64 mem_cgroup_check_margin(struct mem_cgroup *mem)
+{
+	s64 mem_margin = res_counter_check_margin(&mem->res);
+	s64 memsw_margin;
+
+	if (do_swap_account)
+		memsw_margin = res_counter_check_margin(&mem->memsw);
+	else
+		memsw_margin = RESOURCE_MAX;
+
+	return min(mem_margin, memsw_margin);
+}
+
 static unsigned int get_swappiness(struct mem_cgroup *memcg)
 {
 	struct cgroup *cgrp = memcg->css.cgroup;
@@ -1853,7 +1866,14 @@ static int __mem_cgroup_do_charge(struct
 	 * Check the limit again to see if the reclaim reduced the
 	 * current usage of the cgroup before giving up
 	 */
-	if (ret || mem_cgroup_check_under_limit(mem_over_limit))
+	if (mem_cgroup_check_margin(mem_over_limit) >= csize)
+		return CHARGE_RETRY;
+
+	/*
+	 * If the charge size is a PAGE_SIZE, it's not hopeless while
+	 * we can reclaim a page.
+	 */
+	if (csize == PAGE_SIZE && ret)
 		return CHARGE_RETRY;
 
 	/*


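The retry decision in the __mem_cgroup_do_charge() hunk above can be modeled outside the kernel as follows. This is a hypothetical, self-contained sketch: `margin` stands for the result of mem_cgroup_check_margin(), and `reclaimed` for the non-zero return value of the reclaim call that the diff refers to as `ret`.

```c
#include <assert.h>
#include <stdint.h>

enum charge_result { CHARGE_RETRY, CHARGE_NOMEM };

/* Decision made after one reclaim pass:
 *  - retry if the margin now covers the whole charge;
 *  - for a single-page charge, retry as long as reclaim made progress;
 *  - otherwise (a huge charge that still doesn't fit), give up early so
 *    the caller can fall back to small pages instead of looping. */
static enum charge_result charge_retry_decision(int64_t margin, long csize,
						long page_size, int reclaimed)
{
	if (margin >= csize)
		return CHARGE_RETRY;
	if (csize == page_size && reclaimed)
		return CHARGE_RETRY;
	return CHARGE_NOMEM;
}
```

The early CHARGE_NOMEM for oversized charges is what breaks the hang described in the cover letter: khugepaged no longer loops on reclaim that can never free a hugepage-sized margin.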

* Re: [BUGFIX][PATCH 1/4] memcg: fix limit estimation at reclaim for hugepage
  2011-01-28  4:58       ` KAMEZAWA Hiroyuki
@ 2011-01-28  5:36         ` Daisuke Nishimura
  -1 siblings, 0 replies; 66+ messages in thread
From: Daisuke Nishimura @ 2011-01-28  5:36 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, hannes, balbir, Daisuke Nishimura

On Fri, 28 Jan 2011 13:58:39 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> How about this?
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> The current memory cgroup code tends to assume page_size == PAGE_SIZE,
> and its handling of THP is not sufficient yet.
> 
> This is one of the fixes for supporting THP. It adds
> mem_cgroup_check_margin(), which checks whether the required amount of
> free resources is available after memory reclaim. With this, a THP page
> allocation can know whether it really succeeded and avoid an infinite
> loop and hangup.
> 
> The complete fixes for do_charge()/memory reclaim will follow this patch.
> 
> Changelog v1->v2:
>  - style fix.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

Thanks,
Daisuke Nishimura.


* Re: [BUGFIX][PATCH 2/4] memcg: fix charge path for THP and allow early retirement
  2011-01-28  3:26   ` KAMEZAWA Hiroyuki
@ 2011-01-28  5:37     ` Daisuke Nishimura
  -1 siblings, 0 replies; 66+ messages in thread
From: Daisuke Nishimura @ 2011-01-28  5:37 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, hannes, balbir, Daisuke Nishimura

On Fri, 28 Jan 2011 12:26:08 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> When THP is used, a hugepage-sized charge can happen. It's not handled
> correctly in mem_cgroup_do_charge(). For example, THP can fall back
> to small page allocation when HUGEPAGE allocation seems difficult
> or busy, but the memory cgroup doesn't understand this and continues to
> try HUGEPAGE charging. And the worst thing is that the memory cgroup
> believes 'memory reclaim succeeded' if limit - usage > PAGE_SIZE.
> 
> Because of this, khugepaged etc. can go into an infinite reclaim loop
> if tasks in the memcg are busy.
> 
> After this patch:
>  - Hugepage allocation will fail if the first trial of page reclaim fails.
> 
> Changelog:
>  - made the changes small; removed the renaming code.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

Thanks,
Daisuke Nishimura.


* Re: [BUGFIX][PATCH 3/4] memcg: fix oom flag at THP charge
  2011-01-28  3:27   ` KAMEZAWA Hiroyuki
@ 2011-01-28  5:39     ` Daisuke Nishimura
  -1 siblings, 0 replies; 66+ messages in thread
From: Daisuke Nishimura @ 2011-01-28  5:39 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, hannes, balbir, Daisuke Nishimura

On Fri, 28 Jan 2011 12:27:29 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> 
> Thanks to Johannes and Daisuke for the suggestion.
> =
> Hugepage allocation shouldn't trigger oom.
> Allocation failure is not fatal.
> 
> Original-patch-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

IIUC, this patch isn't necessary, because we don't go into oom
in the "page_size > PAGE_SIZE" case after [2/4].
But I think it's not so bad to show explicitly that we don't cause oom
on a THP charge.

Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

Thanks,
Daisuke Nishimura.
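The point under discussion, never invoking the OOM killer for a hugepage charge, amounts to gating the oom path on the charge size. A hypothetical sketch follows; the function name and parameters are assumptions for illustration, not the actual patch's interface.

```c
#include <assert.h>
#include <stdbool.h>

/* Decide whether a failed charge may invoke the OOM killer.
 * A hugepage-sized charge never should: its failure is not fatal,
 * because the caller falls back to ordinary pages. */
static bool may_oom(long csize, long page_size, bool oom_allowed)
{
	return oom_allowed && csize == page_size;
}
```

This matches the review comment above: even if [2/4] already avoids the oom path for large charges, encoding the rule explicitly documents the intent.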


* Re: [BUGFIX][PATCH 3/4] memcg: fix oom flag at THP charge
  2011-01-28  5:39     ` Daisuke Nishimura
@ 2011-01-28  5:50       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 66+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-28  5:50 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: linux-mm, linux-kernel, hannes, balbir

On Fri, 28 Jan 2011 14:39:12 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> On Fri, 28 Jan 2011 12:27:29 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > 
> > Thanks to Johannes and Daisuke for the suggestion.
> > =
> > Hugepage allocation shouldn't trigger oom.
> > Allocation failure is not fatal.
> > 
> > Original-patch-by: Johannes Weiner <hannes@cmpxchg.org>
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> IIUC, this patch isn't necessary, because we don't go into oom
> in the "page_size > PAGE_SIZE" case after [2/4].
> But I think it's not so bad to show explicitly that we don't cause oom
> on a THP charge.
> 
> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> 

Thank you for clarification.

-Kame




* Re: [BUGFIX][PATCH 1/4] memcg: fix limit estimation at reclaim for hugepage
  2011-01-28  3:24   ` KAMEZAWA Hiroyuki
@ 2011-01-28  7:52     ` Johannes Weiner
  -1 siblings, 0 replies; 66+ messages in thread
From: Johannes Weiner @ 2011-01-28  7:52 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, nishimura, balbir

On Fri, Jan 28, 2011 at 12:24:49PM +0900, KAMEZAWA Hiroyuki wrote:
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> The current memory cgroup code tends to assume page_size == PAGE_SIZE,
> and its handling of THP is not sufficient yet.
> 
> This is one of the fixes for supporting THP. It adds
> mem_cgroup_check_margin(), which checks whether the required amount of
> free resources is available after memory reclaim. With this, a THP page
> allocation can know whether it really succeeded and avoid an infinite
> loop and hangup.
> 
> The complete fixes for do_charge()/memory reclaim will follow this patch.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/res_counter.h |   11 +++++++++++
>  mm/memcontrol.c             |   25 ++++++++++++++++++++++++-
>  2 files changed, 35 insertions(+), 1 deletion(-)
> 
> Index: mmotm-0125/include/linux/res_counter.h
> ===================================================================
> --- mmotm-0125.orig/include/linux/res_counter.h
> +++ mmotm-0125/include/linux/res_counter.h
> @@ -182,6 +182,17 @@ static inline bool res_counter_check_und
>  	return ret;
>  }
>  
> +static inline s64 res_counter_check_margin(struct res_counter *cnt)
> +{
> +	s64 ret;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&cnt->lock, flags);
> +	ret = cnt->limit - cnt->usage;
> +	spin_unlock_irqrestore(&cnt->lock, flags);
> +	return ret;
> +}

This function does not check anything.  You could name it
res_counter_get_margin() e.g.  But if you do that, I will complain
that it's asymmetric to res_counter_check_under_limit().  And the
result will be pretty close to my version...

> @@ -1853,7 +1869,14 @@ static int __mem_cgroup_do_charge(struct
>  	 * Check the limit again to see if the reclaim reduced the
>  	 * current usage of the cgroup before giving up
>  	 */
> -	if (ret || mem_cgroup_check_under_limit(mem_over_limit))
> +	if (mem_cgroup_check_margin(mem_over_limit) >= csize)
> +		return CHARGE_RETRY;
> +
> +	/*
> + 	 * If the charge size is a PAGE_SIZE, it's not hopeless while
> + 	 * we can reclaim a page.
> + 	 */
> +	if (csize == PAGE_SIZE && ret)
>  		return CHARGE_RETRY;

That makes sense.

^ permalink raw reply	[flat|nested] 66+ messages in thread


* Re: [BUGFIX][PATCH 2/4] memcg: fix charge path for THP and allow early retirement
  2011-01-28  3:26   ` KAMEZAWA Hiroyuki
@ 2011-01-28  7:57     ` Johannes Weiner
  -1 siblings, 0 replies; 66+ messages in thread
From: Johannes Weiner @ 2011-01-28  7:57 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, nishimura, balbir

On Fri, Jan 28, 2011 at 12:26:08PM +0900, KAMEZAWA Hiroyuki wrote:
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> When THP is used, a hugepage-sized charge can happen. It's not handled
> correctly in mem_cgroup_do_charge(). For example, THP can fall back
> to small page allocation when HUGEPAGE allocation seems difficult
> or busy, but the memory cgroup doesn't understand this and continues
> to try HUGEPAGE charging. Worst of all, the memory cgroup believes
> 'memory reclaim succeeded' if limit - usage > PAGE_SIZE.
> 
> Because of this, khugepaged etc. can go into an infinite reclaim loop
> if tasks in the memcg are busy.
> 
> After this patch 
>  - Hugepage allocation will fail if 1st trial of page reclaim fails.
> 
> Changelog:
>  - make changes small. removed renaming codes.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  mm/memcontrol.c |   28 ++++++++++++++++++++++++----
>  1 file changed, 24 insertions(+), 4 deletions(-)

Was there something wrong with my one-line fix?

Really, there is no way to make this a beautiful fix.  The way this
function is split up makes no sense, and the constant addition of more
and more flags just to correctly communicate with _one callsite_
should be an obvious hint.

^ permalink raw reply	[flat|nested] 66+ messages in thread


* Re: [BUGFIX][PATCH 3/4] memcg: fix oom flag at THP charge
  2011-01-28  3:27   ` KAMEZAWA Hiroyuki
@ 2011-01-28  8:02     ` Johannes Weiner
  -1 siblings, 0 replies; 66+ messages in thread
From: Johannes Weiner @ 2011-01-28  8:02 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, nishimura, balbir

On Fri, Jan 28, 2011 at 12:27:29PM +0900, KAMEZAWA Hiroyuki wrote:
> 
> Thanks to Johannes and Daisuke for the suggestion.
> =
> Hugepage allocation shouldn't trigger oom.
> Allocation failure is not fatal.
> 
> Original-patch-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  mm/memcontrol.c |    4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> Index: mmotm-0125/mm/memcontrol.c
> ===================================================================
> --- mmotm-0125.orig/mm/memcontrol.c
> +++ mmotm-0125/mm/memcontrol.c
> @@ -2369,11 +2369,14 @@ static int mem_cgroup_charge_common(stru
>  	struct page_cgroup *pc;
>  	int ret;
>  	int page_size = PAGE_SIZE;
> +	bool oom;
>  
>  	if (PageTransHuge(page)) {
>  		page_size <<= compound_order(page);
>  		VM_BUG_ON(!PageTransHuge(page));
> -	}
> +		oom = false;
> +	} else
> +		oom = true;

That needs a comment.  You can take the one from my patch if you like.

^ permalink raw reply	[flat|nested] 66+ messages in thread


* Re: [BUGFIX][PATCH 1/4] memcg: fix limit estimation at reclaim for hugepage
  2011-01-28  4:58       ` KAMEZAWA Hiroyuki
@ 2011-01-28  8:04         ` Minchan Kim
  -1 siblings, 0 replies; 66+ messages in thread
From: Minchan Kim @ 2011-01-28  8:04 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, linux-mm, linux-kernel, hannes, balbir

Hi Kame,

On Fri, Jan 28, 2011 at 1:58 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> How about this ?
> ==
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> > The current memory cgroup code tends to assume page_size == PAGE_SIZE,
> > and its handling of THP is not sufficient yet.
> >
> > This is one of the fixes for supporting THP. It adds
> > mem_cgroup_check_margin() and checks whether the required amount of
> > free resource is available after memory reclaim. With this, THP page
> > allocation can know whether it really succeeded and avoid an infinite
> > loop and hangup.
> >
> > Full fixes for do_charge()/memory reclaim will follow this patch.

If this patch is only related to THP, I think the patch order isn't good.
Before applying [2/4], huge page allocation will retry without
reclaiming and loop forever because of the part below.

@@ -1854,9 +1858,6 @@ static int __mem_cgroup_do_charge(struct
 	} else
 		mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);

-	if (csize > PAGE_SIZE) /* change csize and retry */
-		return CHARGE_RETRY;
-
 	if (!(gfp_mask & __GFP_WAIT))
 		return CHARGE_WOULDBLOCK;

Am I missing something?

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 66+ messages in thread


* Re: [BUGFIX][PATCH 1/4] memcg: fix limit estimation at reclaim for hugepage
  2011-01-28  7:52     ` Johannes Weiner
@ 2011-01-28  8:06       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 66+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-28  8:06 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-mm, linux-kernel, nishimura, balbir

On Fri, 28 Jan 2011 08:52:15 +0100
Johannes Weiner <hannes@cmpxchg.org> wrote:
 
> > @@ -1853,7 +1869,14 @@ static int __mem_cgroup_do_charge(struct
> >  	 * Check the limit again to see if the reclaim reduced the
> >  	 * current usage of the cgroup before giving up
> >  	 */
> > -	if (ret || mem_cgroup_check_under_limit(mem_over_limit))
> > +	if (mem_cgroup_check_margin(mem_over_limit) >= csize)
> > +		return CHARGE_RETRY;
> > +
> > +	/*
> > + 	 * If the charge size is a PAGE_SIZE, it's not hopeless while
> > + 	 * we can reclaim a page.
> > + 	 */
> > +	if (csize == PAGE_SIZE && ret)
> >  		return CHARGE_RETRY;
> 
> That makes sense.
> 
here.

It's okay to add your Signed-off-by because this one is very similar to yours.

==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

The current memory cgroup code tends to assume page_size == PAGE_SIZE,
and its handling of THP is not sufficient yet.

This is one of the fixes for supporting THP. It adds
mem_cgroup_get_margin() and checks whether the required amount of
free resource is available after memory reclaim. With this, THP page
allocation can know whether it really succeeded and avoid an infinite
loop and hangup.

Full fixes for do_charge()/memory reclaim will follow this patch.

Changelog:
 - rename check_margin -> get_margin.
 - style fix.


Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/res_counter.h |   11 +++++++++++
 mm/memcontrol.c             |   22 +++++++++++++++++++++-
 2 files changed, 32 insertions(+), 1 deletion(-)

Index: mmotm-0125/include/linux/res_counter.h
===================================================================
--- mmotm-0125.orig/include/linux/res_counter.h
+++ mmotm-0125/include/linux/res_counter.h
@@ -182,6 +182,17 @@ static inline bool res_counter_check_und
 	return ret;
 }
 
+static inline s64 res_counter_get_margin(struct res_counter *cnt)
+{
+	s64 ret;
+	unsigned long flags;
+
+	spin_lock_irqsave(&cnt->lock, flags);
+	ret = cnt->limit - cnt->usage;
+	spin_unlock_irqrestore(&cnt->lock, flags);
+	return ret;
+}
+
 static inline bool res_counter_check_under_soft_limit(struct res_counter *cnt)
 {
 	bool ret;
Index: mmotm-0125/mm/memcontrol.c
===================================================================
--- mmotm-0125.orig/mm/memcontrol.c
+++ mmotm-0125/mm/memcontrol.c
@@ -1111,6 +1111,19 @@ static bool mem_cgroup_check_under_limit
 	return false;
 }
 
+static s64 mem_cgroup_get_margin(struct mem_cgroup *mem)
+{
+	s64 mem_margin = res_counter_get_margin(&mem->res);
+	s64 memsw_margin;
+
+	if (do_swap_account)
+		memsw_margin = res_counter_get_margin(&mem->memsw);
+	else
+		memsw_margin = RESOURCE_MAX;
+
+	return min(mem_margin, memsw_margin);
+}
+
 static unsigned int get_swappiness(struct mem_cgroup *memcg)
 {
 	struct cgroup *cgrp = memcg->css.cgroup;
@@ -1853,7 +1866,14 @@ static int __mem_cgroup_do_charge(struct
 	 * Check the limit again to see if the reclaim reduced the
 	 * current usage of the cgroup before giving up
 	 */
-	if (ret || mem_cgroup_check_under_limit(mem_over_limit))
+	if (mem_cgroup_get_margin(mem_over_limit) >= csize)
+		return CHARGE_RETRY;
+
+	/*
+	 * If the charge size is a PAGE_SIZE, it's not hopeless while
+	 * we can reclaim a page.
+	 */
+	if (csize == PAGE_SIZE && ret)
 		return CHARGE_RETRY;
 
 	/*


^ permalink raw reply	[flat|nested] 66+ messages in thread


* Re: [BUGFIX][PATCH 2/4] memcg: fix charge path for THP and allow early retirement
  2011-01-28  7:57     ` Johannes Weiner
@ 2011-01-28  8:14       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 66+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-28  8:14 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-mm, linux-kernel, nishimura, balbir

On Fri, 28 Jan 2011 08:57:24 +0100
Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Fri, Jan 28, 2011 at 12:26:08PM +0900, KAMEZAWA Hiroyuki wrote:
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > When THP is used, a hugepage-sized charge can happen. It's not handled
> > correctly in mem_cgroup_do_charge(). For example, THP can fall back
> > to small page allocation when HUGEPAGE allocation seems difficult
> > or busy, but the memory cgroup doesn't understand this and continues
> > to try HUGEPAGE charging. Worst of all, the memory cgroup believes
> > 'memory reclaim succeeded' if limit - usage > PAGE_SIZE.
> > 
> > Because of this, khugepaged etc. can go into an infinite reclaim loop
> > if tasks in the memcg are busy.
> > 
> > After this patch 
> >  - Hugepage allocation will fail if 1st trial of page reclaim fails.
> > 
> > Changelog:
> >  - make changes small. removed renaming codes.
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  mm/memcontrol.c |   28 ++++++++++++++++++++++++----
> >  1 file changed, 24 insertions(+), 4 deletions(-)
> 
> Was there something wrong with my oneline fix?
> 
I thought your patch was against RHEL6.

> Really, there is no way to make this a beautiful fix.  The way this
> function is split up makes no sense, and the constant addition of more
> and more flags just to correctly communicate with _one callsite_
> should be an obvious hint.
> 

Your version has to depend on the oom_check flag to work correctly.
I think that's complicated.

Thanks,
-Kame




^ permalink raw reply	[flat|nested] 66+ messages in thread


* Re: [BUGFIX][PATCH 1/4] memcg: fix limit estimation at reclaim for hugepage
  2011-01-28  8:04         ` Minchan Kim
@ 2011-01-28  8:17           ` Johannes Weiner
  -1 siblings, 0 replies; 66+ messages in thread
From: Johannes Weiner @ 2011-01-28  8:17 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, linux-mm, linux-kernel, balbir

On Fri, Jan 28, 2011 at 05:04:16PM +0900, Minchan Kim wrote:
> Hi Kame,
> 
> On Fri, Jan 28, 2011 at 1:58 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > How about this ?
> > ==
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >
> > The current memory cgroup code tends to assume page_size == PAGE_SIZE,
> > and its handling of THP is not sufficient yet.
> >
> > This is one of the fixes for supporting THP. It adds
> > mem_cgroup_check_margin() and checks whether the required amount of
> > free resource is available after memory reclaim. With this, THP page
> > allocation can know whether it really succeeded and avoid an infinite
> > loop and hangup.
> >
> > Full fixes for do_charge()/memory reclaim will follow this patch.
> 
> If this patch is only related to THP, I think patch order isn't good.
> Before applying [2/4], huge page allocation will retry without
> reclaiming and loop forever by below part.
> 
> @@ -1854,9 +1858,6 @@ static int __mem_cgroup_do_charge(struct
>  	} else
>  		mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
> 
> -	if (csize > PAGE_SIZE) /* change csize and retry */
> -		return CHARGE_RETRY;
> -
>  	if (!(gfp_mask & __GFP_WAIT))
>  		return CHARGE_WOULDBLOCK;
> 
> Am I missing something?

No, you are correct.  But I am not sure the order really matters in
theory: you have two endless loops that need independent fixing.

This order, however, is probably more practical during bisection.
When you fix the hidden problem before you uncover this code by fixing
the other problem, at least during bisection you will only hit one of
the two bugs :-)

^ permalink raw reply	[flat|nested] 66+ messages in thread


* Re: [BUGFIX][PATCH 4/4] memcg: fix khugepaged should skip busy memcg
  2011-01-28  3:28   ` KAMEZAWA Hiroyuki
@ 2011-01-28  8:20     ` Daisuke Nishimura
  -1 siblings, 0 replies; 66+ messages in thread
From: Daisuke Nishimura @ 2011-01-28  8:20 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-mm, linux-kernel, hannes, balbir, Daisuke Nishimura

On Fri, 28 Jan 2011 12:28:32 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> 
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> When using khugepaged with a small memory cgroup, we see khugepaged
> cause soft lockups, or a running process under the memcg hang.
> 
> That's because khugepaged tries to scan all pmds of a process
> which is under a busy/small memory cgroup and tries to allocate
> a HUGEPAGE-sized resource.
> 
> This work is done under mmap_sem and can cause memory reclaim
> repeatedly. This easily raises the cpu usage of khugepaged, and the
> latency of the scanned process goes up. Moreover, successfully
> created TransHuge pages may be split by this memory reclaim caused
> by khugepaged.
> 
> This patch adds a hint for khugepaged about whether a process is
> under a memory cgroup that has sufficient memory. If the memcg
> seems busy, the process is skipped.
> 
> How to test:
>   # mount -t cgroup cgroup /cgroup/memory -o memory
>   # mkdir /cgroup/memory/A
>   # echo 200M (or some small) > /cgroup/memory/A/memory.limit_in_bytes
>   # echo 0 > /cgroup/memory/A/tasks
>   # make -j 8 kernel
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/memcontrol.h |    7 +++++
>  mm/huge_memory.c           |   10 +++++++-
>  mm/memcontrol.c            |   53 +++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 69 insertions(+), 1 deletion(-)
> 
> Index: mmotm-0125/mm/memcontrol.c
> ===================================================================
> --- mmotm-0125.orig/mm/memcontrol.c
> +++ mmotm-0125/mm/memcontrol.c
> @@ -255,6 +255,9 @@ struct mem_cgroup {
>  	/* For oom notifier event fd */
>  	struct list_head oom_notify;
>  
> +	/* For transparent hugepage daemon */
> +	unsigned long long recent_failcnt;
> +
>  	/*
>  	 * Should we move charges of a task when a task is moved into this
>  	 * mem_cgroup ? And what type of charges should we move ?
> @@ -2214,6 +2217,56 @@ void mem_cgroup_split_huge_fixup(struct 
>  	tail_pc->flags = head_pc->flags & ~PCGF_NOCOPY_AT_SPLIT;
>  	move_unlock_page_cgroup(head_pc, &flags);
>  }
> +
> +bool mem_cgroup_worth_try_hugepage_scan(struct mm_struct *mm)
> +{
> +	struct mem_cgroup *mem;
> +	bool ret = true;
> +	u64 recent_charge_fail;
> +
> +	if (mem_cgroup_disabled())
> +		return true;
> +
> +	mem = try_get_mem_cgroup_from_mm(mm);
> +
> +	if (!mem)
> +		return true;
> +
> +	if (mem_cgroup_is_root(mem))
> +		goto out;
> +
> +	/*
> +	 * At collapsing, khugepaged charges HPAGE_SIZE. When it unmaps
> +	 * the ptes in use, the charge will be decreased.
> +	 *
> +	 * This requirement of an 'extra charge' at collapsing seems
> +	 * redundant, but it is the safe way for now. For example, when
> +	 * replacing a chunk of pages with a hugepage, khugepaged skips
> +	 * pte_none() entries, which are not charged. But because we must
> +	 * charge under spinlocks such as pte_lock, we need the precharge.
> +	 * Check the status before doing heavy jobs and give khugepaged a
> +	 * chance to retire early.
> +	 */
> +	if (mem_cgroup_check_margin(mem) >= HPAGE_SIZE)
I'm sorry if I misunderstand, shouldn't it be "<" ?

Thanks,
Daisuke Nishimura.

> +		ret = false;
> +
> +	 /*
> +	  * This is an easy check. If someone other than khugepaged does
> +	  * hit limit, khugepaged should avoid more pressure.
> +	  */
> +	recent_charge_fail = res_counter_read_u64(&mem->res, RES_FAILCNT);
> +	if (ret
> +	    && mem->recent_failcnt
> +            && recent_charge_fail > mem->recent_failcnt) {
> +		ret = false;
> +	}
> +	/* because this thread will fail charge by itself +1.*/
> +	if (recent_charge_fail)
> +		mem->recent_failcnt = recent_charge_fail + 1;
> +out:
> +	css_put(&mem->css);
> +	return ret;
> +}
> +
>  #endif
>  
>  /**
> Index: mmotm-0125/mm/huge_memory.c
> ===================================================================
> --- mmotm-0125.orig/mm/huge_memory.c
> +++ mmotm-0125/mm/huge_memory.c
> @@ -2011,8 +2011,10 @@ static unsigned int khugepaged_scan_mm_s
>  	down_read(&mm->mmap_sem);
>  	if (unlikely(khugepaged_test_exit(mm)))
>  		vma = NULL;
> -	else
> +	else if (mem_cgroup_worth_try_hugepage_scan(mm))
>  		vma = find_vma(mm, khugepaged_scan.address);
> +	else
> +		vma = NULL;
>  
>  	progress++;
>  	for (; vma; vma = vma->vm_next) {
> @@ -2024,6 +2026,12 @@ static unsigned int khugepaged_scan_mm_s
>  			break;
>  		}
>  
> +		if (unlikely(!mem_cgroup_worth_try_hugepage_scan(mm))) {
> +			progress++;
> +			vma = NULL; /* try next mm */
> +			break;
> +		}
> +
>  		if ((!(vma->vm_flags & VM_HUGEPAGE) &&
>  		     !khugepaged_always()) ||
>  		    (vma->vm_flags & VM_NOHUGEPAGE)) {
> Index: mmotm-0125/include/linux/memcontrol.h
> ===================================================================
> --- mmotm-0125.orig/include/linux/memcontrol.h
> +++ mmotm-0125/include/linux/memcontrol.h
> @@ -148,6 +148,7 @@ u64 mem_cgroup_get_limit(struct mem_cgro
>  
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail);
> +bool mem_cgroup_worth_try_hugepage_scan(struct mm_struct *mm);
>  #endif
>  
>  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> @@ -342,6 +343,12 @@ u64 mem_cgroup_get_limit(struct mem_cgro
>  static inline void mem_cgroup_split_huge_fixup(struct page *head,
>  						struct page *tail)
>  {
> +
> +}
> +
> +static inline bool mem_cgroup_worth_try_hugepage_scan(struct mm_struct *mm)
> +{
> +	return true;
>  }
>  
>  #endif /* CONFIG_CGROUP_MEM_CONT */
> 

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [BUGFIX][PATCH 3/4] mecg: fix oom flag at THP charge
  2011-01-28  8:02     ` Johannes Weiner
@ 2011-01-28  8:21       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 66+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-28  8:21 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-mm, linux-kernel, nishimura, balbir

On Fri, 28 Jan 2011 09:02:13 +0100
Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Fri, Jan 28, 2011 at 12:27:29PM +0900, KAMEZAWA Hiroyuki wrote:
> > 
> > Thanks to Johanns and Daisuke for suggestion.
> > =
> > Hugepage allocation shouldn't trigger oom.
> > Allocation failure is not fatal.
> > 
> > Original-patch-by: Johannes Weiner <hannes@cmpxchg.org>
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > ---
> >  mm/memcontrol.c |    4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > Index: mmotm-0125/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-0125.orig/mm/memcontrol.c
> > +++ mmotm-0125/mm/memcontrol.c
> > @@ -2369,11 +2369,14 @@ static int mem_cgroup_charge_common(stru
> >  	struct page_cgroup *pc;
> >  	int ret;
> >  	int page_size = PAGE_SIZE;
> > +	bool oom;
> >  
> >  	if (PageTransHuge(page)) {
> >  		page_size <<= compound_order(page);
> >  		VM_BUG_ON(!PageTransHuge(page));
> > -	}
> > +		oom = false;
> > +	} else
> > +		oom = true;
> 
> That needs a comment.  You can take the one from my patch if you like.
> 

How about this ?
==
Hugepage allocation shouldn't trigger oom.
Allocation failure is not fatal.

Changelog:
 - added comments.

Original-patch-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

Index: mmotm-0125/mm/memcontrol.c
===================================================================
--- mmotm-0125.orig/mm/memcontrol.c
+++ mmotm-0125/mm/memcontrol.c
@@ -2366,11 +2366,18 @@ static int mem_cgroup_charge_common(stru
 	struct page_cgroup *pc;
 	int ret;
 	int page_size = PAGE_SIZE;
+	bool oom;
 
 	if (PageTransHuge(page)) {
 		page_size <<= compound_order(page);
 		VM_BUG_ON(!PageTransHuge(page));
-	}
+		/*
+		 * Hugepage allocation will fall back to small pages even
+		 * if this charge fails, so don't trigger the oom killer.
+		 */
+		oom = false;
+	} else
+		oom = true;
 
 	pc = lookup_page_cgroup(page);
 	/* can happen at boot */
@@ -2378,7 +2385,7 @@ static int mem_cgroup_charge_common(stru
 		return 0;
 	prefetchw(pc);
 
-	ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, true, page_size);
+	ret = __mem_cgroup_try_charge(mm, gfp_mask, &mem, oom, page_size);
 	if (ret || !mem)
 		return ret;
 


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [BUGFIX][PATCH 1/4] memcg: fix limit estimation at reclaim for hugepage
  2011-01-28  8:04         ` Minchan Kim
@ 2011-01-28  8:24           ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 66+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-28  8:24 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Daisuke Nishimura, linux-mm, linux-kernel, hannes, balbir

On Fri, 28 Jan 2011 17:04:16 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:

> Hi Kame,
> 
> On Fri, Jan 28, 2011 at 1:58 PM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > How about this ?
> > ==
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >
> > The current memory cgroup code tends to assume page_size == PAGE_SIZE,
> > and its arrangements for THP are not complete yet.
> >
> > This is one of the fixes for supporting THP. It adds
> > mem_cgroup_check_margin() and checks whether the required amount of
> > free resource is available after memory reclaim. By this, THP page
> > allocation can know whether it really succeeded and avoid an
> > infinite loop and hangup.
> >
> > Total fixes for do_charge()/reclaim memory will follow this patch.
> 
> If this patch is only related to THP, I think the patch order isn't good.
> Before applying [2/4], huge page allocation will retry without
> reclaiming and loop forever due to the part below.
> 
> @@ -1854,9 +1858,6 @@ static int __mem_cgroup_do_charge(struct
>  	} else
>  		mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
> 
> -	if (csize > PAGE_SIZE) /* change csize and retry */
> -		return CHARGE_RETRY;
> -
>  	if (!(gfp_mask & __GFP_WAIT))
>  		return CHARGE_WOULDBLOCK;
> 
> Am I missing something?
> 

You're right. But
 - This patch order doesn't affect bisecting the bug, because
   the 2 bugs appear to be the same.
 - This patch implements a leaf function for the real fix.

So I think the patch order is not a problem here.

Thank you for pointing it out.

Thanks,
-Kame





^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [BUGFIX][PATCH 1/4] memcg: fix limit estimation at reclaim for hugepage
  2011-01-28  8:17           ` Johannes Weiner
@ 2011-01-28  8:25             ` Minchan Kim
  -1 siblings, 0 replies; 66+ messages in thread
From: Minchan Kim @ 2011-01-28  8:25 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, linux-mm, linux-kernel, balbir

Hi Hannes,

On Fri, Jan 28, 2011 at 5:17 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Fri, Jan 28, 2011 at 05:04:16PM +0900, Minchan Kim wrote:
>> Hi Kame,
>>
>> On Fri, Jan 28, 2011 at 1:58 PM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> > How about this ?
>> > ==
>> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> >
>> > The current memory cgroup code tends to assume page_size == PAGE_SIZE,
>> > and its arrangements for THP are not complete yet.
>> >
>> > This is one of the fixes for supporting THP. It adds
>> > mem_cgroup_check_margin() and checks whether the required amount of
>> > free resource is available after memory reclaim. By this, THP page
>> > allocation can know whether it really succeeded and avoid an
>> > infinite loop and hangup.
>> >
>> > Total fixes for do_charge()/reclaim memory will follow this patch.
>>
>> If this patch is only related to THP, I think the patch order isn't good.
>> Before applying [2/4], huge page allocation will retry without
>> reclaiming and loop forever due to the part below.
>>
>> @@ -1854,9 +1858,6 @@ static int __mem_cgroup_do_charge(struct
>>       } else
>>               mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
>>
>> -     if (csize > PAGE_SIZE) /* change csize and retry */
>> -             return CHARGE_RETRY;
>> -
>>       if (!(gfp_mask & __GFP_WAIT))
>>               return CHARGE_WOULDBLOCK;
>>
>> Am I missing something?
>
> No, you are correct.  But I am not sure the order really matters in
> theory: you have two endless loops that need independent fixing.

That's why I asked the question.
Two endless loops?

One is what I mentioned. What is the other?
Maybe this patch solves the other one,
but I can't guess it from this description alone. Stupid..

Please open my eyes.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [BUGFIX][PATCH 4/4] memcg: fix khugepaged should skip busy memcg
  2011-01-28  8:20     ` Daisuke Nishimura
@ 2011-01-28  8:30       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 66+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-28  8:30 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: linux-mm, linux-kernel, hannes, balbir

On Fri, 28 Jan 2011 17:20:22 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> > +	/*
> > +	 * At collapsing, khugepaged charges HPAGE_SIZE. When it unmaps
> > +	 * the ptes in use, the charge will be decreased.
> > +	 *
> > +	 * This requirement of an 'extra charge' at collapsing seems
> > +	 * redundant, but it is the safe way for now. For example, when
> > +	 * replacing a chunk of pages with a hugepage, khugepaged skips
> > +	 * pte_none() entries, which are not charged. But because we must
> > +	 * charge under spinlocks such as pte_lock, we need the precharge.
> > +	 * Check the status before doing heavy jobs and give khugepaged a
> > +	 * chance to retire early.
> > +	 */
> > +	if (mem_cgroup_check_margin(mem) >= HPAGE_SIZE)
> I'm sorry if I misunderstand, shouldn't it be "<" ?

Yes. This bug would make khugepaged never work on any memcg, and
then the system would never hang ;(

Thank you.

==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

When using khugepaged with a small memory cgroup, we see khugepaged
cause soft lockups, or processes running under the memcg hang.

That's because khugepaged tries to scan all pmds of a process
which is under a busy/small memory cgroup and tries to allocate
a HUGEPAGE-sized resource.

This work is done under mmap_sem and can cause memory reclaim
repeatedly. This easily raises the cpu usage of khugepaged, and the
latency of the scanned process goes up. Moreover, it seems that
successfully collapsed TransHuge pages may be split again by this
memory reclaim caused by khugepaged.

This patch adds a hint telling khugepaged whether a process is
under a memory cgroup that has sufficient memory. If the memcg
seems busy, the process is skipped.

How to test:
  # mount -t cgroup cgroup /cgroup/memory -o memory
  # mkdir /cgroup/memory/A
  # echo 200M (or some small) > /cgroup/memory/A/memory.limit_in_bytes
  # echo 0 > /cgroup/memory/A/tasks
  # make -j 8 kernel

Changelog:
 - fixed condition check bug.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h |    7 +++++
 mm/huge_memory.c           |   10 +++++++-
 mm/memcontrol.c            |   53 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 69 insertions(+), 1 deletion(-)

Index: mmotm-0125/mm/memcontrol.c
===================================================================
--- mmotm-0125.orig/mm/memcontrol.c
+++ mmotm-0125/mm/memcontrol.c
@@ -255,6 +255,9 @@ struct mem_cgroup {
 	/* For oom notifier event fd */
 	struct list_head oom_notify;
 
+	/* For transparent hugepage daemon */
+	unsigned long long recent_failcnt;
+
 	/*
 	 * Should we move charges of a task when a task is moved into this
 	 * mem_cgroup ? And what type of charges should we move ?
@@ -2211,6 +2214,56 @@ void mem_cgroup_split_huge_fixup(struct 
 	tail_pc->flags = head_pc->flags & ~PCGF_NOCOPY_AT_SPLIT;
 	move_unlock_page_cgroup(head_pc, &flags);
 }
+
+bool mem_cgroup_worth_try_hugepage_scan(struct mm_struct *mm)
+{
+	struct mem_cgroup *mem;
+	bool ret = true;
+	u64 recent_charge_fail;
+
+	if (mem_cgroup_disabled())
+		return true;
+
+	mem = try_get_mem_cgroup_from_mm(mm);
+
+	if (!mem)
+		return true;
+
+	if (mem_cgroup_is_root(mem))
+		goto out;
+
+	/*
+	 * At collapsing, khugepaged charges HPAGE_SIZE. When it unmaps
+	 * the ptes in use, the charge will be decreased.
+	 *
+	 * This requirement of an 'extra charge' at collapsing seems
+	 * redundant, but it is the safe way for now. For example, when
+	 * replacing a chunk of pages with a hugepage, khugepaged skips
+	 * pte_none() entries, which are not charged. But because we must
+	 * charge under spinlocks such as pte_lock, we need the precharge.
+	 * Check the status before doing heavy jobs and give khugepaged a
+	 * chance to retire early.
+	 */
+	if (mem_cgroup_check_margin(mem) < HPAGE_SIZE)
+		ret = false;
+
+	 /*
+	  * This is an easy check. If someone other than khugepaged does
+	  * hit limit, khugepaged should avoid more pressure.
+	  */
+	recent_charge_fail = res_counter_read_u64(&mem->res, RES_FAILCNT);
+	if (ret
+	    && mem->recent_failcnt
+            && recent_charge_fail > mem->recent_failcnt) {
+		ret = false;
+	}
+	/* because this thread will fail charge by itself +1.*/
+	if (recent_charge_fail)
+		mem->recent_failcnt = recent_charge_fail + 1;
+out:
+	css_put(&mem->css);
+	return ret;
+}
+
 #endif
 
 /**
Index: mmotm-0125/mm/huge_memory.c
===================================================================
--- mmotm-0125.orig/mm/huge_memory.c
+++ mmotm-0125/mm/huge_memory.c
@@ -2011,8 +2011,10 @@ static unsigned int khugepaged_scan_mm_s
 	down_read(&mm->mmap_sem);
 	if (unlikely(khugepaged_test_exit(mm)))
 		vma = NULL;
-	else
+	else if (mem_cgroup_worth_try_hugepage_scan(mm))
 		vma = find_vma(mm, khugepaged_scan.address);
+	else
+		vma = NULL;
 
 	progress++;
 	for (; vma; vma = vma->vm_next) {
@@ -2024,6 +2026,12 @@ static unsigned int khugepaged_scan_mm_s
 			break;
 		}
 
+		if (unlikely(!mem_cgroup_worth_try_hugepage_scan(mm))) {
+			progress++;
+			vma = NULL; /* try next mm */
+			break;
+		}
+
 		if ((!(vma->vm_flags & VM_HUGEPAGE) &&
 		     !khugepaged_always()) ||
 		    (vma->vm_flags & VM_NOHUGEPAGE)) {
Index: mmotm-0125/include/linux/memcontrol.h
===================================================================
--- mmotm-0125.orig/include/linux/memcontrol.h
+++ mmotm-0125/include/linux/memcontrol.h
@@ -148,6 +148,7 @@ u64 mem_cgroup_get_limit(struct mem_cgro
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail);
+bool mem_cgroup_worth_try_hugepage_scan(struct mm_struct *mm);
 #endif
 
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
@@ -342,6 +343,12 @@ u64 mem_cgroup_get_limit(struct mem_cgro
 static inline void mem_cgroup_split_huge_fixup(struct page *head,
 						struct page *tail)
 {
+
+}
+
+static inline bool mem_cgroup_worth_try_hugepage_scan(struct mm_struct *mm)
+{
+	return true;
 }
 
 #endif /* CONFIG_CGROUP_MEM_CONT */


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [BUGFIX][PATCH 4/4] memcg: fix khugepaged should skip busy memcg
@ 2011-01-28  8:30       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 66+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-28  8:30 UTC (permalink / raw)
  To: Daisuke Nishimura; +Cc: linux-mm, linux-kernel, hannes, balbir

On Fri, 28 Jan 2011 17:20:22 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:

> > +	/*
> > +	 * At collapsing, khugepaged charges HPAGE_SIZE. When it unmaps
> > +	 * used ptes, the charge will be decreased.
> > +	 *
> > +	 * This requirement of an 'extra charge' at collapsing seems redundant,
> > +	 * but it's the safe way for now. For example, when replacing a chunk
> > +	 * of pages with a hugepage, khugepaged skips pte_none() entries,
> > +	 * which are not charged. But since we cannot charge under spinlocks
> > +	 * such as pte_lock, we need precharge. Check the status before doing
> > +	 * heavy jobs and give khugepaged a chance to retire early.
> > +	 */
> > +	if (mem_cgroup_check_margin(mem) >= HPAGE_SIZE)
> I'm sorry if I misunderstand, shouldn't it be "<" ?

Yes. This bug would make khugepaged never work on a memcg, and so the
system would never hang ;(

Thank you.
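
The effect of the inverted comparison can be seen in a minimal userspace model (an illustrative sketch with made-up numbers; check_margin() here merely stands in for the real mem_cgroup_check_margin(), and the function names are invented):

```c
#include <assert.h>
#include <stdbool.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)   /* 2MB huge page, as on x86 */

/* Stand-in for mem_cgroup_check_margin(): free room below the limit. */
static unsigned long check_margin(unsigned long limit, unsigned long usage)
{
    return limit > usage ? limit - usage : 0;
}

/* Fixed predicate: scanning is worthwhile only while a whole huge page
 * still fits under the limit. */
static bool worth_scan_fixed(unsigned long limit, unsigned long usage)
{
    return check_margin(limit, usage) >= HPAGE_SIZE;
}

/* The first posting inverted the test: khugepaged was skipped exactly
 * when there WAS room, and kept scanning busy memcgs. */
static bool worth_scan_buggy(unsigned long limit, unsigned long usage)
{
    return check_margin(limit, usage) < HPAGE_SIZE;
}
```

With a 400MB limit and 100MB usage the fixed predicate says "scan", while the buggy one says "skip", which is why khugepaged never worked on the memcg.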

==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

When using khugepaged with a small memory cgroup, we see khugepaged
cause soft lockups, or processes running under the memcg hang.

It's because khugepaged tries to scan all pmds of a process
which is under a busy/small memory cgroup and tries to allocate
a HUGEPAGE-sized resource.

This work is done under mmap_sem and can cause memory reclaim
repeatedly. This easily raises the cpu usage of khugepaged, and the
latency of the scanned process goes up. Moreover, it seems successfully
working TransHuge pages may be split by this memory reclaim
caused by khugepaged.

This patch adds a hint for khugepaged about whether a process is
under a memory cgroup which has sufficient memory. If the memcg
seems busy, the process is skipped.

How to test:
  # mount -t cgroup none /cgroup/memory -o memory
  # mkdir /cgroup/memory/A
  # echo 200M (or some other small value) > /cgroup/memory/A/memory.limit_in_bytes
  # echo 0 > /cgroup/memory/A/tasks
  # make -j 8 kernel

Changelog:
 - fixed condition check bug.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h |    7 +++++
 mm/huge_memory.c           |   10 +++++++-
 mm/memcontrol.c            |   53 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 69 insertions(+), 1 deletion(-)

Index: mmotm-0125/mm/memcontrol.c
===================================================================
--- mmotm-0125.orig/mm/memcontrol.c
+++ mmotm-0125/mm/memcontrol.c
@@ -255,6 +255,9 @@ struct mem_cgroup {
 	/* For oom notifier event fd */
 	struct list_head oom_notify;
 
+	/* For transparent hugepage daemon */
+	unsigned long long recent_failcnt;
+
 	/*
 	 * Should we move charges of a task when a task is moved into this
 	 * mem_cgroup ? And what type of charges should we move ?
@@ -2211,6 +2214,56 @@ void mem_cgroup_split_huge_fixup(struct 
 	tail_pc->flags = head_pc->flags & ~PCGF_NOCOPY_AT_SPLIT;
 	move_unlock_page_cgroup(head_pc, &flags);
 }
+
+bool mem_cgroup_worth_try_hugepage_scan(struct mm_struct *mm)
+{
+	struct mem_cgroup *mem;
+	bool ret = true;
+	u64 recent_charge_fail;
+
+	if (mem_cgroup_disabled())
+		return true;
+
+	mem = try_get_mem_cgroup_from_mm(mm);
+
+	if (!mem)
+		return true;
+
+	if (mem_cgroup_is_root(mem))
+		goto out;
+
+	/*
+	 * At collapsing, khugepaged charges HPAGE_SIZE. When it unmaps
+	 * used ptes, the charge will be decreased.
+	 *
+	 * This requirement of an 'extra charge' at collapsing seems redundant,
+	 * but it's the safe way for now. For example, when replacing a chunk
+	 * of pages with a hugepage, khugepaged skips pte_none() entries,
+	 * which are not charged. But since we cannot charge under spinlocks
+	 * such as pte_lock, we need precharge. Check the status before doing
+	 * heavy jobs and give khugepaged a chance to retire early.
+	 */
+	if (mem_cgroup_check_margin(mem) < HPAGE_SIZE)
+		ret = false;
+
+	/*
+	 * This is an easy check. If someone other than khugepaged hits
+	 * the limit, khugepaged should avoid adding more pressure.
+	 */
+	recent_charge_fail = res_counter_read_u64(&mem->res, RES_FAILCNT);
+	if (ret
+	    && mem->recent_failcnt
+	    && recent_charge_fail > mem->recent_failcnt) {
+		ret = false;
+	}
+	/* because this thread will fail charge by itself +1.*/
+	if (recent_charge_fail)
+		mem->recent_failcnt = recent_charge_fail + 1;
+out:
+	css_put(&mem->css);
+	return ret;
+}
+
 #endif
 
 /**
Index: mmotm-0125/mm/huge_memory.c
===================================================================
--- mmotm-0125.orig/mm/huge_memory.c
+++ mmotm-0125/mm/huge_memory.c
@@ -2011,8 +2011,10 @@ static unsigned int khugepaged_scan_mm_s
 	down_read(&mm->mmap_sem);
 	if (unlikely(khugepaged_test_exit(mm)))
 		vma = NULL;
-	else
+	else if (mem_cgroup_worth_try_hugepage_scan(mm))
 		vma = find_vma(mm, khugepaged_scan.address);
+	else
+		vma = NULL;
 
 	progress++;
 	for (; vma; vma = vma->vm_next) {
@@ -2024,6 +2026,12 @@ static unsigned int khugepaged_scan_mm_s
 			break;
 		}
 
+		if (unlikely(!mem_cgroup_worth_try_hugepage_scan(mm))) {
+			progress++;
+			vma = NULL; /* try next mm */
+			break;
+		}
+
 		if ((!(vma->vm_flags & VM_HUGEPAGE) &&
 		     !khugepaged_always()) ||
 		    (vma->vm_flags & VM_NOHUGEPAGE)) {
Index: mmotm-0125/include/linux/memcontrol.h
===================================================================
--- mmotm-0125.orig/include/linux/memcontrol.h
+++ mmotm-0125/include/linux/memcontrol.h
@@ -148,6 +148,7 @@ u64 mem_cgroup_get_limit(struct mem_cgro
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail);
+bool mem_cgroup_worth_try_hugepage_scan(struct mm_struct *mm);
 #endif
 
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
@@ -342,6 +343,12 @@ u64 mem_cgroup_get_limit(struct mem_cgro
 static inline void mem_cgroup_split_huge_fixup(struct page *head,
 						struct page *tail)
 {
+
+}
+
+static inline bool mem_cgroup_worth_try_hugepage_scan(struct mm_struct *mm)
+{
+	return true;
 }
 
 #endif /* CONFIG_CGROUP_MEM_CONT */

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [BUGFIX][PATCH 1/4] memcg: fix limit estimation at reclaim for hugepage
  2011-01-28  8:25             ` Minchan Kim
@ 2011-01-28  8:36               ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 66+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-28  8:36 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Johannes Weiner, Daisuke Nishimura, linux-mm, linux-kernel, balbir

On Fri, 28 Jan 2011 17:25:58 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:

> Hi Hannes,
> 
> On Fri, Jan 28, 2011 at 5:17 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Fri, Jan 28, 2011 at 05:04:16PM +0900, Minchan Kim wrote:
> >> Hi Kame,
> >>
> >> On Fri, Jan 28, 2011 at 1:58 PM, KAMEZAWA Hiroyuki
> >> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >> > How about this ?
> >> > ==
> >> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >> >
> >> > Current memory cgroup's code tends to assume page_size == PAGE_SIZE
> >> > and arrangement for THP is not enough yet.
> >> >
> >> > This is one of the fixes for supporting THP. This adds
> >> > mem_cgroup_check_margin() and checks whether the required amount of
> >> > free resource remains after memory reclaim. By this, THP page allocation
> >> > can know whether it really succeeded or not and avoid an infinite loop
> >> > and hangup.
> >> >
> >> > Total fixes for do_charge()/reclaim memory will follow this patch.
> >>
> >> If this patch is only related to THP, I think patch order isn't good.
> >> Before applying [2/4], huge page allocation will retry without
> >> reclaiming and loop forever by below part.
> >>
> >> @@ -1854,9 +1858,6 @@ static int __mem_cgroup_do_charge(struct
> >>       } else
> >>               mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
> >>
> >> -     if (csize > PAGE_SIZE) /* change csize and retry */
> >> -             return CHARGE_RETRY;
> >> -
> >>       if (!(gfp_mask & __GFP_WAIT))
> >>               return CHARGE_WOULDBLOCK;
> >>
> >> Am I missing something?
> >
> > No, you are correct.  But I am not sure the order really matters in
> > theory: you have two endless loops that need independent fixing.
> 
> That's why I ask a question.
> Two endless loop?
> 
> One is what I mentioned. The other is what?
> Maybe this patch solve the other.
> But I can't guess it by only this description. Stupid..
> 
> Please open my eyes.
> 

One is.

  if (csize > PAGE_SIZE)
	return CHARGE_RETRY;

By this, reclaim will never be called.


Another is a check after memory reclaim.
==
       ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
                                        gfp_mask, flags);
        /*
         * try_to_free_mem_cgroup_pages() might not give us a full
         * picture of reclaim. Some pages are reclaimed and might be
         * moved to swap cache or just unmapped from the cgroup.
         * Check the limit again to see if the reclaim reduced the
         * current usage of the cgroup before giving up
         */
        if (ret || mem_cgroup_check_under_limit(mem_over_limit))
                return CHARGE_RETRY;
==

ret != 0 if even one page is reclaimed. Then, khugepaged will retry the
charge, fail to get enough room, reclaim one more page -> again. So, in a
busy memcg, an HPAGE_SIZE charge never fails outright but loops forever.

Even if khugepaged luckily charges HPAGE_SIZE, because khugepaged walks vmas
one by one and tries to collapse each pmd under mmap_sem, this looks like a
hang caused by khugepaged, an infinite loop.
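
The retry behavior described here can be modeled in a few lines of standalone C (an illustrative sketch, not the kernel code; the struct, constants, and the refill term are invented to emulate a busy memcg where other tasks charge pages back as fast as reclaim frees them):

```c
#include <assert.h>
#include <stdbool.h>

#define HPAGE_PAGES 512   /* small pages per 2MB huge page */

struct memcg_model {
    long limit;   /* in small pages */
    long usage;   /* in small pages */
};

/*
 * Model of the huge-page charge retry loop.  Each pass reclaims
 * reclaim_per_try pages while concurrently running tasks charge
 * refill_per_try pages back.  Returns how many iterations ran before
 * the loop exited; hitting 'cap' models the endless loop.
 */
static int charge_hpage_iterations(struct memcg_model *m, long reclaim_per_try,
                                   long refill_per_try, bool use_margin_check,
                                   int cap)
{
    int iters = 0;
    while (iters < cap) {
        iters++;
        if (m->limit - m->usage >= HPAGE_PAGES)
            break;                       /* the charge fits: success */
        m->usage -= reclaim_per_try;     /* one reclaim pass */
        if (use_margin_check) {
            /* fixed: retry only if a whole huge page now fits */
            if (m->limit - m->usage < HPAGE_PAGES)
                break;
        } else {
            /* old check: any headroom at all means "retry" */
            if (m->usage >= m->limit)
                break;
        }
        m->usage += refill_per_try;      /* busy tasks charge it back */
    }
    return iters;
}
```

With usage pinned 10 pages below the limit, the old "under limit" check retries until the cap, while the margin check gives up after one pass.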


Thanks,
-Kame

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [BUGFIX][PATCH 1/4] memcg: fix limit estimation at reclaim for hugepage
  2011-01-28  8:24           ` KAMEZAWA Hiroyuki
@ 2011-01-28  8:37             ` Minchan Kim
  -1 siblings, 0 replies; 66+ messages in thread
From: Minchan Kim @ 2011-01-28  8:37 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Daisuke Nishimura, linux-mm, linux-kernel, hannes, balbir

On Fri, Jan 28, 2011 at 5:24 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Fri, 28 Jan 2011 17:04:16 +0900
> Minchan Kim <minchan.kim@gmail.com> wrote:
>
>> Hi Kame,
>>
>> On Fri, Jan 28, 2011 at 1:58 PM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> > How about this ?
>> > ==
>> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> >
>> > Current memory cgroup's code tends to assume page_size == PAGE_SIZE
>> > and arrangement for THP is not enough yet.
>> >
>> > This is one of the fixes for supporting THP. This adds
>> > mem_cgroup_check_margin() and checks whether the required amount of
>> > free resource remains after memory reclaim. By this, THP page allocation
>> > can know whether it really succeeded or not and avoid an infinite loop
>> > and hangup.
>> >
>> > Total fixes for do_charge()/reclaim memory will follow this patch.
>>
>> If this patch is only related to THP, I think patch order isn't good.
>> Before applying [2/4], huge page allocation will retry without
>> reclaiming and loop forever by below part.
>>
>> @@ -1854,9 +1858,6 @@ static int __mem_cgroup_do_charge(struct
>>       } else
>>               mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
>>
>> -     if (csize > PAGE_SIZE) /* change csize and retry */
>> -             return CHARGE_RETRY;
>> -
>>       if (!(gfp_mask & __GFP_WAIT))
>>               return CHARGE_WOULDBLOCK;
>>
>> Am I missing something?
>>
>
> You're right. But
>  - This patch order doesn't affect bisection of the bug, because
>    the 2 bugs seem to be the same.
>  - This patch implements a leaf function for the real fix.
>
> Then, I think patch order is not a problem here.
>
> Thank you for pointing out.

Okay. I understand Hannes' and your opinion.
In my opinion, my suggestion could enhance the readability of this patch
series, but that's just my viewpoint. :)
Anyway, I don't mind it.

Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

Thanks!!

>
> Thanks,
> -Kame
>
>
>
>
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [BUGFIX][PATCH 1/4] memcg: fix limit estimation at reclaim for hugepage
  2011-01-28  8:25             ` Minchan Kim
@ 2011-01-28  8:41               ` Johannes Weiner
  -1 siblings, 0 replies; 66+ messages in thread
From: Johannes Weiner @ 2011-01-28  8:41 UTC (permalink / raw)
  To: Minchan Kim
  Cc: KAMEZAWA Hiroyuki, Daisuke Nishimura, linux-mm, linux-kernel, balbir

On Fri, Jan 28, 2011 at 05:25:58PM +0900, Minchan Kim wrote:
> Hi Hannes,
> 
> On Fri, Jan 28, 2011 at 5:17 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Fri, Jan 28, 2011 at 05:04:16PM +0900, Minchan Kim wrote:
> >> Hi Kame,
> >>
> >> On Fri, Jan 28, 2011 at 1:58 PM, KAMEZAWA Hiroyuki
> >> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >> > How about this ?
> >> > ==
> >> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >> >
> >> > Current memory cgroup's code tends to assume page_size == PAGE_SIZE
> >> > and arrangement for THP is not enough yet.
> >> >
> >> > This is one of the fixes for supporting THP. This adds
> >> > mem_cgroup_check_margin() and checks whether the required amount of
> >> > free resource remains after memory reclaim. By this, THP page allocation
> >> > can know whether it really succeeded or not and avoid an infinite loop
> >> > and hangup.
> >> >
> >> > Total fixes for do_charge()/reclaim memory will follow this patch.
> >>
> >> If this patch is only related to THP, I think patch order isn't good.
> >> Before applying [2/4], huge page allocation will retry without
> >> reclaiming and loop forever by below part.
> >>
> >> @@ -1854,9 +1858,6 @@ static int __mem_cgroup_do_charge(struct
> >>       } else
> >>               mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
> >>
> >> -     if (csize > PAGE_SIZE) /* change csize and retry */
> >> -             return CHARGE_RETRY;
> >> -
> >>       if (!(gfp_mask & __GFP_WAIT))
> >>               return CHARGE_WOULDBLOCK;
> >>
> >> Am I missing something?
> >
> > No, you are correct.  But I am not sure the order really matters in
> > theory: you have two endless loops that need independent fixing.
> 
> That's why I ask a question.
> Two endless loop?
> 
> One is what I mentioned. The other is what?

The other is

1. huge page charge, fail
2. reclaim
3. limit is ok, retry at 1.

The problem is that the limit will be okay for regular pages, but not
for huge pages.

Currently, this code is never reached due to the endless loop you
already spotted.

HTH

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [BUGFIX][PATCH 2/4] memcg: fix charge path for THP and allow early retirement
  2011-01-28  8:14       ` KAMEZAWA Hiroyuki
@ 2011-01-28  9:02         ` Johannes Weiner
  -1 siblings, 0 replies; 66+ messages in thread
From: Johannes Weiner @ 2011-01-28  9:02 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, nishimura, balbir

On Fri, Jan 28, 2011 at 05:14:47PM +0900, KAMEZAWA Hiroyuki wrote:
> On Fri, 28 Jan 2011 08:57:24 +0100
> Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > On Fri, Jan 28, 2011 at 12:26:08PM +0900, KAMEZAWA Hiroyuki wrote:
> > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > 
> > > When THP is used, a Hugepage-size charge can happen. It's not handled
> > > correctly in mem_cgroup_do_charge(). For example, THP can fall back
> > > to small page allocation when a HUGEPAGE allocation seems difficult
> > > or busy, but the memory cgroup doesn't understand it and continues to
> > > try HUGEPAGE charging. And the worst thing is that the memory cgroup
> > > believes 'memory reclaim succeeded' if limit - usage > PAGE_SIZE.
> > > 
> > > By this, khugepaged etc. can go into an infinite reclaim loop
> > > if tasks in the memcg are busy.
> > > 
> > > After this patch 
> > >  - Hugepage allocation will fail if 1st trial of page reclaim fails.
> > > 
> > > Changelog:
> > >  - make changes small. removed renaming codes.
> > > 
> > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > ---
> > >  mm/memcontrol.c |   28 ++++++++++++++++++++++++----
> > >  1 file changed, 24 insertions(+), 4 deletions(-)
> > 
> > Was there something wrong with my oneline fix?
> > 
> I thought your patch was against RHEL6.

Sorry, this was a misunderstanding.  All three patches I sent
yesterday were based on the latest mmotm.

> > Really, there is no way to make this a beautiful fix.  The way this
> > function is split up makes no sense, and the constant addition of more
> > and more flags just to correctly communicate with _one callsite_
> > should be an obvious hint.
> > 
> 
> Your version has to depend on oom_check flag to work fine.
> I think it's complicated.

I don't understand.  We want to retry when batching fails, but not
when huge page charging fails.  This is exactly what my patch does.

This function has 3 steps:

1. charge
2. reclaim
3. handle out of memory

Between all those steps, there are defined break-out points.  Between
1. and 2. there is the check for batching.  Between 2. and 3. is the
check for whether we should OOM directly or let it be handled by the
caller.

These break points make perfect sense, because when batching we want
to charge but not reclaim.  With huge pages we want to charge and
reclaim, but not OOM.  This is exactly what my patches implement,
straightforwardly.  Not by introducing new break points, but by fixing
those that are already there!

I resent your patch because you mess up this logic by moving the break
point between 1. and 2. to between 2. and 3., where it is no longer
intuitive to understand.  And your argument is that you don't want to
trust your own code to function correctly (oom_check).  This is insane.
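
The three steps and their break-out points can be sketched as a tiny decision function (an abstract model of the control flow being argued about, with invented field and constant names, not the actual mmotm code):

```c
#include <assert.h>
#include <stdbool.h>

enum charge_result {
    CHARGE_OK,          /* charge succeeded */
    CHARGE_RETRY,       /* caller should try again */
    CHARGE_WOULDBLOCK,  /* !__GFP_WAIT: give up without reclaiming */
    CHARGE_NOMEM,       /* fail the charge, let the caller cope */
    CHARGE_OOM_DIE,     /* invoke the memcg OOM killer */
};

struct charge_ctx {
    bool charge_fits;     /* step 1: did the res_counter charge succeed? */
    bool batching;        /* charging a batch for the per-cpu stock */
    bool can_wait;        /* __GFP_WAIT set, reclaim is allowed */
    bool reclaim_helped;  /* step 2: did reclaim make enough room? */
    bool oom_check;       /* step 3: may we invoke the OOM killer? */
};

/* One pass through the charge logic with its break-out points. */
static enum charge_result do_charge_once(const struct charge_ctx *c)
{
    if (c->charge_fits)                  /* step 1: charge */
        return CHARGE_OK;

    if (c->batching)                     /* break-out between 1 and 2:  */
        return CHARGE_RETRY;             /* retry with a smaller size   */

    if (!c->can_wait)
        return CHARGE_WOULDBLOCK;

    if (c->reclaim_helped)               /* step 2: reclaim */
        return CHARGE_RETRY;

    if (!c->oom_check)                   /* break-out between 2 and 3:  */
        return CHARGE_NOMEM;             /* huge pages fail here rather */
                                         /* than OOM-killing            */
    return CHARGE_OOM_DIE;               /* step 3: out of memory */
}
```

In this model a huge-page charge (no batching, oom_check clear, reclaim unhelpful) reaches CHARGE_NOMEM instead of the OOM killer, while a batched charge retries before ever reclaiming.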

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [BUGFIX][PATCH 2/4] memcg: fix charge path for THP and allow early retirement
  2011-01-28  9:02         ` Johannes Weiner
@ 2011-01-28  9:16           ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 66+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-28  9:16 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: linux-mm, linux-kernel, nishimura, balbir

On Fri, 28 Jan 2011 10:02:42 +0100
Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Fri, Jan 28, 2011 at 05:14:47PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Fri, 28 Jan 2011 08:57:24 +0100
> > Johannes Weiner <hannes@cmpxchg.org> wrote:
> > 
> > > On Fri, Jan 28, 2011 at 12:26:08PM +0900, KAMEZAWA Hiroyuki wrote:
> > > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > > 
> > > > When THP is used, Hugepage size charge can happen. It's not handled
> > > > correctly in mem_cgroup_do_charge(). For example, THP can fallback
> > > > to small page allocation when HUGEPAGE allocation seems difficult
> > > > or busy, but memory cgroup doesn't understand it and continue to
> > > > try HUGEPAGE charging. And the worst thing is memory cgroup
> > > > believes 'memory reclaim succeeded' if limit - usage > PAGE_SIZE.
> > > > 
> > > > By this, khugepaged etc. can go into an infinite reclaim loop
> > > > if tasks in the memcg are busy.
> > > > 
> > > > After this patch 
> > > >  - Hugepage allocation will fail if 1st trial of page reclaim fails.
> > > > 
> > > > Changelog:
> > > >  - make changes small. removed renaming codes.
> > > > 
> > > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > > ---
> > > >  mm/memcontrol.c |   28 ++++++++++++++++++++++++----
> > > >  1 file changed, 24 insertions(+), 4 deletions(-)
> > > 
> > > Was there something wrong with my oneline fix?
> > > 
> > I thought your patch was against RHEL6.
> 
> Sorry, this was a misunderstanding.  All three patches I sent
> yesterday were based on the latest mmotm.
> 
I misunderstood your e-mail.


> > > Really, there is no way to make this a beautiful fix.  The way this
> > > function is split up makes no sense, and the constant addition of more
> > > and more flags just to correctly communicate with _one callsite_
> > > should be an obvious hint.
> > > 
> > 
> > Your version has to depend on the oom_check flag to work correctly.
> > I think it's complicated.
> 
> I don't understand.  We want to retry when batching fails, but not
> when huge page charging fails.  This is exactly what my patch does.
> 
> This function has 3 steps:
> 
> 1. charge
> 2. reclaim
> 3. handle out of memory
> 
> Between all those steps, there are defined break-out points.  Between
> 1. and 2. there is the check for batching.  Between 2. and 3. is the
> check for whether we should OOM directly or let it be handled by the
> caller.
> 
> These break points make perfect sense, because when batching we want
> to charge but not reclaim.  With huge pages we want to charge and
> reclaim, but not OOM.  This is exactly what my patches implement:
> not by introducing new break points, but by fixing those that are
> already there!
> 
> I resent your patch because you mess up this logic by moving the break
> point between 1. and 2. to between 2. and 3., where it is no longer
> intuitive to understand.  And your argument is that you don't want to
> trust your own code to function correctly (oom_check).  This is insane.
> 

oom_check is not for controlling 'how we retry'.

But I'm getting tired of handling this breakage caused by THP.
Please post your patches again; I'll ack them all next week.
Andrew, please pick up Johannes's patches when posted.

I hope all the issues are fixed next week so that we can go ahead
and implement dirty_ratio.

Bye.
-Kame









^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [BUGFIX][PATCH 0/4] Fixes for memcg with THP
  2011-01-28  3:22 ` KAMEZAWA Hiroyuki
@ 2011-01-29 12:47   ` Balbir Singh
  -1 siblings, 0 replies; 66+ messages in thread
From: Balbir Singh @ 2011-01-29 12:47 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-mm, linux-kernel, hannes, nishimura

On Fri, Jan 28, 2011 at 8:52 AM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> On recent -mm, when I run make -j 8 under 200M limit of memcg, as
> ==
> # mount -t cgroup none /cgroup/memory -o memory
> # mkdir /cgroup/memory/A
> # echo 200M > /cgroup/memory/A/memory.limit_in_bytes
> # echo $$ > /cgroup/memory/A/tasks
> # make -j 8 kernel
> ==
>
> I see hangs with khugepaged. That's because memcg's memory reclaim
> routine doesn't handle HUGE_PAGE requests in a proper way. And khugepaged
> doesn't know about memcg.
>
> This patch set fixes the above hang. Patches 1-3 seem obvious and
> have the same concept as patches in RHEL.

Do you have any backtraces? Are they in the specific patches?

Balbir

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [BUGFIX][PATCH 1/4] memcg: fix limit estimation at reclaim for hugepage
  2011-01-28  8:36               ` KAMEZAWA Hiroyuki
@ 2011-01-30  2:26                 ` Minchan Kim
  -1 siblings, 0 replies; 66+ messages in thread
From: Minchan Kim @ 2011-01-30  2:26 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Johannes Weiner, Daisuke Nishimura, linux-mm, linux-kernel, balbir

On Fri, Jan 28, 2011 at 5:36 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Fri, 28 Jan 2011 17:25:58 +0900
> Minchan Kim <minchan.kim@gmail.com> wrote:
>
>> Hi Hannes,
>>
>> On Fri, Jan 28, 2011 at 5:17 PM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>> > On Fri, Jan 28, 2011 at 05:04:16PM +0900, Minchan Kim wrote:
>> >> Hi Kame,
>> >>
>> >> On Fri, Jan 28, 2011 at 1:58 PM, KAMEZAWA Hiroyuki
>> >> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> >> > How about this ?
>> >> > ==
>> >> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> >> >
> >> >> > The current memory cgroup code tends to assume page_size == PAGE_SIZE,
> >> >> > and its handling of THP is not sufficient yet.
> >> >> >
> >> >> > This is one of the fixes for supporting THP. It adds
> >> >> > mem_cgroup_check_margin() and checks whether the required amount of
> >> >> > free resource is available after memory reclaim. With this, a THP page
> >> >> > allocation can know whether it really succeeded or not and avoid an
> >> >> > infinite loop and hang.
>> >> >
>> >> > Total fixes for do_charge()/reclaim memory will follow this patch.
>> >>
> >> >> If this patch is only related to THP, I think the patch order isn't good.
> >> >> Before applying [2/4], huge page allocation will retry without
> >> >> reclaiming and loop forever because of the part below.
>> >>
>> >> @@ -1854,9 +1858,6 @@ static int __mem_cgroup_do_charge(struct
>> >>       } else
>> >>               mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
>> >>
>> >> -     if (csize > PAGE_SIZE) /* change csize and retry */
>> >> -             return CHARGE_RETRY;
>> >> -
>> >>       if (!(gfp_mask & __GFP_WAIT))
>> >>               return CHARGE_WOULDBLOCK;
>> >>
>> >> Am I missing something?
>> >
>> > No, you are correct.  But I am not sure the order really matters in
>> > theory: you have two endless loops that need independent fixing.
>>
>> That's why I ask a question.
>> Two endless loop?
>>
>> One is what I mentioned. The other is what?
>> Maybe this patch solve the other.
>> But I can't guess it by only this description. Stupid..
>>
>> Please open my eyes.
>>
>
> One is.
>
>  if (csize > PAGE_SIZE)
>        return CHARGE_RETRY;
>
> By this, reclaim will never be called.
>
>
> Another is a check after memory reclaim.
> ==
>       ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
>                                        gfp_mask, flags);
>        /*
>         * try_to_free_mem_cgroup_pages() might not give us a full
>         * picture of reclaim. Some pages are reclaimed and might be
>         * moved to swap cache or just unmapped from the cgroup.
>         * Check the limit again to see if the reclaim reduced the
>         * current usage of the cgroup before giving up
>         */
>        if (ret || mem_cgroup_check_under_limit(mem_over_limit))
>                return CHARGE_RETRY;
> ==
>
> ret != 0 if even one page is reclaimed. Then khugepaged will retry the
> charge, fail to get enough room, reclaim one page -> and so on. So, in a
> busy memcg, an HPAGE_SIZE charge never fails.
>
> Even if khugepaged luckily charges HPAGE_SIZE, because khugepaged walks
> vmas one by one and tries to collapse each pmd under mmap_sem(), this
> looks like a hang caused by khugepaged: an infinite loop.
>
>
> Thanks,
> -Kame
>
>

Kame, Hannes, Thanks.

I understood your opinion. :)
As I said earlier, it can at least help patch review.
When I first saw only [1/4], I felt it wouldn't affect anything, since
THP allocation would return early, before reaching your patch, so the
infinite loop would still happen.

Of course, once we apply [2/4], the problem is gone.
But I couldn't know that until I read [2/4]. It confuses reviewers.

So I suggest putting [2/4] ahead of [1/4] and including something like
the following in [2/4]:
"This patch still has an infinite-loop problem in the case of xxxx.
The next patch solves it."

If it doesn't cause a problem for bisecting, I hope the patch order
can be changed, if you don't mind. It's the same when I review
Hannes's version. :(

I will review again when Hannes resends the series.

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [BUGFIX][PATCH 0/4] Fixes for memcg with THP
  2011-01-29 12:47   ` Balbir Singh
@ 2011-01-30 23:55     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 66+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-30 23:55 UTC (permalink / raw)
  To: Balbir Singh; +Cc: linux-mm, linux-kernel, hannes, nishimura

On Sat, 29 Jan 2011 18:17:56 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> On Fri, Jan 28, 2011 at 8:52 AM, KAMEZAWA Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >
> > On recent -mm, when I run make -j 8 under 200M limit of memcg, as
> > ==
> > # mount -t cgroup none /cgroup/memory -o memory
> > # mkdir /cgroup/memory/A
> > # echo 200M > /cgroup/memory/A/memory.limit_in_bytes
> > # echo $$ > /cgroup/memory/A/tasks
> > # make -j 8 kernel
> > ==
> >
> > I see hangs with khugepaged. That's because memcg's memory reclaim
> > routine doesn't handle HUGE_PAGE requests in a proper way. And khugepaged
> > doesn't know about memcg.
> >
> > This patch set fixes the above hang. Patches 1-3 seem obvious and
> > have the same concept as patches in RHEL.
> 
> Do you have any backtraces? Are they in the specific patches?
> 

Jan 18 10:28:29 rhel6-test kernel: [56245.286007] INFO: rcu_sched_state detected stall on CPU 0 (t=60000 jiffies)
Jan 18 10:28:29 rhel6-test kernel: [56245.286007] sending NMI to all CPUs:
Jan 18 10:28:29 rhel6-test kernel: [56245.286007] NMI backtrace for cpu 0
Jan 18 10:28:29 rhel6-test kernel: [56245.286007] CPU 0

Jan 18 10:28:29 rhel6-test kernel: [56245.286007]  [<ffffffff8102a04e>] arch_trigger_all_cpu_backtrace+0x5e/0xa0
Jan 18 10:28:29 rhel6-test kernel: [56245.286007]  [<ffffffff810bca09>] __rcu_pending+0x169/0x3b0
Jan 18 10:28:29 rhel6-test kernel: [56245.286007]  [<ffffffff8108a250>] ? tick_sched_timer+0x0/0xc0
Jan 18 10:28:29 rhel6-test kernel: [56245.286007]  [<ffffffff810bccbc>] rcu_check_callbacks+0x6c/0x120
Jan 18 10:28:29 rhel6-test kernel: [56245.286007]  [<ffffffff810689a8>] update_process_times+0x48/0x90
Jan 18 10:28:29 rhel6-test kernel: [56245.286007]  [<ffffffff8108a2b6>] tick_sched_timer+0x66/0xc0
Jan 18 10:28:29 rhel6-test kernel: [56245.286007]  [<ffffffff8107ede0>] __run_hrtimer+0x90/0x1e0
Jan 18 10:28:29 rhel6-test kernel: [56245.286007]  [<ffffffff81032db9>] ? kvm_clock_get_cycles+0x9/0x10
Jan 18 10:28:29 rhel6-test kernel: [56245.286007]  [<ffffffff8107f1be>] hrtimer_interrupt+0xde/0x240
Jan 18 10:28:29 rhel6-test kernel: [56245.286007]  [<ffffffff8155268b>] smp_apic_timer_interrupt+0x6b/0x9b
Jan 18 10:28:29 rhel6-test kernel: [56245.286007]  [<ffffffff8100c9d3>] apic_timer_interrupt+0x13/0x20
Jan 18 10:28:29 rhel6-test kernel: [56245.286007]  <EOI>
Jan 18 10:28:29 rhel6-test kernel: [56245.286007]  [<ffffffff810a726a>] ? res_counter_charge+0xda/0x100
Jan 18 10:28:29 rhel6-test kernel: [56245.286007]  [<ffffffff81145459>] __mem_cgroup_try_charge+0x199/0x5d0
Jan 18 10:28:29 rhel6-test kernel: [56245.286007]  [<ffffffff811463b5>] mem_cgroup_newpage_charge+0x45/0x50
Jan 18 10:28:29 rhel6-test kernel: [56245.286007]  [<ffffffff8113dbd4>] khugepaged+0x924/0x1430
Jan 18 10:28:29 rhel6-test kernel: [56245.286007]  [<ffffffff8107af00>] ? autoremove_wake_function+0x0/0x40
Jan 18 10:28:29 rhel6-test kernel: [56245.286007]  [<ffffffff8113d2b0>] ? khugepaged+0x0/0x1430
Jan 18 10:28:29 rhel6-test kernel: [56245.286007]  [<ffffffff8107a8b6>] kthread+0x96/0xa0
Jan 18 10:28:29 rhel6-test kernel: [56245.286007]  [<ffffffff8100ce24>] kernel_thread_helper+0x4/0x10
Jan 18 10:28:29 rhel6-test kernel: [56245.286007]  [<ffffffff8107a820>] ? kthread+0x0/0xa0
Jan 18 10:28:29 rhel6-test kernel: [56245.286007]  [<ffffffff8100ce20>] ? kernel_thread_helper+0x0/0x10

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [BUGFIX][PATCH 3/4] mecg: fix oom flag at THP charge
  2011-01-28  8:21       ` KAMEZAWA Hiroyuki
@ 2011-01-31  7:41         ` Balbir Singh
  -1 siblings, 0 replies; 66+ messages in thread
From: Balbir Singh @ 2011-01-31  7:41 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Johannes Weiner, linux-mm, linux-kernel, nishimura

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2011-01-28 17:21:46]:

> On Fri, 28 Jan 2011 09:02:13 +0100
> Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > On Fri, Jan 28, 2011 at 12:27:29PM +0900, KAMEZAWA Hiroyuki wrote:
> > > 
> > > Thanks to Johannes and Daisuke for the suggestion.
> > > =
> > > Hugepage allocation shouldn't trigger oom.
> > > Allocation failure is not fatal.
> > > 
> > > Original-patch-by: Johannes Weiner <hannes@cmpxchg.org>
> > > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > ---
> > >  mm/memcontrol.c |    4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > 
> > > Index: mmotm-0125/mm/memcontrol.c
> > > ===================================================================
> > > --- mmotm-0125.orig/mm/memcontrol.c
> > > +++ mmotm-0125/mm/memcontrol.c
> > > @@ -2369,11 +2369,14 @@ static int mem_cgroup_charge_common(stru
> > >  	struct page_cgroup *pc;
> > >  	int ret;
> > >  	int page_size = PAGE_SIZE;
> > > +	bool oom;
> > >  
> > >  	if (PageTransHuge(page)) {
> > >  		page_size <<= compound_order(page);
> > >  		VM_BUG_ON(!PageTransHuge(page));
> > > -	}
> > > +		oom = false;
> > > +	} else
> > > +		oom = true;
> > 
> > That needs a comment.  You can take the one from my patch if you like.
> > 
> 
> How about this ?
> ==
> Hugepage allocation shouldn't trigger oom.
> Allocation failure is not fatal.
>

 
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
 
-- 
	Three Cheers,
	Balbir

^ permalink raw reply	[flat|nested] 66+ messages in thread

end of thread, other threads:[~2011-01-31  9:03 UTC | newest]

Thread overview: 66+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-01-28  3:22 [BUGFIX][PATCH 0/4] Fixes for memcg with THP KAMEZAWA Hiroyuki
2011-01-28  3:22 ` KAMEZAWA Hiroyuki
2011-01-28  3:24 ` [BUGFIX][PATCH 1/4] memcg: fix limit estimation at reclaim for hugepage KAMEZAWA Hiroyuki
2011-01-28  3:24   ` KAMEZAWA Hiroyuki
2011-01-28  4:40   ` Daisuke Nishimura
2011-01-28  4:40     ` Daisuke Nishimura
2011-01-28  4:49     ` KAMEZAWA Hiroyuki
2011-01-28  4:49       ` KAMEZAWA Hiroyuki
2011-01-28  4:58     ` KAMEZAWA Hiroyuki
2011-01-28  4:58       ` KAMEZAWA Hiroyuki
2011-01-28  5:36       ` Daisuke Nishimura
2011-01-28  5:36         ` Daisuke Nishimura
2011-01-28  8:04       ` Minchan Kim
2011-01-28  8:04         ` Minchan Kim
2011-01-28  8:17         ` Johannes Weiner
2011-01-28  8:17           ` Johannes Weiner
2011-01-28  8:25           ` Minchan Kim
2011-01-28  8:25             ` Minchan Kim
2011-01-28  8:36             ` KAMEZAWA Hiroyuki
2011-01-28  8:36               ` KAMEZAWA Hiroyuki
2011-01-30  2:26               ` Minchan Kim
2011-01-30  2:26                 ` Minchan Kim
2011-01-28  8:41             ` Johannes Weiner
2011-01-28  8:41               ` Johannes Weiner
2011-01-28  8:24         ` KAMEZAWA Hiroyuki
2011-01-28  8:24           ` KAMEZAWA Hiroyuki
2011-01-28  8:37           ` Minchan Kim
2011-01-28  8:37             ` Minchan Kim
2011-01-28  7:52   ` Johannes Weiner
2011-01-28  7:52     ` Johannes Weiner
2011-01-28  8:06     ` KAMEZAWA Hiroyuki
2011-01-28  8:06       ` KAMEZAWA Hiroyuki
2011-01-28  3:26 ` [BUGFIX][PATCH 2/4] memcg: fix charge path for THP and allow early retirement KAMEZAWA Hiroyuki
2011-01-28  3:26   ` KAMEZAWA Hiroyuki
2011-01-28  5:37   ` Daisuke Nishimura
2011-01-28  5:37     ` Daisuke Nishimura
2011-01-28  7:57   ` Johannes Weiner
2011-01-28  7:57     ` Johannes Weiner
2011-01-28  8:14     ` KAMEZAWA Hiroyuki
2011-01-28  8:14       ` KAMEZAWA Hiroyuki
2011-01-28  9:02       ` Johannes Weiner
2011-01-28  9:02         ` Johannes Weiner
2011-01-28  9:16         ` KAMEZAWA Hiroyuki
2011-01-28  9:16           ` KAMEZAWA Hiroyuki
2011-01-28  3:27 ` [BUGFIX][PATCH 3/4] mecg: fix oom flag at THP charge KAMEZAWA Hiroyuki
2011-01-28  3:27   ` KAMEZAWA Hiroyuki
2011-01-28  5:39   ` Daisuke Nishimura
2011-01-28  5:39     ` Daisuke Nishimura
2011-01-28  5:50     ` KAMEZAWA Hiroyuki
2011-01-28  5:50       ` KAMEZAWA Hiroyuki
2011-01-28  8:02   ` Johannes Weiner
2011-01-28  8:02     ` Johannes Weiner
2011-01-28  8:21     ` KAMEZAWA Hiroyuki
2011-01-28  8:21       ` KAMEZAWA Hiroyuki
2011-01-31  7:41       ` Balbir Singh
2011-01-31  7:41         ` Balbir Singh
2011-01-28  3:28 ` [BUGFIX][PATCH 4/4] memcg: fix khugepaged should skip busy memcg KAMEZAWA Hiroyuki
2011-01-28  3:28   ` KAMEZAWA Hiroyuki
2011-01-28  8:20   ` Daisuke Nishimura
2011-01-28  8:20     ` Daisuke Nishimura
2011-01-28  8:30     ` KAMEZAWA Hiroyuki
2011-01-28  8:30       ` KAMEZAWA Hiroyuki
2011-01-29 12:47 ` [BUGFIX][PATCH 0/4] Fixes for memcg with THP Balbir Singh
2011-01-29 12:47   ` Balbir Singh
2011-01-30 23:55   ` KAMEZAWA Hiroyuki
2011-01-30 23:55     ` KAMEZAWA Hiroyuki
