linux-kernel.vger.kernel.org archive mirror
* [BUGFIX][PATCH 0/5] memcg bugfixes in the last week.
@ 2011-06-13  3:00 KAMEZAWA Hiroyuki
  2011-06-13  3:03 ` [BUGFIX][PATCH 1/5] memcg fix numa_stat permission KAMEZAWA Hiroyuki
                   ` (4 more replies)
  0 siblings, 5 replies; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-06-13  3:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, akpm, nishimura, bsingharora, hannes, Michal Hocko, Ying Han


In the last week, I (and other memcg people) posted 5 bug fixes.
I was slightly confused.

To make things clearer, I'm posting all 5 patches again,
which I'm now testing.

If I missed some patches/fixes/bugs, please notify me.

[1/5] - fix memory.numa_stat permission (this is in mmotm)
[2/5] - fix init_page_cgroup() nid with sparsemem
[3/5] - fix mm->owner update
[4/5] - fix wrong check of noswap with softlimit
[5/5] - fix percpu cached charge draining.


Thank you for all your help.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [BUGFIX][PATCH 1/5] memcg fix numa_stat permission
  2011-06-13  3:00 [BUGFIX][PATCH 0/5] memcg bugfixes in the last week KAMEZAWA Hiroyuki
@ 2011-06-13  3:03 ` KAMEZAWA Hiroyuki
  2011-06-14  9:42   ` Johannes Weiner
  2011-06-13  3:06 ` [BUGFIX][PATCH 2/5] memcg: fix init_page_cgroup nid with sparsemem KAMEZAWA Hiroyuki
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-06-13  3:03 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, linux-mm, akpm, nishimura, bsingharora, hannes,
	Michal Hocko, Ying Han

This is already queued in mmotm.
==
>From 3b1bec1d07ba21339697f069acab47dae6b81097 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Date: Mon, 13 Jun 2011 09:32:28 +0900
Subject: [PATCH 1/5] memcg fix numa_stat permission

commit 406eb0c9b ("memcg: add memory.numastat api for numa statistics")
 adds the memory.numa_stat file for the memory cgroup.  But the file
 permissions are wrong.

[kamezawa@bluextal linux-2.6]$ ls -l /cgroup/memory/A/memory.numa_stat
---------- 1 root root 0 Jun  9 18:36 /cgroup/memory/A/memory.numa_stat

This patch fixes the permissions to:

[root@bluextal kamezawa]# ls -l /cgroup/memory/A/memory.numa_stat
-r--r--r-- 1 root root 0 Jun 10 16:49 /cgroup/memory/A/memory.numa_stat

Acked-by: Ying Han <yinghan@google.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bd9052a..ce05835 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4640,6 +4640,7 @@ static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "numa_stat",
 		.open = mem_control_numa_stat_open,
+		.mode = S_IRUGO,
 	},
 #endif
 };
-- 
1.7.4.1
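
As a quick way to verify the fix from userspace, stat()ing the file should now
report mode 0444 instead of 0000. This is only an illustrative sketch; the
cgroup path is an example and the helper is not part of the patch.

/* check_numa_stat_mode.c - hypothetical helper, illustration only */
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
	struct stat st;
	const char *path = "/cgroup/memory/A/memory.numa_stat";	/* example mount point */

	if (stat(path, &st) != 0) {
		perror("stat");
		return 1;
	}
	/* expect 0444 (-r--r--r--) with the fix, 0000 without it */
	printf("%s mode: %04o\n", path, (unsigned int)(st.st_mode & 07777));
	return 0;
}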



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [BUGFIX][PATCH 2/5] memcg: fix init_page_cgroup nid with sparsemem
  2011-06-13  3:00 [BUGFIX][PATCH 0/5] memcg bugfixes in the last week KAMEZAWA Hiroyuki
  2011-06-13  3:03 ` [BUGFIX][PATCH 1/5] memcg fix numa_stat permission KAMEZAWA Hiroyuki
@ 2011-06-13  3:06 ` KAMEZAWA Hiroyuki
  2011-06-14  9:44   ` Johannes Weiner
  2011-06-16 10:09   ` [-git build bug, PATCH] " Ingo Molnar
  2011-06-13  3:09 ` [BUGFIX][PATCH 3/5] memcg: clear mm->owner when last possible owner leaves KAMEZAWA Hiroyuki
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-06-13  3:06 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, linux-mm, akpm, nishimura, bsingharora, hannes,
	Michal Hocko, Ying Han

Added some cleanups.
==
>From 45b8881201f73015c4e0352eb7e434f6e9c53fdd Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Date: Mon, 13 Jun 2011 10:09:17 +0900
Subject: [PATCH 2/5] [BUGFIX] memcg: fix init_page_cgroup nid with sparsemem


commit 21a3c96 makes page_cgroup allocation NUMA aware. But that caused
a problem: https://bugzilla.kernel.org/show_bug.cgi?id=36192

The problem was getting a NID from invalid struct pages, which were not
initialized because they were out-of-node, i.e. outside of
[node_start_pfn, node_end_pfn).

Now, with sparsemem, page_cgroup_init() scans pfns from 0 to max_pfn.
But this may scan a pfn which is not on any node and access a memmap
which is not initialized.

This makes page_cgroup_init() node aware for SPARSEMEM and removes
the code that gets the nid from page->flags. (Then, we'll always
use a valid NID.)

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Changelog:
  - use better macros such as node_start/end_pfn(), PAGE_SECTION_MASK, etc.
  - pull out error handling code, which is rarely called.
  - fixed MEM_GOING_ONLINE case.
  - added more comments and updated the patch description.
---
 mm/page_cgroup.c |   67 +++++++++++++++++++++++++++++++++++++++++-------------
 1 files changed, 51 insertions(+), 16 deletions(-)

diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 74ccff6..0742633 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -162,13 +162,13 @@ static void free_page_cgroup(void *addr)
 }
 #endif
 
-static int __meminit init_section_page_cgroup(unsigned long pfn)
+static int __meminit init_section_page_cgroup(unsigned long pfn, int nid)
 {
 	struct page_cgroup *base, *pc;
 	struct mem_section *section;
 	unsigned long table_size;
 	unsigned long nr;
-	int nid, index;
+	int index;
 
 	nr = pfn_to_section_nr(pfn);
 	section = __nr_to_section(nr);
@@ -176,7 +176,6 @@ static int __meminit init_section_page_cgroup(unsigned long pfn)
 	if (section->page_cgroup)
 		return 0;
 
-	nid = page_to_nid(pfn_to_page(pfn));
 	table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
 	base = alloc_page_cgroup(table_size, nid);
 
@@ -196,7 +195,11 @@ static int __meminit init_section_page_cgroup(unsigned long pfn)
 		pc = base + index;
 		init_page_cgroup(pc, nr);
 	}
-
+	/*
+	 * Passed "pfn" may not be aligned to SECTION. For the calculation,
+	 * we need to apply a mask.
+	 */
+	pfn &= PAGE_SECTION_MASK;
 	section->page_cgroup = base - pfn;
 	total_usage += table_size;
 	return 0;
@@ -225,10 +228,20 @@ int __meminit online_page_cgroup(unsigned long start_pfn,
 	start = start_pfn & ~(PAGES_PER_SECTION - 1);
 	end = ALIGN(start_pfn + nr_pages, PAGES_PER_SECTION);
 
+	if (nid == -1) {
+		/*
+		 * In this case, the node already exists and contains valid memory.
+		 * The "start_pfn" passed to us is the pfn that was passed to
+		 * online_pages(), and it should exist.
+		 */
+		nid = pfn_to_nid(start_pfn);
+		VM_BUG_ON(!node_state(nid, N_ONLINE));
+	}
+
 	for (pfn = start; !fail && pfn < end; pfn += PAGES_PER_SECTION) {
 		if (!pfn_present(pfn))
 			continue;
-		fail = init_section_page_cgroup(pfn);
+		fail = init_section_page_cgroup(pfn, nid);
 	}
 	if (!fail)
 		return 0;
@@ -284,25 +297,47 @@ static int __meminit page_cgroup_callback(struct notifier_block *self,
 void __init page_cgroup_init(void)
 {
 	unsigned long pfn;
-	int fail = 0;
+	int nid;
 
 	if (mem_cgroup_disabled())
 		return;
 
-	for (pfn = 0; !fail && pfn < max_pfn; pfn += PAGES_PER_SECTION) {
-		if (!pfn_present(pfn))
-			continue;
-		fail = init_section_page_cgroup(pfn);
-	}
-	if (fail) {
-		printk(KERN_CRIT "try 'cgroup_disable=memory' boot option\n");
-		panic("Out of memory");
-	} else {
-		hotplug_memory_notifier(page_cgroup_callback, 0);
+	for_each_node_state(nid, N_HIGH_MEMORY) {
+		unsigned long start_pfn, end_pfn;
+
+		start_pfn = node_start_pfn(nid);
+		end_pfn = node_end_pfn(nid);
+		/*
+		 * start_pfn and end_pfn may not be aligned to SECTION and
+		 * page->flags of out-of-node pages are not initialized. So
+		 * we scan [start_pfn, the biggest section's pfn < end_pfn) here.
+		 */
+		for (pfn = start_pfn;
+		     pfn < end_pfn;
+                     pfn = ALIGN(pfn + 1, PAGES_PER_SECTION)) {
+
+			if (!pfn_valid(pfn))
+				continue;
+			/*
+			 * Nodes' pfn ranges can overlap.
+			 * We know some arch can have nodes layout as
+			 * -------------pfn-------------->
+			 * N0 | N1 | N2 | N0 | N1 | N2|....
+			 */
+			if (pfn_to_nid(pfn) != nid)
+				continue;
+			if (init_section_page_cgroup(pfn, nid))
+				goto oom;
+		}
 	}
+	hotplug_memory_notifier(page_cgroup_callback, 0);
 	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
 	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you don't"
 	" want memory cgroups\n");
+	return;
+oom:
+	printk(KERN_CRIT "try 'cgroup_disable=memory' boot option\n");
+	panic("Out of memory");
 }
 
 void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
-- 
1.7.4.1
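
For reference, a small userspace sketch of the section-alignment arithmetic used
above. The PAGES_PER_SECTION value is an assumption (it matches x86_64's 128MB
sections with 4KB pages), and the macros simply mirror what the kernel does when
masking a pfn to its section start and stepping to the next section.

/* section_align_demo.c - illustration only, not kernel code */
#include <stdio.h>

#define PAGES_PER_SECTION	32768UL			/* assumed: 128MB / 4KB pages */
#define PAGE_SECTION_MASK	(~(PAGES_PER_SECTION - 1))
#define ALIGN(x, a)		(((x) + (a) - 1) & ~((a) - 1))

int main(void)
{
	unsigned long pfn = 0x12345;	/* deliberately not section-aligned */

	/* what init_section_page_cgroup() does before storing base - pfn */
	printf("section start:   %#lx\n", pfn & PAGE_SECTION_MASK);
	/* how the page_cgroup_init() loop advances to the next section */
	printf("next section at: %#lx\n", ALIGN(pfn + 1, PAGES_PER_SECTION));
	return 0;
}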



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [BUGFIX][PATCH 3/5] memcg: clear mm->owner when last possible owner leaves
  2011-06-13  3:00 [BUGFIX][PATCH 0/5] memcg bugfixes in the last week KAMEZAWA Hiroyuki
  2011-06-13  3:03 ` [BUGFIX][PATCH 1/5] memcg fix numa_stat permission KAMEZAWA Hiroyuki
  2011-06-13  3:06 ` [BUGFIX][PATCH 2/5] memcg: fix init_page_cgroup nid with sparsemem KAMEZAWA Hiroyuki
@ 2011-06-13  3:09 ` KAMEZAWA Hiroyuki
  2011-06-14  9:45   ` Johannes Weiner
  2011-06-13  3:11 ` [BUGFIX][PATCH 4/5] memcg: fix wrong check of noswap with softlimit KAMEZAWA Hiroyuki
  2011-06-13  3:16 ` [BUGFIX][PATCH 5/5] memcg: fix percpu cached charge draining frequency KAMEZAWA Hiroyuki
  4 siblings, 1 reply; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-06-13  3:09 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, linux-mm, akpm, nishimura, bsingharora, hannes,
	Michal Hocko, Ying Han, Hugh Dickins, davej

This is Hugh's version.
==
>From c3d1fb5637dd01ae034cdf2106aaf3249856738e Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Date: Mon, 13 Jun 2011 10:26:29 +0900
Subject: [PATCH 3/5] [BUGFIX] memcg: clear mm->owner when last possible owner leaves

The following crash was reported:

> Call Trace:
> [<ffffffff81139792>] mem_cgroup_from_task+0x15/0x17
> [<ffffffff8113a75a>] __mem_cgroup_try_charge+0x148/0x4b4
> [<ffffffff810493f3>] ? need_resched+0x23/0x2d
> [<ffffffff814cbf43>] ? preempt_schedule+0x46/0x4f
> [<ffffffff8113afe8>] mem_cgroup_charge_common+0x9a/0xce
> [<ffffffff8113b6d1>] mem_cgroup_newpage_charge+0x5d/0x5f
> [<ffffffff81134024>] khugepaged+0x5da/0xfaf
> [<ffffffff81078ea0>] ? __init_waitqueue_head+0x4b/0x4b
> [<ffffffff81133a4a>] ? add_mm_counter.constprop.5+0x13/0x13
> [<ffffffff81078625>] kthread+0xa8/0xb0
> [<ffffffff814d13e8>] ? sub_preempt_count+0xa1/0xb4
> [<ffffffff814d5664>] kernel_thread_helper+0x4/0x10
> [<ffffffff814ce858>] ? retint_restore_args+0x13/0x13
> [<ffffffff8107857d>] ? __init_kthread_worker+0x5a/0x5a

What happens is that khugepaged tries to charge a huge page against an
mm whose last possible owner has already exited, and the memory
controller crashes when the stale mm->owner is used to look up the
cgroup to charge.

mm->owner has never been set to NULL with the last owner going away,
but nobody cared until khugepaged came along.

Even then it wasn't a problem because the final mmput() on an mm was
forced to acquire and release mmap_sem in write-mode, preventing an
exiting owner from going away while the mmap_sem was held, and until
"692e0b3 mm: thp: optimize memcg charge in khugepaged", the memory
cgroup charge was protected by mmap_sem in read-mode.

Instead of going back to relying on the mmap_sem to enforce lifetime
of a task, this patch ensures that mm->owner is properly set to NULL
when the last possible owner is exiting, which the memory controller
can handle just fine.

Reported-by: Hugh Dickins <hughd@google.com>
Reported-by: Dave Jones <davej@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 kernel/exit.c |   31 +++++++++++++++----------------
 1 files changed, 15 insertions(+), 16 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 20a4064..26c5feb 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -561,29 +561,28 @@ void exit_files(struct task_struct *tsk)
 
 #ifdef CONFIG_MM_OWNER
 /*
- * Task p is exiting and it owned mm, lets find a new owner for it
+ * A task is exiting and if it owned mm, lets find a new owner for it
  */
-static inline int
-mm_need_new_owner(struct mm_struct *mm, struct task_struct *p)
-{
-	/*
-	 * If there are other users of the mm and the owner (us) is exiting
-	 * we need to find a new owner to take on the responsibility.
-	 */
-	if (atomic_read(&mm->mm_users) <= 1)
-		return 0;
-	if (mm->owner != p)
-		return 0;
-	return 1;
-}
-
 void mm_update_next_owner(struct mm_struct *mm)
 {
 	struct task_struct *c, *g, *p = current;
 
 retry:
-	if (!mm_need_new_owner(mm, p))
+	/*
+	 * If the exiting or execing task is not the owner, it's
+	 * someone else's problem.
+	 */
+	if (mm->owner != p)
 		return;
+	/*
+	 * The current owner is exiting/execing and there are no other
+	 * candidates.  Do not leave the mm pointing to a possibly
+	 * freed task structure.
+	 */
+	if (atomic_read(&mm->mm_users) <= 1) {
+		mm->owner = NULL;
+		return;
+	}
 
 	read_lock(&tasklist_lock);
 	/*
-- 
1.7.4.1



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [BUGFIX][PATCH 4/5] memcg:  fix wrong check of noswap with softlimit
  2011-06-13  3:00 [BUGFIX][PATCH 0/5] memcg bugfixes in the last week KAMEZAWA Hiroyuki
                   ` (2 preceding siblings ...)
  2011-06-13  3:09 ` [BUGFIX][PATCH 3/5] memcg: clear mm->owner when last possible owner leaves KAMEZAWA Hiroyuki
@ 2011-06-13  3:11 ` KAMEZAWA Hiroyuki
  2011-06-14  9:48   ` Johannes Weiner
  2011-06-13  3:16 ` [BUGFIX][PATCH 5/5] memcg: fix percpu cached charge draining frequency KAMEZAWA Hiroyuki
  4 siblings, 1 reply; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-06-13  3:11 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, linux-mm, akpm, nishimura, bsingharora, hannes,
	Michal Hocko, Ying Han

>From 0a0358d300330a4ba86e39ea56ed63f1e4519dfd Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Date: Mon, 13 Jun 2011 10:31:16 +0900
Subject: [PATCH 4/5]  fix wrong check of noswap with softlimit

Hierarchical reclaim doesn't swap out if the memsw and resource limits
are the same (memsw_is_minimum == true), because we would hit the
mem+swap limit anyway (during hard limit reclaim).
When it comes to the soft limit, we shouldn't consider memsw_is_minimum
at all, because it doesn't make much sense there. Either the soft limit
is below the hard limit, and then we cannot hit the mem+swap limit, or
direct reclaim takes precedence.

Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ce05835..915c3f3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1663,7 +1663,7 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 	excess = res_counter_soft_limit_excess(&root_mem->res) >> PAGE_SHIFT;
 
 	/* If memsw_is_minimum==1, swap-out is of-no-use. */
-	if (root_mem->memsw_is_minimum)
+	if (!check_soft && root_mem->memsw_is_minimum)
 		noswap = true;
 
 	while (1) {
-- 
1.7.4.1
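
The fixed condition can be restated in isolation. The snippet below is only an
illustrative restatement of the one-line change (noswap_hint() is a made-up
name): memsw_is_minimum is now ignored whenever check_soft is set.

/* noswap_demo.c - illustrative restatement of the fixed check */
#include <stdbool.h>
#include <stdio.h>

static bool noswap_hint(bool check_soft, bool memsw_is_minimum)
{
	/* swap-out is useless only for hard limit reclaim when memsw == mem */
	return !check_soft && memsw_is_minimum;
}

int main(void)
{
	printf("hard-limit reclaim, memsw == mem: noswap=%d\n", noswap_hint(false, true));	/* 1 */
	printf("soft-limit reclaim, memsw == mem: noswap=%d\n", noswap_hint(true, true));	/* 0 */
	return 0;
}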



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [BUGFIX][PATCH 5/5] memcg: fix percpu cached charge draining frequency
  2011-06-13  3:00 [BUGFIX][PATCH 0/5] memcg bugfixes in the last week KAMEZAWA Hiroyuki
                   ` (3 preceding siblings ...)
  2011-06-13  3:11 ` [BUGFIX][PATCH 4/5] memcg: fix wrong check of noswap with softlimit KAMEZAWA Hiroyuki
@ 2011-06-13  3:16 ` KAMEZAWA Hiroyuki
  2011-06-13 21:25   ` Andrew Morton
                     ` (3 more replies)
  4 siblings, 4 replies; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-06-13  3:16 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, linux-mm, akpm, nishimura, bsingharora, hannes,
	Michal Hocko, Ying Han

>From 18b12e53f1cdf6d7feed1f9226c189c34866338c Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Date: Mon, 13 Jun 2011 11:25:43 +0900
Subject: [PATCH 5/5] memcg: fix percpu cached charge draining frequency

 For performance, the memory cgroup caches some "charge" from res_counter
 into a per cpu cache. This works well, but because it's a cache,
 it needs to be flushed in some cases. Typical cases are
         1. when someone hits the limit.
         2. when rmdir() is called and the charges need to be 0.

But "1" has a problem.

Recently, with large SMP machines, we see many kworker runs because
of flushing memcg's cache. The bad thing in the implementation is
that the drain code is called even if a cpu's cache holds charges for
a memcg unrelated to the memcg which hit the limit.

This patch does
	A) check whether the percpu cache contains useful data or not.
	B) check that no other asynchronous percpu draining is running.
	C) don't call the local cpu callback.
	D) don't call it at softlimit reclaim.

(*) This patch avoids changing the calling condition for the hard limit.

When I run "cat 1Gfile > /dev/null" under 300M limit memcg,

[Before]
13767 kamezawa  20   0 98.6m  424  416 D 10.0  0.0   0:00.61 cat
   58 root      20   0     0    0    0 S  0.6  0.0   0:00.09 kworker/2:1
   60 root      20   0     0    0    0 S  0.6  0.0   0:00.08 kworker/4:1
    4 root      20   0     0    0    0 S  0.3  0.0   0:00.02 kworker/0:0
   57 root      20   0     0    0    0 S  0.3  0.0   0:00.05 kworker/1:1
   61 root      20   0     0    0    0 S  0.3  0.0   0:00.05 kworker/5:1
   62 root      20   0     0    0    0 S  0.3  0.0   0:00.05 kworker/6:1
   63 root      20   0     0    0    0 S  0.3  0.0   0:00.05 kworker/7:1

[After]
 2676 root      20   0 98.6m  416  416 D  9.3  0.0   0:00.87 cat
 2626 kamezawa  20   0 15192 1312  920 R  0.3  0.0   0:00.28 top
    1 root      20   0 19384 1496 1204 S  0.0  0.0   0:00.66 init
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd
    3 root      20   0     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/0
    4 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kworker/0:0

Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Changelog:
  - cleaned up and added more comments.
  - don't call at softlimit reclaim.
---
 mm/memcontrol.c |   61 +++++++++++++++++++++++++++++++++++++++---------------
 1 files changed, 44 insertions(+), 17 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 915c3f3..79f68ab 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -359,7 +359,7 @@ enum charge_type {
 static void mem_cgroup_get(struct mem_cgroup *mem);
 static void mem_cgroup_put(struct mem_cgroup *mem);
 static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
-static void drain_all_stock_async(void);
+static void drain_all_stock_async(struct mem_cgroup *mem);
 
 static struct mem_cgroup_per_zone *
 mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
@@ -1670,8 +1670,8 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 		victim = mem_cgroup_select_victim(root_mem);
 		if (victim == root_mem) {
 			loop++;
-			if (loop >= 1)
-				drain_all_stock_async();
+			if (!check_soft && loop >= 1)
+				drain_all_stock_async(root_mem);
 			if (loop >= 2) {
 				/*
 				 * If we have not been able to reclaim
@@ -1934,9 +1934,11 @@ struct memcg_stock_pcp {
 	struct mem_cgroup *cached; /* this never be root cgroup */
 	unsigned int nr_pages;
 	struct work_struct work;
+	unsigned long flags;
+#define FLUSHING_CACHED_CHARGE	(0)
 };
 static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
-static atomic_t memcg_drain_count;
+DEFINE_MUTEX(percpu_charge_mutex);
 
 /*
  * Try to consume stocked charge on this cpu. If success, one page is consumed
@@ -1984,6 +1986,7 @@ static void drain_local_stock(struct work_struct *dummy)
 {
 	struct memcg_stock_pcp *stock = &__get_cpu_var(memcg_stock);
 	drain_stock(stock);
+	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
 }
 
 /*
@@ -2008,26 +2011,50 @@ static void refill_stock(struct mem_cgroup *mem, unsigned int nr_pages)
  * expects some charges will be back to res_counter later but cannot wait for
  * it.
  */
-static void drain_all_stock_async(void)
+static void drain_all_stock_async(struct mem_cgroup *root_mem)
 {
-	int cpu;
-	/* This function is for scheduling "drain" in asynchronous way.
-	 * The result of "drain" is not directly handled by callers. Then,
-	 * if someone is calling drain, we don't have to call drain more.
-	 * Anyway, WORK_STRUCT_PENDING check in queue_work_on() will catch if
-	 * there is a race. We just do loose check here.
+	int cpu, curcpu;
+	/*
+	 * If someone calls draining, avoid adding more kworker runs.
 	 */
-	if (atomic_read(&memcg_drain_count))
+	if (!mutex_trylock(&percpu_charge_mutex))
 		return;
 	/* Notify other cpus that system-wide "drain" is running */
-	atomic_inc(&memcg_drain_count);
 	get_online_cpus();
+
+	/*
+	 * get a hint for avoiding draining charges on the current cpu,
+	 * which must be exhausted by our charging. But this is not
+	 * required to be a precise check, We use raw_smp_processor_id()
+	 * instead of getcpu()/putcpu().
+	 */
+	curcpu = raw_smp_processor_id();
 	for_each_online_cpu(cpu) {
 		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
-		schedule_work_on(cpu, &stock->work);
+		struct mem_cgroup *mem;
+
+		if (cpu == curcpu)
+			continue;
+
+		mem = stock->cached;
+		if (!mem)
+			continue;
+		if (mem != root_mem) {
+			if (!root_mem->use_hierarchy)
+				continue;
+			/* check whether "mem" is under tree of "root_mem" */
+			rcu_read_lock();
+			if (!css_is_ancestor(&mem->css, &root_mem->css)) {
+				rcu_read_unlock();
+				continue;
+			}
+			rcu_read_unlock();
+		}
+		if (!test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
+			schedule_work_on(cpu, &stock->work);
 	}
  	put_online_cpus();
-	atomic_dec(&memcg_drain_count);
+	mutex_unlock(&percpu_charge_mutex);
 	/* We don't wait for flush_work */
 }
 
@@ -2035,9 +2062,9 @@ static void drain_all_stock_async(void)
 static void drain_all_stock_sync(void)
 {
 	/* called when force_empty is called */
-	atomic_inc(&memcg_drain_count);
+	mutex_lock(&percpu_charge_mutex);
 	schedule_on_each_cpu(drain_local_stock);
-	atomic_dec(&memcg_drain_count);
+	mutex_unlock(&percpu_charge_mutex);
 }
 
 /*
-- 
1.7.4.1
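
The scheduling discipline introduced here can be sketched in userspace. The
snippet below is only an analogy under stated assumptions (pthreads stand in
for kworkers; the hierarchy check and the current-cpu skip are omitted): one
trylock gate so a single drain runs at a time, a skip of stocks with nothing
cached, and a per-stock flag set before queueing the worker and cleared by it,
so the same stock is never queued twice.

/* drain_sketch.c - userspace analogy only, not kernel code */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NCPUS 4

struct stock {
	atomic_bool flushing;	/* stands in for FLUSHING_CACHED_CHARGE */
	int nr_pages;		/* cached charge */
};

static struct stock stocks[NCPUS];
static pthread_mutex_t charge_mutex = PTHREAD_MUTEX_INITIALIZER;

static void *drain_worker(void *arg)
{
	struct stock *s = arg;

	printf("draining %d pages\n", s->nr_pages);
	s->nr_pages = 0;				/* drain_stock() */
	atomic_store(&s->flushing, false);		/* clear_bit() in drain_local_stock() */
	return NULL;
}

static void drain_all_async(void)
{
	pthread_t tid[NCPUS];
	int queued[NCPUS] = { 0 };

	if (pthread_mutex_trylock(&charge_mutex))	/* another drain is already running */
		return;
	for (int cpu = 0; cpu < NCPUS; cpu++) {
		if (!stocks[cpu].nr_pages)
			continue;			/* nothing useful cached */
		/* test_and_set_bit(): queue only if not already flushing */
		if (!atomic_exchange(&stocks[cpu].flushing, true)) {
			pthread_create(&tid[cpu], NULL, drain_worker, &stocks[cpu]);
			queued[cpu] = 1;
		}
	}
	pthread_mutex_unlock(&charge_mutex);
	/* the kernel code does not wait; we join only so the demo exits cleanly */
	for (int cpu = 0; cpu < NCPUS; cpu++)
		if (queued[cpu])
			pthread_join(tid[cpu], NULL);
}

int main(void)
{
	stocks[1].nr_pages = 32;
	stocks[3].nr_pages = 8;
	drain_all_async();	/* build with: cc -pthread drain_sketch.c */
	return 0;
}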



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [BUGFIX][PATCH 5/5] memcg: fix percpu cached charge draining frequency
  2011-06-13  3:16 ` [BUGFIX][PATCH 5/5] memcg: fix percpu cached charge draining frequency KAMEZAWA Hiroyuki
@ 2011-06-13 21:25   ` Andrew Morton
  2011-06-15  0:09     ` KAMEZAWA Hiroyuki
  2011-06-14  7:36   ` Michal Hocko
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2011-06-13 21:25 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, linux-mm, nishimura, bsingharora, hannes,
	Michal Hocko, Ying Han

On Mon, 13 Jun 2011 12:16:48 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> >From 18b12e53f1cdf6d7feed1f9226c189c34866338c Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Date: Mon, 13 Jun 2011 11:25:43 +0900
> Subject: [PATCH 5/5] memcg: fix percpu cached charge draining frequency
> 
>  For performance, memory cgroup caches some "charge" from res_counter
>  into per cpu cache. This works well but because it's cache,
>  it needs to be flushed in some cases. Typical cases are
>          1. when someone hit limit.
>          2. when rmdir() is called and need to charges to be 0.
> 
> But "1" has problem.
> 
> Recently, with large SMP machines, we see many kworker runs because
> of flushing memcg's cache. Bad things in implementation are
> that even if a cpu contains a cache for memcg not related to
> a memcg which hits limit, drain code is called.
> 
> This patch does
> 	A) check percpu cache contains a useful data or not.
> 	B) check other asynchronous percpu draining doesn't run.
> 	C) don't call local cpu callback.
> 	D) don't call at softlimit reclaim.
> 
>
> ...
>
> +DEFINE_MUTEX(percpu_charge_mutex);

I made this static.  If we later wish to give it kernel-wide scope then
"percpu_charge_mutex" will not be a good choice of name.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUGFIX][PATCH 5/5] memcg: fix percpu cached charge draining frequency
  2011-06-13  3:16 ` [BUGFIX][PATCH 5/5] memcg: fix percpu cached charge draining frequency KAMEZAWA Hiroyuki
  2011-06-13 21:25   ` Andrew Morton
@ 2011-06-14  7:36   ` Michal Hocko
  2011-06-15  0:12     ` KAMEZAWA Hiroyuki
  2011-06-14 10:04   ` Johannes Weiner
  2011-06-15  1:49   ` [BUGFIX][PATCH v6] " KAMEZAWA Hiroyuki
  3 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2011-06-14  7:36 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, linux-mm, akpm, nishimura, bsingharora, hannes, Ying Han

On Mon 13-06-11 12:16:48, KAMEZAWA Hiroyuki wrote:
> From 18b12e53f1cdf6d7feed1f9226c189c34866338c Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Date: Mon, 13 Jun 2011 11:25:43 +0900
> Subject: [PATCH 5/5] memcg: fix percpu cached charge draining frequency
> 
>  For performance, memory cgroup caches some "charge" from res_counter
>  into per cpu cache. This works well but because it's cache,
>  it needs to be flushed in some cases. Typical cases are
>          1. when someone hit limit.
>          2. when rmdir() is called and need to charges to be 0.
> 
> But "1" has problem.
> 
> Recently, with large SMP machines, we see many kworker runs because
> of flushing memcg's cache. Bad things in implementation are
> that even if a cpu contains a cache for memcg not related to
> a memcg which hits limit, drain code is called.
> 
> This patch does
> 	D) don't call at softlimit reclaim.

I think this needs some justification. The decision is not that
obvious IMO. I would say that this is a good decision because cached
charges will not help to free any memory (at least not directly) during
background reclaim. What about something like:
"
We are not draining per cpu cached charges during soft limit reclaim 
because background reclaim doesn't care about charges. It tries to free
some memory and charges will not give any.
Cached charges might influence only selection of the biggest soft limit
offender but as the call is done only after the selection has been
already done it makes no change.
"

Anyway, wouldn't it be better to have this change separate from the
async draining logic change?
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUGFIX][PATCH 1/5] memcg fix numa_stat permission
  2011-06-13  3:03 ` [BUGFIX][PATCH 1/5] memcg fix numa_stat permission KAMEZAWA Hiroyuki
@ 2011-06-14  9:42   ` Johannes Weiner
  0 siblings, 0 replies; 24+ messages in thread
From: Johannes Weiner @ 2011-06-14  9:42 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, linux-mm, akpm, nishimura, bsingharora, hannes,
	Michal Hocko, Ying Han

On Mon, Jun 13, 2011 at 12:03:01PM +0900, KAMEZAWA Hiroyuki wrote:
> This is already queued in mmotm.
> ==
> >From 3b1bec1d07ba21339697f069acab47dae6b81097 Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Date: Mon, 13 Jun 2011 09:32:28 +0900
> Subject: [PATCH 1/5] memcg fix numa_stat permission
> 
> commit 406eb0c9b ("memcg: add memory.numastat api for numa statistics")
>  adds memory.numa_stat file for memory cgroup.  But the file permissions
>  are wrong.
> 
> [kamezawa@bluextal linux-2.6]$ ls -l /cgroup/memory/A/memory.numa_stat
> ---------- 1 root root 0 Jun  9 18:36 /cgroup/memory/A/memory.numa_stat
> 
> This patch fixes the permission as
> 
> [root@bluextal kamezawa]# ls -l /cgroup/memory/A/memory.numa_stat
> -r--r--r-- 1 root root 0 Jun 10 16:49 /cgroup/memory/A/memory.numa_stat
> 
> Acked-by: Ying Han <yinghan@google.com>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Acked-by: Johannes Weiner <jweiner@redhat.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUGFIX][PATCH 2/5] memcg: fix init_page_cgroup nid with sparsemem
  2011-06-13  3:06 ` [BUGFIX][PATCH 2/5] memcg: fix init_page_cgroup nid with sparsemem KAMEZAWA Hiroyuki
@ 2011-06-14  9:44   ` Johannes Weiner
  2011-06-16 10:09   ` [-git build bug, PATCH] " Ingo Molnar
  1 sibling, 0 replies; 24+ messages in thread
From: Johannes Weiner @ 2011-06-14  9:44 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, linux-mm, akpm, nishimura, bsingharora, hannes,
	Michal Hocko, Ying Han

On Mon, Jun 13, 2011 at 12:06:08PM +0900, KAMEZAWA Hiroyuki wrote:
> added some clean ups.
> ==
> >From 45b8881201f73015c4e0352eb7e434f6e9c53fdd Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Date: Mon, 13 Jun 2011 10:09:17 +0900
> Subject: [PATCH 2/5] [BUGFIX] memcg: fix init_page_cgroup nid with sparsemem
> 
> 
> commit 21a3c96 makes page_cgroup allocation as NUMA aware. But that caused
> a problem https://bugzilla.kernel.org/show_bug.cgi?id=36192.
> 
> The problem was getting a NID from invalid struct pages, which was not
> initialized because it was out-of-node, out of [node_start_pfn, node_end_pfn)
> 
> Now, with sparsemem, page_cgroup_init scans pfn from 0 to max_pfn.
> But this may scan a pfn which is not on any node and can access
> memmap which is not initialized.
> 
> This makes page_cgroup_init() for SPARSEMEM node aware and remove
> a code to get nid from page->flags. (Then, we'll use valid NID
> always.)
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Acked-by: Johannes Weiner <jweiner@redhat.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUGFIX][PATCH 3/5] memcg: clear mm->owner when last possible owner leaves
  2011-06-13  3:09 ` [BUGFIX][PATCH 3/5] memcg: clear mm->owner when last possible owner leaves KAMEZAWA Hiroyuki
@ 2011-06-14  9:45   ` Johannes Weiner
  0 siblings, 0 replies; 24+ messages in thread
From: Johannes Weiner @ 2011-06-14  9:45 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, linux-mm, akpm, nishimura, bsingharora, hannes,
	Michal Hocko, Ying Han, Hugh Dickins, davej

On Mon, Jun 13, 2011 at 12:09:51PM +0900, KAMEZAWA Hiroyuki wrote:
> This is Hugh's version.

I approve of Hugh's fixes :-)  Thanks

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUGFIX][PATCH 4/5] memcg:  fix wrong check of noswap with softlimit
  2011-06-13  3:11 ` [BUGFIX][PATCH 4/5] memcg: fix wrong check of noswap with softlimit KAMEZAWA Hiroyuki
@ 2011-06-14  9:48   ` Johannes Weiner
  0 siblings, 0 replies; 24+ messages in thread
From: Johannes Weiner @ 2011-06-14  9:48 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, linux-mm, akpm, nishimura, bsingharora, hannes,
	Michal Hocko, Ying Han

On Mon, Jun 13, 2011 at 12:11:05PM +0900, KAMEZAWA Hiroyuki wrote:
> >From 0a0358d300330a4ba86e39ea56ed63f1e4519dfd Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Date: Mon, 13 Jun 2011 10:31:16 +0900
> Subject: [PATCH 4/5]  fix wrong check of noswap with softlimit
> 
> Hierarchical reclaim doesn't swap out if memsw and resource limits are
> same (memsw_is_minimum == true) because we would hit mem+swap limit
> anyway (during hard limit reclaim).
> If it comes to the solft limit we shouldn't consider memsw_is_minimum at
> all because it doesn't make much sense. Either the soft limit is bellow
> the hard limit and then we cannot hit mem+swap limit or the direct
> reclaim takes a precedence.
> 
> Reviewed-by: Michal Hocko <mhocko@suse.cz>
> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Acked-by: Johannes Weiner <jweiner@redhat.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUGFIX][PATCH 5/5] memcg: fix percpu cached charge draining frequency
  2011-06-13  3:16 ` [BUGFIX][PATCH 5/5] memcg: fix percpu cached charge draining frequency KAMEZAWA Hiroyuki
  2011-06-13 21:25   ` Andrew Morton
  2011-06-14  7:36   ` Michal Hocko
@ 2011-06-14 10:04   ` Johannes Weiner
  2011-06-15  0:16     ` KAMEZAWA Hiroyuki
  2011-06-15  1:49   ` [BUGFIX][PATCH v6] " KAMEZAWA Hiroyuki
  3 siblings, 1 reply; 24+ messages in thread
From: Johannes Weiner @ 2011-06-14 10:04 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, linux-mm, akpm, nishimura, bsingharora, hannes,
	Michal Hocko, Ying Han

On Mon, Jun 13, 2011 at 12:16:48PM +0900, KAMEZAWA Hiroyuki wrote:
> @@ -1670,8 +1670,8 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
>  		victim = mem_cgroup_select_victim(root_mem);
>  		if (victim == root_mem) {
>  			loop++;
> -			if (loop >= 1)
> -				drain_all_stock_async();
> +			if (!check_soft && loop >= 1)
> +				drain_all_stock_async(root_mem);

I agree with Michal, this should be a separate change.

> @@ -2008,26 +2011,50 @@ static void refill_stock(struct mem_cgroup *mem, unsigned int nr_pages)
>   * expects some charges will be back to res_counter later but cannot wait for
>   * it.
>   */
> -static void drain_all_stock_async(void)
> +static void drain_all_stock_async(struct mem_cgroup *root_mem)
>  {
> -	int cpu;
> -	/* This function is for scheduling "drain" in asynchronous way.
> -	 * The result of "drain" is not directly handled by callers. Then,
> -	 * if someone is calling drain, we don't have to call drain more.
> -	 * Anyway, WORK_STRUCT_PENDING check in queue_work_on() will catch if
> -	 * there is a race. We just do loose check here.
> +	int cpu, curcpu;
> +	/*
> +	 * If someone calls draining, avoid adding more kworker runs.
>  	 */
> -	if (atomic_read(&memcg_drain_count))
> +	if (!mutex_trylock(&percpu_charge_mutex))
>  		return;
>  	/* Notify other cpus that system-wide "drain" is running */
> -	atomic_inc(&memcg_drain_count);
>  	get_online_cpus();
> +
> +	/*
> +	 * get a hint for avoiding draining charges on the current cpu,
> +	 * which must be exhausted by our charging. But this is not
> +	 * required to be a precise check, We use raw_smp_processor_id()
> +	 * instead of getcpu()/putcpu().
> +	 */
> +	curcpu = raw_smp_processor_id();
>  	for_each_online_cpu(cpu) {
>  		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> -		schedule_work_on(cpu, &stock->work);
> +		struct mem_cgroup *mem;
> +
> +		if (cpu == curcpu)
> +			continue;
> +
> +		mem = stock->cached;
> +		if (!mem)
> +			continue;
> +		if (mem != root_mem) {
> +			if (!root_mem->use_hierarchy)
> +				continue;
> +			/* check whether "mem" is under tree of "root_mem" */
> +			rcu_read_lock();
> +			if (!css_is_ancestor(&mem->css, &root_mem->css)) {
> +				rcu_read_unlock();
> +				continue;
> +			}
> +			rcu_read_unlock();

css_is_ancestor() takes the rcu read lock itself already.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUGFIX][PATCH 5/5] memcg: fix percpu cached charge draining frequency
  2011-06-13 21:25   ` Andrew Morton
@ 2011-06-15  0:09     ` KAMEZAWA Hiroyuki
  2011-06-15  1:19       ` Andrew Morton
  0 siblings, 1 reply; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-06-15  0:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, nishimura, bsingharora, hannes,
	Michal Hocko, Ying Han

On Mon, 13 Jun 2011 14:25:01 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Mon, 13 Jun 2011 12:16:48 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > >From 18b12e53f1cdf6d7feed1f9226c189c34866338c Mon Sep 17 00:00:00 2001
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Date: Mon, 13 Jun 2011 11:25:43 +0900
> > Subject: [PATCH 5/5] memcg: fix percpu cached charge draining frequency
> > 
> >  For performance, memory cgroup caches some "charge" from res_counter
> >  into per cpu cache. This works well but because it's cache,
> >  it needs to be flushed in some cases. Typical cases are
> >          1. when someone hit limit.
> >          2. when rmdir() is called and need to charges to be 0.
> > 
> > But "1" has problem.
> > 
> > Recently, with large SMP machines, we see many kworker runs because
> > of flushing memcg's cache. Bad things in implementation are
> > that even if a cpu contains a cache for memcg not related to
> > a memcg which hits limit, drain code is called.
> > 
> > This patch does
> > 	A) check percpu cache contains a useful data or not.
> > 	B) check other asynchronous percpu draining doesn't run.
> > 	C) don't call local cpu callback.
> > 	D) don't call at softlimit reclaim.
> > 
> >
> > ...
> >
> > +DEFINE_MUTEX(percpu_charge_mutex);
> 
> I made this static.  If we later wish to give it kernel-wide scope then
> "percpu_charge_mutex" will not be a good choice of name.

Thank you.
And, yes..... memcg_cached_charge_mutex ?

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUGFIX][PATCH 5/5] memcg: fix percpu cached charge draining frequency
  2011-06-14  7:36   ` Michal Hocko
@ 2011-06-15  0:12     ` KAMEZAWA Hiroyuki
  2011-06-15  1:12       ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-06-15  0:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, linux-mm, akpm, nishimura, bsingharora, hannes, Ying Han

On Tue, 14 Jun 2011 09:36:51 +0200
Michal Hocko <mhocko@suse.cz> wrote:

> On Mon 13-06-11 12:16:48, KAMEZAWA Hiroyuki wrote:
> > From 18b12e53f1cdf6d7feed1f9226c189c34866338c Mon Sep 17 00:00:00 2001
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Date: Mon, 13 Jun 2011 11:25:43 +0900
> > Subject: [PATCH 5/5] memcg: fix percpu cached charge draining frequency
> > 
> >  For performance, memory cgroup caches some "charge" from res_counter
> >  into per cpu cache. This works well but because it's cache,
> >  it needs to be flushed in some cases. Typical cases are
> >          1. when someone hit limit.
> >          2. when rmdir() is called and need to charges to be 0.
> > 
> > But "1" has problem.
> > 
> > Recently, with large SMP machines, we see many kworker runs because
> > of flushing memcg's cache. Bad things in implementation are
> > that even if a cpu contains a cache for memcg not related to
> > a memcg which hits limit, drain code is called.
> > 
> > This patch does
> > 	D) don't call at softlimit reclaim.
> 
> I think this needs some justification. The decision is not that
> obvious IMO. I would say that this is a good decision because cached
> charges will not help to free any memory (at least not directly) during
> background reclaim. What about something like:
> "
> We are not draining per cpu cached charges during soft limit reclaim 
> because background reclaim doesn't care about charges. It tries to free
> some memory and charges will not give any.
> Cached charges might influence only selection of the biggest soft limit
> offender but as the call is done only after the selection has been
> already done it makes no change.
> "
> 
> Anyway, wouldn't it be better to have this change separate from the
> async draining logic change?

Hmm. I think calling "draining" at softlimit is just a bug.

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUGFIX][PATCH 5/5] memcg: fix percpu cached charge draining frequency
  2011-06-14 10:04   ` Johannes Weiner
@ 2011-06-15  0:16     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-06-15  0:16 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-kernel, linux-mm, akpm, nishimura, bsingharora, hannes,
	Michal Hocko, Ying Han

On Tue, 14 Jun 2011 12:04:12 +0200
Johannes Weiner <jweiner@redhat.com> wrote:

> On Mon, Jun 13, 2011 at 12:16:48PM +0900, KAMEZAWA Hiroyuki wrote:
> > @@ -1670,8 +1670,8 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> >  		victim = mem_cgroup_select_victim(root_mem);
> >  		if (victim == root_mem) {
> >  			loop++;
> > -			if (loop >= 1)
> > -				drain_all_stock_async();
> > +			if (!check_soft && loop >= 1)
> > +				drain_all_stock_async(root_mem);
> 
> I agree with Michal, this should be a separate change.
> 

Hm, ok, I'll do.

> > @@ -2008,26 +2011,50 @@ static void refill_stock(struct mem_cgroup *mem, unsigned int nr_pages)
> >   * expects some charges will be back to res_counter later but cannot wait for
> >   * it.
> >   */
> > -static void drain_all_stock_async(void)
> > +static void drain_all_stock_async(struct mem_cgroup *root_mem)
> >  {
> > -	int cpu;
> > -	/* This function is for scheduling "drain" in asynchronous way.
> > -	 * The result of "drain" is not directly handled by callers. Then,
> > -	 * if someone is calling drain, we don't have to call drain more.
> > -	 * Anyway, WORK_STRUCT_PENDING check in queue_work_on() will catch if
> > -	 * there is a race. We just do loose check here.
> > +	int cpu, curcpu;
> > +	/*
> > +	 * If someone calls draining, avoid adding more kworker runs.
> >  	 */
> > -	if (atomic_read(&memcg_drain_count))
> > +	if (!mutex_trylock(&percpu_charge_mutex))
> >  		return;
> >  	/* Notify other cpus that system-wide "drain" is running */
> > -	atomic_inc(&memcg_drain_count);
> >  	get_online_cpus();
> > +
> > +	/*
> > +	 * get a hint for avoiding draining charges on the current cpu,
> > +	 * which must be exhausted by our charging. But this is not
> > +	 * required to be a precise check, We use raw_smp_processor_id()
> > +	 * instead of getcpu()/putcpu().
> > +	 */
> > +	curcpu = raw_smp_processor_id();
> >  	for_each_online_cpu(cpu) {
> >  		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> > -		schedule_work_on(cpu, &stock->work);
> > +		struct mem_cgroup *mem;
> > +
> > +		if (cpu == curcpu)
> > +			continue;
> > +
> > +		mem = stock->cached;
> > +		if (!mem)
> > +			continue;
> > +		if (mem != root_mem) {
> > +			if (!root_mem->use_hierarchy)
> > +				continue;
> > +			/* check whether "mem" is under tree of "root_mem" */
> > +			rcu_read_lock();
> > +			if (!css_is_ancestor(&mem->css, &root_mem->css)) {
> > +				rcu_read_unlock();
> > +				continue;
> > +			}
> > +			rcu_read_unlock();
> 
> css_is_ancestor() takes the rcu read lock itself already.
> 

you're right.

I'll post an update.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUGFIX][PATCH 5/5] memcg: fix percpu cached charge draining frequency
  2011-06-15  0:12     ` KAMEZAWA Hiroyuki
@ 2011-06-15  1:12       ` KAMEZAWA Hiroyuki
  2011-06-15  7:15         ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-06-15  1:12 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Michal Hocko, linux-kernel, linux-mm, akpm, nishimura,
	bsingharora, hannes, Ying Han

On Wed, 15 Jun 2011 09:12:45 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Tue, 14 Jun 2011 09:36:51 +0200
> Michal Hocko <mhocko@suse.cz> wrote:
> 
> > On Mon 13-06-11 12:16:48, KAMEZAWA Hiroyuki wrote:
> > > From 18b12e53f1cdf6d7feed1f9226c189c34866338c Mon Sep 17 00:00:00 2001
> > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > Date: Mon, 13 Jun 2011 11:25:43 +0900
> > > Subject: [PATCH 5/5] memcg: fix percpu cached charge draining frequency
> > > 
> > >  For performance, memory cgroup caches some "charge" from res_counter
> > >  into per cpu cache. This works well but because it's cache,
> > >  it needs to be flushed in some cases. Typical cases are
> > >          1. when someone hit limit.
> > >          2. when rmdir() is called and need to charges to be 0.
> > > 
> > > But "1" has problem.
> > > 
> > > Recently, with large SMP machines, we see many kworker runs because
> > > of flushing memcg's cache. Bad things in implementation are
> > > that even if a cpu contains a cache for memcg not related to
> > > a memcg which hits limit, drain code is called.
> > > 
> > > This patch does
> > > 	D) don't call at softlimit reclaim.
> > 
> > I think this needs some justification. The decision is not that
> > obvious IMO. I would say that this is a good decision because cached
> > charges will not help to free any memory (at least not directly) during
> > background reclaim. What about something like:
> > "
> > We are not draining per cpu cached charges during soft limit reclaim 
> > because background reclaim doesn't care about charges. It tries to free
> > some memory and charges will not give any.
> > Cached charges might influence only selection of the biggest soft limit
> > offender but as the call is done only after the selection has been
> > already done it makes no change.
> > "
> > 
> > Anyway, wouldn't it be better to have this change separate from the
> > async draining logic change?
> 
> Hmm. I think calling "draining" at softlimit is just a bug.
> 
I'll divide patches.

Thanks,
-kame


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUGFIX][PATCH 5/5] memcg: fix percpu cached charge draining frequency
  2011-06-15  0:09     ` KAMEZAWA Hiroyuki
@ 2011-06-15  1:19       ` Andrew Morton
  0 siblings, 0 replies; 24+ messages in thread
From: Andrew Morton @ 2011-06-15  1:19 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, linux-mm, nishimura, bsingharora, hannes,
	Michal Hocko, Ying Han

On Wed, 15 Jun 2011 09:09:35 +0900 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> > I made this static.  If we later wish to give it kernel-wide scope then
> > "percpu_charge_mutex" will not be a good choice of name.
> 
> Thank you.
> And, yes..... memcg_cached_charge_mutex ?

I don't see a need to rename the mutex unless we change it to have
kernel-wide scope.  Given that it's presently static to memcontrol.c,
it's obviously part of memcg ;)


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [BUGFIX][PATCH v6] memcg: fix percpu cached charge draining frequency
  2011-06-13  3:16 ` [BUGFIX][PATCH 5/5] memcg: fix percpu cached charge draining frequency KAMEZAWA Hiroyuki
                     ` (2 preceding siblings ...)
  2011-06-14 10:04   ` Johannes Weiner
@ 2011-06-15  1:49   ` KAMEZAWA Hiroyuki
  2011-06-15  1:52     ` [BUGFIX][PATCH v2] memcg: avoid percpu cached charge draining at softlimit KAMEZAWA Hiroyuki
  2011-06-15  5:30     ` [BUGFIX][PATCH v6] memcg: fix percpu cached charge draining frequency Ying Han
  3 siblings, 2 replies; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-06-15  1:49 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, linux-mm, akpm, nishimura, bsingharora, hannes,
	Michal Hocko, Ying Han

This is a replacement for
memcg-fix-percpu-cached-charge-draining-frequency.patch
+
memcg-fix-percpu-cached-charge-draining-frequency-fix.patch


Changelog:
  - removed unnecessary rcu_read_lock()
  - removed a fix for softlimit case (move to another independent patch)
  - make mutex static.
  - applied comment updates from Andrew Morton.

A patch for softlimit will follow this.

==
>From f3f41b827d70142858ba8b370510a82d608870d0 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Date: Wed, 15 Jun 2011 10:39:57 +0900
Subject: [PATCH 5/6] memcg: fix behavior of per cpu charge cache draining.

 For performance, the memory cgroup caches some "charge" from res_counter
 into a per cpu cache. This works well, but because it's a cache,
 it needs to be flushed in some cases. Typical cases are
         1. when someone hits the limit.
         2. when rmdir() is called and the charges need to be 0.

But "1" has a problem.

Recently, with large SMP machines, we see many kworker runs because
of flushing memcg's cache. The bad thing in the implementation is
that the drain code is called even if a cpu's cache holds charges for
a memcg unrelated to the memcg which hit the limit.

This patch does
        A) check whether the percpu cache contains useful data or not.
        B) check that no other asynchronous percpu draining is running.
        C) don't call the local cpu callback.

(*) This patch avoids changing the calling condition for the hard limit.

When I run "cat 1Gfile > /dev/null" under 300M limit memcg,

[Before]
13767 kamezawa  20   0 98.6m  424  416 D 10.0  0.0   0:00.61 cat
   58 root      20   0     0    0    0 S  0.6  0.0   0:00.09 kworker/2:1
   60 root      20   0     0    0    0 S  0.6  0.0   0:00.08 kworker/4:1
    4 root      20   0     0    0    0 S  0.3  0.0   0:00.02 kworker/0:0
   57 root      20   0     0    0    0 S  0.3  0.0   0:00.05 kworker/1:1
   61 root      20   0     0    0    0 S  0.3  0.0   0:00.05 kworker/5:1
   62 root      20   0     0    0    0 S  0.3  0.0   0:00.05 kworker/6:1
   63 root      20   0     0    0    0 S  0.3  0.0   0:00.05 kworker/7:1

[After]
 2676 root      20   0 98.6m  416  416 D  9.3  0.0   0:00.87 cat
 2626 kamezawa  20   0 15192 1312  920 R  0.3  0.0   0:00.28 top
    1 root      20   0 19384 1496 1204 S  0.0  0.0   0:00.66 init
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd
    3 root      20   0     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/0
    4 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kworker/0:0

Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Changelog:
  - removed unnecessary rcu_read_lock()
  - removed a fix for softlimit case (move to another independent patch)
  - make mutex static.
  - applied comment updates from Andrew Morton.
---
 mm/memcontrol.c |   54 ++++++++++++++++++++++++++++++++++++++----------------
 1 files changed, 38 insertions(+), 16 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 915c3f3..8fb29de 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -359,7 +359,7 @@ enum charge_type {
 static void mem_cgroup_get(struct mem_cgroup *mem);
 static void mem_cgroup_put(struct mem_cgroup *mem);
 static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
-static void drain_all_stock_async(void);
+static void drain_all_stock_async(struct mem_cgroup *mem);
 
 static struct mem_cgroup_per_zone *
 mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
@@ -1671,7 +1671,7 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 		if (victim == root_mem) {
 			loop++;
 			if (loop >= 1)
-				drain_all_stock_async();
+				drain_all_stock_async(root_mem);
 			if (loop >= 2) {
 				/*
 				 * If we have not been able to reclaim
@@ -1934,9 +1934,11 @@ struct memcg_stock_pcp {
 	struct mem_cgroup *cached; /* this never be root cgroup */
 	unsigned int nr_pages;
 	struct work_struct work;
+	unsigned long flags;
+#define FLUSHING_CACHED_CHARGE	(0)
 };
 static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
-static atomic_t memcg_drain_count;
+static DEFINE_MUTEX(percpu_charge_mutex);
 
 /*
  * Try to consume stocked charge on this cpu. If success, one page is consumed
@@ -1984,6 +1986,7 @@ static void drain_local_stock(struct work_struct *dummy)
 {
 	struct memcg_stock_pcp *stock = &__get_cpu_var(memcg_stock);
 	drain_stock(stock);
+	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
 }
 
 /*
@@ -2008,26 +2011,45 @@ static void refill_stock(struct mem_cgroup *mem, unsigned int nr_pages)
  * expects some charges will be back to res_counter later but cannot wait for
  * it.
  */
-static void drain_all_stock_async(void)
+static void drain_all_stock_async(struct mem_cgroup *root_mem)
 {
-	int cpu;
-	/* This function is for scheduling "drain" in asynchronous way.
-	 * The result of "drain" is not directly handled by callers. Then,
-	 * if someone is calling drain, we don't have to call drain more.
-	 * Anyway, WORK_STRUCT_PENDING check in queue_work_on() will catch if
-	 * there is a race. We just do loose check here.
+	int cpu, curcpu;
+	/*
+	 * If someone calls draining, avoid adding more kworker runs.
 	 */
-	if (atomic_read(&memcg_drain_count))
+	if (!mutex_trylock(&percpu_charge_mutex))
 		return;
 	/* Notify other cpus that system-wide "drain" is running */
-	atomic_inc(&memcg_drain_count);
 	get_online_cpus();
+	/*
+	 * Get a hint for avoiding draining charges on the current cpu,
+	 * which must be exhausted by our charging.  It is not required that
+	 * this be a precise check, so we use raw_smp_processor_id() instead of
+	 * getcpu()/putcpu().
+	 */
+	curcpu = raw_smp_processor_id();
 	for_each_online_cpu(cpu) {
 		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
-		schedule_work_on(cpu, &stock->work);
+		struct mem_cgroup *mem;
+
+		if (cpu == curcpu)
+			continue;
+
+		mem = stock->cached;
+		if (!mem)
+			continue;
+		if (mem != root_mem) {
+			if (!root_mem->use_hierarchy)
+				continue;
+			/* check whether "mem" is under tree of "root_mem" */
+			if (!css_is_ancestor(&mem->css, &root_mem->css))
+				continue;
+		}
+		if (!test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
+			schedule_work_on(cpu, &stock->work);
 	}
  	put_online_cpus();
-	atomic_dec(&memcg_drain_count);
+	mutex_unlock(&percpu_charge_mutex);
 	/* We don't wait for flush_work */
 }
 
@@ -2035,9 +2057,9 @@ static void drain_all_stock_async(void)
 static void drain_all_stock_sync(void)
 {
 	/* called when force_empty is called */
-	atomic_inc(&memcg_drain_count);
+	mutex_lock(&percpu_charge_mutex);
 	schedule_on_each_cpu(drain_local_stock);
-	atomic_dec(&memcg_drain_count);
+	mutex_unlock(&percpu_charge_mutex);
 }
 
 /*
-- 
1.7.4.1



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [BUGFIX][PATCH v2] memcg: avoid percpu cached charge draining at softlimit
  2011-06-15  1:49   ` [BUGFIX][PATCH v6] " KAMEZAWA Hiroyuki
@ 2011-06-15  1:52     ` KAMEZAWA Hiroyuki
  2011-06-15  5:30     ` [BUGFIX][PATCH v6] memcg: fix percpu cached charge draining frequency Ying Han
  1 sibling, 0 replies; 24+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-06-15  1:52 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, linux-mm, akpm, nishimura, bsingharora, hannes,
	Michal Hocko, Ying Han

Based on Michal Hocko's comment.

=
>From 5d1121e4e5425187a58025cb63be8e297a97f624 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Date: Wed, 15 Jun 2011 10:44:29 +0900
Subject: [PATCH 6/6] memcg: avoid percpu cached charge draining at softlimit.

 We are not draining per cpu cached charges during soft limit reclaim
 because background reclaim doesn't care about charges. It tries to free
 some memory, and charges will not give it any.
 Cached charges might influence only the selection of the biggest soft
 limit offender, but as the call is done only after the selection has
 already been made, it makes no difference.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |    8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8fb29de..fa5c918 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1670,7 +1670,13 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
 		victim = mem_cgroup_select_victim(root_mem);
 		if (victim == root_mem) {
 			loop++;
-			if (loop >= 1)
+			/*
+			 * We are not draining per cpu cached charges during
+			 * soft limit reclaim  because global reclaim doesn't
+			 * care about charges. It tries to free some memory and
+			 * charges will not give any.
+			 */
+			if (!check_soft && loop >= 1)
 				drain_all_stock_async(root_mem);
 			if (loop >= 2) {
 				/*
-- 
1.7.4.1



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [BUGFIX][PATCH v6] memcg: fix percpu cached charge draining frequency
  2011-06-15  1:49   ` [BUGFIX][PATCH v6] " KAMEZAWA Hiroyuki
  2011-06-15  1:52     ` [BUGFIX][PATCH v2] memcg: avoid percpu cached charge draining at softlimit KAMEZAWA Hiroyuki
@ 2011-06-15  5:30     ` Ying Han
  1 sibling, 0 replies; 24+ messages in thread
From: Ying Han @ 2011-06-15  5:30 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, linux-mm, akpm, nishimura, bsingharora, hannes,
	Michal Hocko

On Tue, Jun 14, 2011 at 6:49 PM, KAMEZAWA Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> This is a repleacement for
> memcg-fix-percpu-cached-charge-draining-frequency.patch
> +
> memcg-fix-percpu-cached-charge-draining-frequency-fix.patch
>
>
> Changelog:
>  - removed unnecessary rcu_read_lock()
>  - removed a fix for softlimit case (move to another independent patch)
>  - make mutex static.
>  - applied comment updates from Andrew Morton.
>
> A patch for softlimit will follow this.
>
> ==
> From f3f41b827d70142858ba8b370510a82d608870d0 Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Date: Wed, 15 Jun 2011 10:39:57 +0900
> Subject: [PATCH 5/6] memcg: fix behavior of per cpu charge cache draining.
>
>  For performance, memory cgroup caches some "charge" from res_counter
>  into per cpu cache. This works well but because it's cache,
>  it needs to be flushed in some cases. Typical cases are
>         1. when someone hit limit.
>         2. when rmdir() is called and need to charges to be 0.
>
> But "1" has problem.
>
> Recently, with large SMP machines, we see many kworker runs because
> of flushing memcg's cache. Bad things in implementation are
> that even if a cpu contains a cache for memcg not related to
> a memcg which hits limit, drain code is called.
>
> This patch does
>        A) check percpu cache contains a useful data or not.
>        B) check other asynchronous percpu draining doesn't run.
>        C) don't call local cpu callback.
>
> (*)This patch avoid changing the calling condition with hard-limit.
>
> When I run "cat 1Gfile > /dev/null" under 300M limit memcg,
>
> [Before]
> 13767 kamezawa  20   0 98.6m  424  416 D 10.0  0.0   0:00.61 cat
>   58 root      20   0     0    0    0 S  0.6  0.0   0:00.09 kworker/2:1
>   60 root      20   0     0    0    0 S  0.6  0.0   0:00.08 kworker/4:1
>    4 root      20   0     0    0    0 S  0.3  0.0   0:00.02 kworker/0:0
>   57 root      20   0     0    0    0 S  0.3  0.0   0:00.05 kworker/1:1
>   61 root      20   0     0    0    0 S  0.3  0.0   0:00.05 kworker/5:1
>   62 root      20   0     0    0    0 S  0.3  0.0   0:00.05 kworker/6:1
>   63 root      20   0     0    0    0 S  0.3  0.0   0:00.05 kworker/7:1
>
> [After]
>  2676 root      20   0 98.6m  416  416 D  9.3  0.0   0:00.87 cat
>  2626 kamezawa  20   0 15192 1312  920 R  0.3  0.0   0:00.28 top
>    1 root      20   0 19384 1496 1204 S  0.0  0.0   0:00.66 init
>    2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd
>    3 root      20   0     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/0
>    4 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kworker/0:0
>
> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> Reviewed-by: Michal Hocko <mhocko@suse.cz>
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> Changelog:
>  - removed unnecessary rcu_read_lock()
>  - removed a fix for softlimit case (move to another independent patch)
>  - make mutex static.
>  - applied comment updates from Andrew Morton.
> ---
>  mm/memcontrol.c |   54 ++++++++++++++++++++++++++++++++++++++----------------
>  1 files changed, 38 insertions(+), 16 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 915c3f3..8fb29de 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -359,7 +359,7 @@ enum charge_type {
>  static void mem_cgroup_get(struct mem_cgroup *mem);
>  static void mem_cgroup_put(struct mem_cgroup *mem);
>  static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
> -static void drain_all_stock_async(void);
> +static void drain_all_stock_async(struct mem_cgroup *mem);
>
>  static struct mem_cgroup_per_zone *
>  mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
> @@ -1671,7 +1671,7 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
>                if (victim == root_mem) {
>                        loop++;
>                        if (loop >= 1)
> -                               drain_all_stock_async();
> +                               drain_all_stock_async(root_mem);
>                        if (loop >= 2) {
>                                /*
>                                 * If we have not been able to reclaim
> @@ -1934,9 +1934,11 @@ struct memcg_stock_pcp {
>        struct mem_cgroup *cached; /* this never be root cgroup */
>        unsigned int nr_pages;
>        struct work_struct work;
> +       unsigned long flags;
> +#define FLUSHING_CACHED_CHARGE (0)
>  };
>  static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
> -static atomic_t memcg_drain_count;
> +static DEFINE_MUTEX(percpu_charge_mutex);
>
>  /*
>  * Try to consume stocked charge on this cpu. If success, one page is consumed
> @@ -1984,6 +1986,7 @@ static void drain_local_stock(struct work_struct *dummy)
>  {
>        struct memcg_stock_pcp *stock = &__get_cpu_var(memcg_stock);
>        drain_stock(stock);
> +       clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
>  }
>
>  /*
> @@ -2008,26 +2011,45 @@ static void refill_stock(struct mem_cgroup *mem, unsigned int nr_pages)
>  * expects some charges will be back to res_counter later but cannot wait for
>  * it.
>  */
> -static void drain_all_stock_async(void)
> +static void drain_all_stock_async(struct mem_cgroup *root_mem)
>  {
> -       int cpu;
> -       /* This function is for scheduling "drain" in asynchronous way.
> -        * The result of "drain" is not directly handled by callers. Then,
> -        * if someone is calling drain, we don't have to call drain more.
> -        * Anyway, WORK_STRUCT_PENDING check in queue_work_on() will catch if
> -        * there is a race. We just do loose check here.
> +       int cpu, curcpu;
> +       /*
> +        * If someone calls draining, avoid adding more kworker runs.
>         */
> -       if (atomic_read(&memcg_drain_count))
> +       if (!mutex_trylock(&percpu_charge_mutex))
>                return;
>        /* Notify other cpus that system-wide "drain" is running */
> -       atomic_inc(&memcg_drain_count);
>        get_online_cpus();
> +       /*
> +        * Get a hint for avoiding draining charges on the current cpu,
> +        * which must be exhausted by our charging.  It is not required that
> +        * this be a precise check, so we use raw_smp_processor_id() instead of
> +        * getcpu()/putcpu().
> +        */
> +       curcpu = raw_smp_processor_id();
>        for_each_online_cpu(cpu) {
>                struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> -               schedule_work_on(cpu, &stock->work);
> +               struct mem_cgroup *mem;
> +
> +               if (cpu == curcpu)
> +                       continue;
> +
> +               mem = stock->cached;
> +               if (!mem)
> +                       continue;
> +               if (mem != root_mem) {
> +                       if (!root_mem->use_hierarchy)
> +                               continue;
> +                       /* check whether "mem" is under tree of "root_mem" */
> +                       if (!css_is_ancestor(&mem->css, &root_mem->css))
> +                               continue;
> +               }
> +               if (!test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
> +                       schedule_work_on(cpu, &stock->work);
>        }
>        put_online_cpus();
> -       atomic_dec(&memcg_drain_count);
> +       mutex_unlock(&percpu_charge_mutex);
>        /* We don't wait for flush_work */
>  }
>
> @@ -2035,9 +2057,9 @@ static void drain_all_stock_async(void)
>  static void drain_all_stock_sync(void)
>  {
>        /* called when force_empty is called */
> -       atomic_inc(&memcg_drain_count);
> +       mutex_lock(&percpu_charge_mutex);
>        schedule_on_each_cpu(drain_local_stock);
> -       atomic_dec(&memcg_drain_count);
> +       mutex_unlock(&percpu_charge_mutex);
>  }
>
>  /*
> --
> 1.7.4.1
>
>
>

Tested-by: Ying Han <yinghan@google.com>


--Ying
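
A side note on the hierarchy filter in the patch quoted above: the rule is
that a cached charge is only worth flushing when it was cached for the group
that hit its limit or for one of that group's descendants. A stand-alone
sketch of that check follows; the types and names are hypothetical, and
css_is_ancestor() is approximated by a plain parent-pointer walk:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct group {
	const char *name;
	struct group *parent;
	bool use_hierarchy;
};

/* Rough css_is_ancestor() stand-in: walk from child towards the root. */
static bool is_ancestor(const struct group *ancestor, const struct group *child)
{
	for (; child; child = child->parent)
		if (child == ancestor)
			return true;
	return false;
}

/* Should a stock cached for "cached" be flushed when "root" hit its limit? */
static bool should_drain(const struct group *root, const struct group *cached)
{
	if (!cached)
		return false;			/* empty stock: nothing to give back */
	if (cached == root)
		return true;
	if (!root->use_hierarchy)
		return false;			/* other groups' charges cannot help */
	return is_ancestor(root, cached);	/* only descendants share root's usage */
}

int main(void)
{
	struct group root = { "root", NULL,  true };
	struct group a    = { "A",    &root, true };
	struct group a1   = { "A/1",  &a,    true };
	struct group b    = { "B",    &root, true };

	printf("%d %d %d\n",
	       should_drain(&a, &a1),	/* 1: descendant, draining helps */
	       should_drain(&a, &b),	/* 0: unrelated group */
	       should_drain(&a, NULL));	/* 0: empty stock */
	return 0;
}

Skipping unrelated stocks is what cuts the kworker noise visible in the
"before" top output above: a cpu whose cache belongs to a different cgroup no
longer gets a drain work item queued on it.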

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [BUGFIX][PATCH 5/5] memcg: fix percpu cached charge draining frequency
  2011-06-15  1:12       ` KAMEZAWA Hiroyuki
@ 2011-06-15  7:15         ` Michal Hocko
  0 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2011-06-15  7:15 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, linux-mm, akpm, nishimura, bsingharora, hannes, Ying Han

On Wed 15-06-11 10:12:02, KAMEZAWA Hiroyuki wrote:
> On Wed, 15 Jun 2011 09:12:45 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Tue, 14 Jun 2011 09:36:51 +0200
> > Michal Hocko <mhocko@suse.cz> wrote:
> > 
> > > On Mon 13-06-11 12:16:48, KAMEZAWA Hiroyuki wrote:
> > > > From 18b12e53f1cdf6d7feed1f9226c189c34866338c Mon Sep 17 00:00:00 2001
> > > > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > > > Date: Mon, 13 Jun 2011 11:25:43 +0900
> > > > Subject: [PATCH 5/5] memcg: fix percpu cached charge draining frequency
> > > > 
> > > >  For performance, memory cgroup caches some "charge" from res_counter
> > > >  into per cpu cache. This works well but because it's cache,
> > > >  it needs to be flushed in some cases. Typical cases are
> > > >          1. when someone hit limit.
> > > >          2. when rmdir() is called and need to charges to be 0.
> > > > 
> > > > But "1" has problem.
> > > > 
> > > > Recently, with large SMP machines, we see many kworker runs because
> > > > of flushing memcg's cache. Bad things in implementation are
> > > > that even if a cpu contains a cache for memcg not related to
> > > > a memcg which hits limit, drain code is called.
> > > > 
> > > > This patch does
> > > > 	D) don't call at softlimit reclaim.
> > > 
> > > I think this needs some justification. The decision is not that
> > > obvious IMO. I would say that this is a good decision because cached
> > > charges will not help to free any memory (at least not directly) during
> > > background reclaim. What about something like:
> > > "
> > > We are not draining per cpu cached charges during soft limit reclaim 
> > > because background reclaim doesn't care about charges. It tries to free
> > > some memory and charges will not give any.
> > > Cached charges might influence only selection of the biggest soft limit
> > > offender but as the call is done only after the selection has been
> > > already done it makes no change.
> > > "
> > > 
> > > Anyway, wouldn't it be better to have this change separate from the
> > > async draining logic change?
> > 
> > Hmm. I think calling "draining" at softlimit is just a bug.
> > 
> I'll divide patches.

Thanks.

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [-git build bug, PATCH] Re: [BUGFIX][PATCH 2/5] memcg: fix init_page_cgroup nid with sparsemem
  2011-06-13  3:06 ` [BUGFIX][PATCH 2/5] memcg: fix init_page_cgroup nid with sparsemem KAMEZAWA Hiroyuki
  2011-06-14  9:44   ` Johannes Weiner
@ 2011-06-16 10:09   ` Ingo Molnar
  2011-06-16 13:01     ` Hiroyuki Kamezawa
  1 sibling, 1 reply; 24+ messages in thread
From: Ingo Molnar @ 2011-06-16 10:09 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, linux-mm, akpm, nishimura, bsingharora, hannes,
	Michal Hocko, Ying Han, Linus Torvalds


* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> Date: Mon, 13 Jun 2011 10:09:17 +0900
> Subject: [PATCH 2/5] [BUGFIX] memcg: fix init_page_cgroup nid with sparsemem

This fresh upstream commit:

  37573e8c7182: memcg: fix init_page_cgroup nid with sparsemem

is causing widespread build failures on latest -git, on x86:

  mm/page_cgroup.c:308:3: error: implicit declaration of function ‘node_start_pfn’ [-Werror=implicit-function-declaration]
  mm/page_cgroup.c:309:3: error: implicit declaration of function ‘node_end_pfn’ [-Werror=implicit-function-declaration]

This happens on any config that has CONFIG_CGROUP_MEM_RES_CTLR=y but 
CONFIG_NUMA disabled.

For now i've worked around it with the patch below, but the real 
solution would be to make the page_cgroup.c code not depend on NUMA.

Thanks,

	Ingo

diff --git a/init/Kconfig b/init/Kconfig
index 412c21b..1593be9 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -639,6 +639,7 @@ config RESOURCE_COUNTERS
 config CGROUP_MEM_RES_CTLR
 	bool "Memory Resource Controller for Control Groups"
 	depends on RESOURCE_COUNTERS
+	depends on NUMA
 	select MM_OWNER
 	help
 	  Provides a memory resource controller that manages both anonymous

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [-git build bug, PATCH] Re: [BUGFIX][PATCH 2/5] memcg: fix init_page_cgroup nid with sparsemem
  2011-06-16 10:09   ` [-git build bug, PATCH] " Ingo Molnar
@ 2011-06-16 13:01     ` Hiroyuki Kamezawa
  0 siblings, 0 replies; 24+ messages in thread
From: Hiroyuki Kamezawa @ 2011-06-16 13:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: KAMEZAWA Hiroyuki, linux-kernel, linux-mm, akpm, nishimura,
	bsingharora, hannes, Michal Hocko, Ying Han, Linus Torvalds

2011/6/16 Ingo Molnar <mingo@elte.hu>:
>
> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
>> Date: Mon, 13 Jun 2011 10:09:17 +0900
>> Subject: [PATCH 2/5] [BUGFIX] memcg: fix init_page_cgroup nid with sparsemem
>
> This fresh upstream commit:
>
>  37573e8c7182: memcg: fix init_page_cgroup nid with sparsemem
>
> is causing widespread build failures on latest -git, on x86:
>
>  mm/page_cgroup.c:308:3: error: implicit declaration of function ‘node_start_pfn’ [-Werror=implicit-function-declaration]
>  mm/page_cgroup.c:309:3: error: implicit declaration of function ‘node_end_pfn’ [-Werror=implicit-function-declaration]
>
> On any config that has CONFIG_CGROUP_MEM_RES_CTLR=y enabled but
> CONFIG_NUMA disabled.
>
> For now i've worked it around with the patch below, but the real
> solution would be to make the page_cgroup.c code not depend on NUMA.
>
> Thanks,
>
>        Ingo
>
yes, very sorry. I'm now preparing a fix in this thread.

http://marc.info/?t=130819986800002&r=1&w=2

I think I'll be able to post a final fix tomorrow. I'll cc you when I post it.
Thanks,
-Kame
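
For context while the final fix is pending: one possible direction, sketched
here only as an illustration and with no claim that it matches the patch that
eventually went upstream, is to give the two helpers !NUMA fallbacks built
from fields that every pglist_data provides, so mm/page_cgroup.c compiles
whether or not CONFIG_NUMA is set:

/* Sketch only: fallback definitions for configurations where the helpers
 * are not declared. The real fix may belong in include/linux/mmzone.h. */
#ifndef node_start_pfn
#define node_start_pfn(nid)	(NODE_DATA(nid)->node_start_pfn)
#endif

#ifndef node_end_pfn
#define node_end_pfn(nid)	(NODE_DATA(nid)->node_start_pfn + \
				 NODE_DATA(nid)->node_spanned_pages)
#endif

With something along these lines, the "depends on NUMA" Kconfig workaround
above would not be needed.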

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2011-06-16 13:01 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-06-13  3:00 [BUGFIX][PATCH 0/5] memcg bugfixes in the last week KAMEZAWA Hiroyuki
2011-06-13  3:03 ` [BUGFIX][PATCH 1/5] memcg fix numa_stat permission KAMEZAWA Hiroyuki
2011-06-14  9:42   ` Johannes Weiner
2011-06-13  3:06 ` [BUGFIX][PATCH 2/5] memcg: fix init_page_cgroup nid with sparsemem KAMEZAWA Hiroyuki
2011-06-14  9:44   ` Johannes Weiner
2011-06-16 10:09   ` [-git build bug, PATCH] " Ingo Molnar
2011-06-16 13:01     ` Hiroyuki Kamezawa
2011-06-13  3:09 ` [BUGFIX][PATCH 3/5] memcg: clear mm->owner when last possible owner leaves KAMEZAWA Hiroyuki
2011-06-14  9:45   ` Johannes Weiner
2011-06-13  3:11 ` [BUGFIX][PATCH 4/5] memcg: fix wrong check of noswap with softlimit KAMEZAWA Hiroyuki
2011-06-14  9:48   ` Johannes Weiner
2011-06-13  3:16 ` [BUGFIX][PATCH 5/5] memcg: fix percpu cached charge draining frequency KAMEZAWA Hiroyuki
2011-06-13 21:25   ` Andrew Morton
2011-06-15  0:09     ` KAMEZAWA Hiroyuki
2011-06-15  1:19       ` Andrew Morton
2011-06-14  7:36   ` Michal Hocko
2011-06-15  0:12     ` KAMEZAWA Hiroyuki
2011-06-15  1:12       ` KAMEZAWA Hiroyuki
2011-06-15  7:15         ` Michal Hocko
2011-06-14 10:04   ` Johannes Weiner
2011-06-15  0:16     ` KAMEZAWA Hiroyuki
2011-06-15  1:49   ` [BUGFIX][PATCH v6] " KAMEZAWA Hiroyuki
2011-06-15  1:52     ` [BUGFIX][PATCH v2] memcg: avoid percpu cached charge draining at softlimit KAMEZAWA Hiroyuki
2011-06-15  5:30     ` [BUGFIX][PATCH v6] memcg: fix percpu cached charge draining frequency Ying Han
