All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/5] bypass root memcg charges if no memcgs are possible
@ 2013-03-05 13:10 ` Glauber Costa
  0 siblings, 0 replies; 72+ messages in thread
From: Glauber Costa @ 2013-03-05 13:10 UTC (permalink / raw)
  To: linux-mm
  Cc: cgroups, Tejun Heo, Andrew Morton, Michal Hocko, kamezawa.hiroyu,
	handai.szj, anton.vorontsov

Hi,

Here is my recent changes in my patchset to bypass charges from the root cgroup
if no other memcgs are present in the system. At the time, the main complaint
came from Michal Hocko, correctly pointing out that if hierarchy == 0, we can't
bypass the root memcg forever and at some point, we need to transfer charges.
This is now done in this patch.

DISCLAIMER:

I haven't yet got access to a big box again. I am sending this with outdated
numbers in the interest of having this in the open earlier. However, the main
idea stands, and I believe the numbers are still generally valid (albeit of
course, it would be better to have them updated. Independent evaluations always
welcome)

* v2
- Fixed some LRU bugs
- Only keep bypassing if we have root-level hierarchy.

Glauber Costa (5):
  memcg: make nocpu_base available for non hotplug
  memcg: provide root figures from system totals
  memcg: make it suck faster
  memcg: do not call page_cgroup_init at system_boot
  memcg: do not walk all the way to the root for memcg

 include/linux/memcontrol.h  |  72 ++++++++++---
 include/linux/page_cgroup.h |  28 ++---
 init/main.c                 |   2 -
 mm/memcontrol.c             | 243 ++++++++++++++++++++++++++++++++++++++++----
 mm/page_cgroup.c            | 150 ++++++++++++++-------------
 5 files changed, 382 insertions(+), 113 deletions(-)

-- 
1.8.1.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH v2 0/5] bypass root memcg charges if no memcgs are possible
@ 2013-03-05 13:10 ` Glauber Costa
  0 siblings, 0 replies; 72+ messages in thread
From: Glauber Costa @ 2013-03-05 13:10 UTC (permalink / raw)
  To: linux-mm-Bw31MaZKKs3YtjvyW6yDsg
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, Andrew Morton,
	Michal Hocko, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	handai.szj-Re5JQEeQqe8AvxtiuMwx3w,
	anton.vorontsov-QSEj5FYQhm4dnm+yROfE0A

Hi,

Here is my recent changes in my patchset to bypass charges from the root cgroup
if no other memcgs are present in the system. At the time, the main complaint
came from Michal Hocko, correctly pointing out that if hierarchy == 0, we can't
bypass the root memcg forever and at some point, we need to transfer charges.
This is now done in this patch.

DISCLAIMER:

I haven't yet got access to a big box again. I am sending this with outdated
numbers in the interest of having this in the open earlier. However, the main
idea stands, and I believe the numbers are still generally valid (albeit of
course, it would be better to have them updated. Independent evaluations always
welcome)

* v2
- Fixed some LRU bugs
- Only keep bypassing if we have root-level hierarchy.

Glauber Costa (5):
  memcg: make nocpu_base available for non hotplug
  memcg: provide root figures from system totals
  memcg: make it suck faster
  memcg: do not call page_cgroup_init at system_boot
  memcg: do not walk all the way to the root for memcg

 include/linux/memcontrol.h  |  72 ++++++++++---
 include/linux/page_cgroup.h |  28 ++---
 init/main.c                 |   2 -
 mm/memcontrol.c             | 243 ++++++++++++++++++++++++++++++++++++++++----
 mm/page_cgroup.c            | 150 ++++++++++++++-------------
 5 files changed, 382 insertions(+), 113 deletions(-)

-- 
1.8.1.2

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH v2 1/5] memcg: make nocpu_base available for non hotplug
@ 2013-03-05 13:10   ` Glauber Costa
  0 siblings, 0 replies; 72+ messages in thread
From: Glauber Costa @ 2013-03-05 13:10 UTC (permalink / raw)
  To: linux-mm
  Cc: cgroups, Tejun Heo, Andrew Morton, Michal Hocko, kamezawa.hiroyu,
	handai.szj, anton.vorontsov, Glauber Costa, Johannes Weiner

We are using nocpu_base to accumulate charges on the main counters
during cpu hotplug. I have a similar need, which is transferring charges
to the root cgroup when lazily enabling memcg. Because system wide
information is not kept per-cpu, it is hard to distribute it. This field
works well for this. So we need to make it available for all usages, not
only hotplug cases.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
---
 mm/memcontrol.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 669d16a..b8b363f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -921,11 +921,11 @@ static long mem_cgroup_read_stat(struct mem_cgroup *memcg,
 	get_online_cpus();
 	for_each_online_cpu(cpu)
 		val += per_cpu(memcg->stat->count[idx], cpu);
-#ifdef CONFIG_HOTPLUG_CPU
+
 	spin_lock(&memcg->pcp_counter_lock);
 	val += memcg->nocpu_base.count[idx];
 	spin_unlock(&memcg->pcp_counter_lock);
-#endif
+
 	put_online_cpus();
 	return val;
 }
@@ -945,11 +945,11 @@ static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
 
 	for_each_online_cpu(cpu)
 		val += per_cpu(memcg->stat->events[idx], cpu);
-#ifdef CONFIG_HOTPLUG_CPU
+
 	spin_lock(&memcg->pcp_counter_lock);
 	val += memcg->nocpu_base.events[idx];
 	spin_unlock(&memcg->pcp_counter_lock);
-#endif
+
 	return val;
 }
 
-- 
1.8.1.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v2 1/5] memcg: make nocpu_base available for non hotplug
@ 2013-03-05 13:10   ` Glauber Costa
  0 siblings, 0 replies; 72+ messages in thread
From: Glauber Costa @ 2013-03-05 13:10 UTC (permalink / raw)
  To: linux-mm-Bw31MaZKKs3YtjvyW6yDsg
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, Andrew Morton,
	Michal Hocko, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	handai.szj-Re5JQEeQqe8AvxtiuMwx3w,
	anton.vorontsov-QSEj5FYQhm4dnm+yROfE0A, Glauber Costa,
	Johannes Weiner

We are using nocpu_base to accumulate charges on the main counters
during cpu hotplug. I have a similar need, which is transferring charges
to the root cgroup when lazily enabling memcg. Because system wide
information is not kept per-cpu, it is hard to distribute it. This field
works well for this. So we need to make it available for all usages, not
only hotplug cases.

Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
Cc: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 mm/memcontrol.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 669d16a..b8b363f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -921,11 +921,11 @@ static long mem_cgroup_read_stat(struct mem_cgroup *memcg,
 	get_online_cpus();
 	for_each_online_cpu(cpu)
 		val += per_cpu(memcg->stat->count[idx], cpu);
-#ifdef CONFIG_HOTPLUG_CPU
+
 	spin_lock(&memcg->pcp_counter_lock);
 	val += memcg->nocpu_base.count[idx];
 	spin_unlock(&memcg->pcp_counter_lock);
-#endif
+
 	put_online_cpus();
 	return val;
 }
@@ -945,11 +945,11 @@ static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
 
 	for_each_online_cpu(cpu)
 		val += per_cpu(memcg->stat->events[idx], cpu);
-#ifdef CONFIG_HOTPLUG_CPU
+
 	spin_lock(&memcg->pcp_counter_lock);
 	val += memcg->nocpu_base.events[idx];
 	spin_unlock(&memcg->pcp_counter_lock);
-#endif
+
 	return val;
 }
 
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v2 2/5] memcg: provide root figures from system totals
@ 2013-03-05 13:10   ` Glauber Costa
  0 siblings, 0 replies; 72+ messages in thread
From: Glauber Costa @ 2013-03-05 13:10 UTC (permalink / raw)
  To: linux-mm
  Cc: cgroups, Tejun Heo, Andrew Morton, Michal Hocko, kamezawa.hiroyu,
	handai.szj, anton.vorontsov, Glauber Costa, Johannes Weiner,
	Mel Gorman

For the root memcg, there is no need to rely on the res_counters if hierarchy
is enabled The sum of all mem cgroups plus the tasks in root itself, is
necessarily the amount of memory used for the whole system. Since those figures
are already kept somewhere anyway, we can just return them here, without too
much hassle.

Limit and soft limit can't be set for the root cgroup, so they are left at
RESOURCE_MAX. Failcnt is left at 0, because its actual meaning is how many
times we failed allocations due to the limit being hit. We will fail
allocations in the root cgroup, but the limit will never the reason.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Michal Hocko <mhocko@suse.cz>
CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: Johannes Weiner <hannes@cmpxchg.org>
CC: Mel Gorman <mgorman@suse.de>
CC: Andrew Morton <akpm@linux-foundation.org>
---
 mm/memcontrol.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 64 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b8b363f..bfbf1c2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4996,6 +4996,56 @@ static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
 	return val << PAGE_SHIFT;
 }
 
+static u64 memcg_read_root_rss(void)
+{
+	struct task_struct *p;
+
+	u64 rss = 0;
+	read_lock(&tasklist_lock);
+	for_each_process(p) {
+		if (!p->mm)
+			continue;
+		task_lock(p);
+		rss += get_mm_rss(p->mm);
+		task_unlock(p);
+	}
+	read_unlock(&tasklist_lock);
+	return rss;
+}
+
+static u64 mem_cgroup_read_root(enum res_type type, int name)
+{
+	if (name == RES_LIMIT)
+		return RESOURCE_MAX;
+	if (name == RES_SOFT_LIMIT)
+		return RESOURCE_MAX;
+	if (name == RES_FAILCNT)
+		return 0;
+	if (name == RES_MAX_USAGE)
+		return 0;
+
+	if (WARN_ON_ONCE(name != RES_USAGE))
+		return 0;
+
+	switch (type) {
+	case _MEM:
+		return (memcg_read_root_rss() +
+		atomic_long_read(&vm_stat[NR_FILE_PAGES])) << PAGE_SHIFT;
+	case _MEMSWAP: {
+		struct sysinfo i;
+		si_swapinfo(&i);
+
+		return ((memcg_read_root_rss() +
+		atomic_long_read(&vm_stat[NR_FILE_PAGES])) << PAGE_SHIFT) +
+		i.totalswap - i.freeswap;
+	}
+	case _KMEM:
+		return 0;
+	default:
+		BUG();
+	};
+}
+
 static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
 			       struct file *file, char __user *buf,
 			       size_t nbytes, loff_t *ppos)
@@ -5012,6 +5062,19 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
 	if (!do_swap_account && type == _MEMSWAP)
 		return -EOPNOTSUPP;
 
+	/*
+	 * If we have root-level hierarchy, we can be certain that the charges
+	 * in root are always global. We can then bypass the root cgroup
+	 * entirely in this case, hopefuly leading to less contention in the
+	 * root res_counters. The charges presented after reading it will
+	 * always be the global charges.
+	 */
+	if (mem_cgroup_disabled() ||
+		(mem_cgroup_is_root(memcg) && memcg->use_hierarchy)) {
+		val = mem_cgroup_read_root(type, name);
+		goto root_bypass;
+	}
+
 	switch (type) {
 	case _MEM:
 		if (name == RES_USAGE)
@@ -5032,6 +5095,7 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
 		BUG();
 	}
 
+root_bypass:
 	len = scnprintf(str, sizeof(str), "%llu\n", (unsigned long long)val);
 	return simple_read_from_buffer(buf, nbytes, ppos, str, len);
 }
-- 
1.8.1.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v2 2/5] memcg: provide root figures from system totals
@ 2013-03-05 13:10   ` Glauber Costa
  0 siblings, 0 replies; 72+ messages in thread
From: Glauber Costa @ 2013-03-05 13:10 UTC (permalink / raw)
  To: linux-mm-Bw31MaZKKs3YtjvyW6yDsg
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, Andrew Morton,
	Michal Hocko, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	handai.szj-Re5JQEeQqe8AvxtiuMwx3w,
	anton.vorontsov-QSEj5FYQhm4dnm+yROfE0A, Glauber Costa,
	Johannes Weiner, Mel Gorman

For the root memcg, there is no need to rely on the res_counters if hierarchy
is enabled The sum of all mem cgroups plus the tasks in root itself, is
necessarily the amount of memory used for the whole system. Since those figures
are already kept somewhere anyway, we can just return them here, without too
much hassle.

Limit and soft limit can't be set for the root cgroup, so they are left at
RESOURCE_MAX. Failcnt is left at 0, because its actual meaning is how many
times we failed allocations due to the limit being hit. We will fail
allocations in the root cgroup, but the limit will never the reason.

Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
CC: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
CC: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
CC: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
---
 mm/memcontrol.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 64 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b8b363f..bfbf1c2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4996,6 +4996,56 @@ static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
 	return val << PAGE_SHIFT;
 }
 
+static u64 memcg_read_root_rss(void)
+{
+	struct task_struct *p;
+
+	u64 rss = 0;
+	read_lock(&tasklist_lock);
+	for_each_process(p) {
+		if (!p->mm)
+			continue;
+		task_lock(p);
+		rss += get_mm_rss(p->mm);
+		task_unlock(p);
+	}
+	read_unlock(&tasklist_lock);
+	return rss;
+}
+
+static u64 mem_cgroup_read_root(enum res_type type, int name)
+{
+	if (name == RES_LIMIT)
+		return RESOURCE_MAX;
+	if (name == RES_SOFT_LIMIT)
+		return RESOURCE_MAX;
+	if (name == RES_FAILCNT)
+		return 0;
+	if (name == RES_MAX_USAGE)
+		return 0;
+
+	if (WARN_ON_ONCE(name != RES_USAGE))
+		return 0;
+
+	switch (type) {
+	case _MEM:
+		return (memcg_read_root_rss() +
+		atomic_long_read(&vm_stat[NR_FILE_PAGES])) << PAGE_SHIFT;
+	case _MEMSWAP: {
+		struct sysinfo i;
+		si_swapinfo(&i);
+
+		return ((memcg_read_root_rss() +
+		atomic_long_read(&vm_stat[NR_FILE_PAGES])) << PAGE_SHIFT) +
+		i.totalswap - i.freeswap;
+	}
+	case _KMEM:
+		return 0;
+	default:
+		BUG();
+	};
+}
+
 static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
 			       struct file *file, char __user *buf,
 			       size_t nbytes, loff_t *ppos)
@@ -5012,6 +5062,19 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
 	if (!do_swap_account && type == _MEMSWAP)
 		return -EOPNOTSUPP;
 
+	/*
+	 * If we have root-level hierarchy, we can be certain that the charges
+	 * in root are always global. We can then bypass the root cgroup
+	 * entirely in this case, hopefuly leading to less contention in the
+	 * root res_counters. The charges presented after reading it will
+	 * always be the global charges.
+	 */
+	if (mem_cgroup_disabled() ||
+		(mem_cgroup_is_root(memcg) && memcg->use_hierarchy)) {
+		val = mem_cgroup_read_root(type, name);
+		goto root_bypass;
+	}
+
 	switch (type) {
 	case _MEM:
 		if (name == RES_USAGE)
@@ -5032,6 +5095,7 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
 		BUG();
 	}
 
+root_bypass:
 	len = scnprintf(str, sizeof(str), "%llu\n", (unsigned long long)val);
 	return simple_read_from_buffer(buf, nbytes, ppos, str, len);
 }
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v2 3/5] memcg: make it suck faster
@ 2013-03-05 13:10   ` Glauber Costa
  0 siblings, 0 replies; 72+ messages in thread
From: Glauber Costa @ 2013-03-05 13:10 UTC (permalink / raw)
  To: linux-mm
  Cc: cgroups, Tejun Heo, Andrew Morton, Michal Hocko, kamezawa.hiroyu,
	handai.szj, anton.vorontsov, Glauber Costa, Johannes Weiner,
	Mel Gorman

It is an accepted fact that memcg sucks. But can it suck faster?  Or in
a more fair statement, can it at least stop draining everyone's
performance when it is not in use?

This experimental and slightly crude patch demonstrates that we can do
that by using static branches to patch it out until the first memcg
comes to life. There are edges to be trimmed, and I appreciate comments
for direction. In particular, the events in the root are not fired, but
I believe this can be done without further problems by calling a
specialized event check from mem_cgroup_newpage_charge().

My goal was to have enough numbers to demonstrate the performance gain
that can come from it. I tested it in a 24-way 2-socket Intel box, 24 Gb
mem. I used Mel Gorman's pft test, that he used to demonstrate this
problem back in the Kernel Summit. There are three kernels:

nomemcg  : memcg compile disabled.
base     : memcg enabled, patch not applied.
bypassed : memcg enabled, with patch applied.

                base    bypassed
User          109.12      105.64
System       1646.84     1597.98
Elapsed       229.56      215.76

             nomemcg    bypassed
User          104.35      105.64
System       1578.19     1597.98
Elapsed       212.33      215.76

So as one can see, the difference between base and nomemcg in terms
of both system time and elapsed time is quite drastic, and consistent
with the figures shown by Mel Gorman in the Kernel summit. This is a
~ 7 % drop in performance, just by having memcg enabled. memcg functions
appear heavily in the profiles, even if all tasks lives in the root
memcg.

With bypassed kernel, we drop this down to 1.5 %, which starts to fall
in the acceptable range. More investigation is needed to see if we can
claim that last percent back, but I believe at last part of it should
be.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Michal Hocko <mhocko@suse.cz>
CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: Johannes Weiner <hannes@cmpxchg.org>
CC: Mel Gorman <mgorman@suse.de>
CC: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/memcontrol.h |  72 ++++++++++++++++----
 mm/memcontrol.c            | 166 +++++++++++++++++++++++++++++++++++++++++----
 mm/page_cgroup.c           |   4 +-
 3 files changed, 216 insertions(+), 26 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d6183f0..009f925 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -42,6 +42,26 @@ struct mem_cgroup_reclaim_cookie {
 };
 
 #ifdef CONFIG_MEMCG
+extern struct static_key memcg_in_use_key;
+
+static inline bool mem_cgroup_subsys_disabled(void)
+{
+	return !!mem_cgroup_subsys.disabled;
+}
+
+static inline bool mem_cgroup_disabled(void)
+{
+	/*
+	 * Will always be false if subsys is disabled, because we have no one
+	 * to bump it up. So the test suffices and we don't have to test the
+	 * subsystem as well
+	 */
+	if (!static_key_false(&memcg_in_use_key))
+		return true;
+	return false;
+}
+
+
 /*
  * All "charge" functions with gfp_mask should use GFP_KERNEL or
  * (gfp_mask & GFP_RECLAIM_MASK). In current implementatin, memcg doesn't
@@ -53,8 +73,18 @@ struct mem_cgroup_reclaim_cookie {
  * (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
  */
 
-extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
+extern int __mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask);
+
+static inline int
+mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
+			  gfp_t gfp_mask)
+{
+	if (mem_cgroup_disabled())
+		return 0;
+	return __mem_cgroup_newpage_charge(page, mm, gfp_mask);
+}
+
 /* for swap handling */
 extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 		struct page *page, gfp_t mask, struct mem_cgroup **memcgp);
@@ -62,8 +92,17 @@ extern void mem_cgroup_commit_charge_swapin(struct page *page,
 					struct mem_cgroup *memcg);
 extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);
 
-extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
-					gfp_t gfp_mask);
+
+extern int __mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
+				     gfp_t gfp_mask);
+static inline int
+mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
+{
+	if (mem_cgroup_disabled())
+		return 0;
+
+	return __mem_cgroup_cache_charge(page, mm, gfp_mask);
+}
 
 struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
@@ -72,8 +111,24 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
 extern void mem_cgroup_uncharge_start(void);
 extern void mem_cgroup_uncharge_end(void);
 
-extern void mem_cgroup_uncharge_page(struct page *page);
-extern void mem_cgroup_uncharge_cache_page(struct page *page);
+extern void __mem_cgroup_uncharge_page(struct page *page);
+extern void __mem_cgroup_uncharge_cache_page(struct page *page);
+
+static inline void mem_cgroup_uncharge_page(struct page *page)
+{
+	if (mem_cgroup_disabled())
+		return;
+
+	__mem_cgroup_uncharge_page(page);
+}
+
+static inline void mem_cgroup_uncharge_cache_page(struct page *page)
+{
+	if (mem_cgroup_disabled())
+		return;
+
+	__mem_cgroup_uncharge_cache_page(page);
+}
 
 bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
 				  struct mem_cgroup *memcg);
@@ -128,13 +183,6 @@ extern void mem_cgroup_replace_page_cache(struct page *oldpage,
 extern int do_swap_account;
 #endif
 
-static inline bool mem_cgroup_disabled(void)
-{
-	if (mem_cgroup_subsys.disabled)
-		return true;
-	return false;
-}
-
 void __mem_cgroup_begin_update_page_stat(struct page *page, bool *locked,
 					 unsigned long *flags);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bfbf1c2..45c1886 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -575,6 +575,9 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
 	return (memcg == root_mem_cgroup);
 }
 
+static bool memcg_charges_allowed = false;
+struct static_key memcg_in_use_key;
+
 /* Writing them here to avoid exposing memcg's inner layout */
 #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
 
@@ -710,6 +713,7 @@ static void disarm_static_keys(struct mem_cgroup *memcg)
 {
 	disarm_sock_keys(memcg);
 	disarm_kmem_keys(memcg);
+	static_key_slow_dec(&memcg_in_use_key);
 }
 
 static void drain_all_stock_async(struct mem_cgroup *memcg);
@@ -1109,6 +1113,9 @@ struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
 	if (unlikely(!p))
 		return NULL;
 
+	if (mem_cgroup_disabled())
+		return root_mem_cgroup;
+
 	return mem_cgroup_from_css(task_subsys_state(p, mem_cgroup_subsys_id));
 }
 
@@ -1157,9 +1164,12 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	struct mem_cgroup *memcg = NULL;
 	int id = 0;
 
-	if (mem_cgroup_disabled())
+	if (mem_cgroup_subsys_disabled())
 		return NULL;
 
+	if (mem_cgroup_disabled())
+		return root_mem_cgroup;
+
 	if (!root)
 		root = root_mem_cgroup;
 
@@ -1335,6 +1345,20 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
 	memcg = pc->mem_cgroup;
 
 	/*
+	 * Because we lazily enable memcg only after first child group is
+	 * created, we can have memcg == 0. Because page cgroup is created with
+	 * GFP_ZERO, and after charging, all page cgroups will have a non-zero
+	 * cgroup attached (even if root), we can be sure that this is a
+	 * used-but-not-accounted page. (due to lazyness). We could get around
+	 * that by scanning all pages on cgroup init is too expensive. We can
+	 * ultimately pay, but prefer to just to defer the update until we get
+	 * here. We could take the opportunity to set PageCgroupUsed, but it
+	 * won't be that important for the root cgroup.
+	 */
+	if (!memcg && PageLRU(page))
+		pc->mem_cgroup = memcg = root_mem_cgroup;
+
+	/*
 	 * Surreptitiously switch any uncharged offlist page to root:
 	 * an uncharged page off lru does nothing to secure
 	 * its former mem_cgroup from sudden removal.
@@ -3845,11 +3869,18 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
 	return 0;
 }
 
-int mem_cgroup_newpage_charge(struct page *page,
+int __mem_cgroup_newpage_charge(struct page *page,
 			      struct mm_struct *mm, gfp_t gfp_mask)
 {
-	if (mem_cgroup_disabled())
+	/*
+	 * The branch is actually very likely before the first memcg comes in.
+	 * But since the code is patched out, we'll never reach it. It is only
+	 * reachable when the code is patched in, and in that case it is
+	 * unlikely.  It will only happen during initial charges move.
+	 */
+	if (unlikely(!memcg_charges_allowed))
 		return 0;
+
 	VM_BUG_ON(page_mapped(page));
 	VM_BUG_ON(page->mapping && !PageAnon(page));
 	VM_BUG_ON(!mm);
@@ -3962,15 +3993,13 @@ void mem_cgroup_commit_charge_swapin(struct page *page,
 					  MEM_CGROUP_CHARGE_TYPE_ANON);
 }
 
-int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
-				gfp_t gfp_mask)
+int __mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
+			      gfp_t gfp_mask)
 {
 	struct mem_cgroup *memcg = NULL;
 	enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE;
 	int ret;
 
-	if (mem_cgroup_disabled())
-		return 0;
 	if (PageCompound(page))
 		return 0;
 
@@ -4050,9 +4079,6 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype,
 	struct page_cgroup *pc;
 	bool anon;
 
-	if (mem_cgroup_disabled())
-		return NULL;
-
 	VM_BUG_ON(PageSwapCache(page));
 
 	if (PageTransHuge(page)) {
@@ -4144,7 +4170,7 @@ unlock_out:
 	return NULL;
 }
 
-void mem_cgroup_uncharge_page(struct page *page)
+void __mem_cgroup_uncharge_page(struct page *page)
 {
 	/* early check. */
 	if (page_mapped(page))
@@ -4155,7 +4181,7 @@ void mem_cgroup_uncharge_page(struct page *page)
 	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_ANON, false);
 }
 
-void mem_cgroup_uncharge_cache_page(struct page *page)
+void __mem_cgroup_uncharge_cache_page(struct page *page)
 {
 	VM_BUG_ON(page_mapped(page));
 	VM_BUG_ON(page->mapping);
@@ -4220,6 +4246,9 @@ mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
 	struct mem_cgroup *memcg;
 	int ctype = MEM_CGROUP_CHARGE_TYPE_SWAPOUT;
 
+	if (mem_cgroup_disabled())
+		return;
+
 	if (!swapout) /* this was a swap cache but the swap is unused ! */
 		ctype = MEM_CGROUP_CHARGE_TYPE_DROP;
 
@@ -6364,6 +6393,59 @@ free_out:
 	return ERR_PTR(error);
 }
 
+static void memcg_update_root_statistics(void)
+{
+	int cpu;
+	u64 pgin, pgout, faults, mjfaults;
+
+	pgin = pgout = faults = mjfaults = 0;
+	for_each_online_cpu(cpu) {
+		struct vm_event_state *ev = &per_cpu(vm_event_states, cpu);
+		struct mem_cgroup_stat_cpu *memcg_stat;
+
+		memcg_stat = per_cpu_ptr(root_mem_cgroup->stat, cpu);
+
+		memcg_stat->events[MEM_CGROUP_EVENTS_PGPGIN] =
+							ev->event[PGPGIN];
+		memcg_stat->events[MEM_CGROUP_EVENTS_PGPGOUT] =
+							ev->event[PGPGOUT];
+		memcg_stat->events[MEM_CGROUP_EVENTS_PGFAULT] =
+							ev->event[PGFAULT];
+		memcg_stat->events[MEM_CGROUP_EVENTS_PGMAJFAULT] =
+							ev->event[PGMAJFAULT];
+
+		memcg_stat->nr_page_events = ev->event[PGPGIN] +
+					     ev->event[PGPGOUT];
+	}
+
+	root_mem_cgroup->nocpu_base.count[MEM_CGROUP_STAT_RSS] =
+				memcg_read_root_rss();
+	root_mem_cgroup->nocpu_base.count[MEM_CGROUP_STAT_CACHE] =
+				atomic_long_read(&vm_stat[NR_FILE_PAGES]);
+	root_mem_cgroup->nocpu_base.count[MEM_CGROUP_STAT_FILE_MAPPED] =
+				atomic_long_read(&vm_stat[NR_FILE_MAPPED]);
+}
+
+static void memcg_update_root_lru(void)
+{
+	struct zone *zone;
+	struct lruvec *lruvec;
+	struct mem_cgroup_per_zone *mz;
+	enum lru_list lru;
+
+	for_each_populated_zone(zone) {
+		spin_lock_irq(&zone->lru_lock);
+		lruvec = &zone->lruvec;
+		mz = mem_cgroup_zoneinfo(root_mem_cgroup,
+				zone_to_nid(zone), zone_idx(zone));
+
+		for (lru = LRU_BASE; lru < NR_LRU_LISTS; lru++)
+			mz->lru_size[lru] =
+				zone_page_state(zone, NR_LRU_BASE + lru);
+		spin_unlock_irq(&zone->lru_lock);
+	}
+}
+
 static int
 mem_cgroup_css_online(struct cgroup *cont)
 {
@@ -6407,6 +6489,66 @@ mem_cgroup_css_online(struct cgroup *cont)
 	}
 
 	error = memcg_init_kmem(memcg, &mem_cgroup_subsys);
+
+	if (!error) {
+		static_key_slow_inc(&memcg_in_use_key);
+		/*
+		 * The strategy to avoid races here is to let the charges just
+		 * be globally made until we lock the res counter. Since we are
+		 * copying charges from global statistics, it doesn't really
+		 * matter when we do it, as long as we are consistent. So even
+		 * after the code is patched in, they will continue being
+		 * globally charged due to memcg_charges_allowed being set to
+		 * false.
+		 *
+		 * Once we hold the res counter lock, though, we can already
+		 * safely flip it: We will go through with the charging to the
+		 * root memcg, but won't be able to actually charge it: we have
+		 * the lock.
+		 *
+		 * This works because the mm stats are only updated after the
+		 * memcg charging suceeds. If we block the charge by holding
+		 * the res_counter lock, no other charges will happen in the
+		 * system until we release it.
+		 *
+		 * manipulation always safe because the write side is always
+		 * under the memcg_mutex.
+		 */
+		if (!memcg_charges_allowed) {
+			struct zone *zone;
+
+			get_online_cpus();
+			spin_lock(&root_mem_cgroup->res.lock);
+
+			memcg_charges_allowed = true;
+
+			root_mem_cgroup->res.usage =
+				mem_cgroup_read_root(RES_USAGE, _MEM);
+			root_mem_cgroup->memsw.usage =
+				mem_cgroup_read_root(RES_USAGE, _MEMSWAP);
+			/*
+			 * The max usage figure is not entirely accurate. The
+			 * memory may have been higher in the past. But since
+			 * we don't track that globally, this is the best we
+			 * can do.
+			 */
+			root_mem_cgroup->res.max_usage =
+					root_mem_cgroup->res.usage;
+			root_mem_cgroup->memsw.max_usage =
+					root_mem_cgroup->memsw.usage;
+
+			memcg_update_root_statistics();
+			memcg_update_root_lru();
+			/*
+			 * We are now 100 % consistent and all charges are
+			 * transfered.  New charges should reach the
+			 * res_counter directly.
+			 */
+			spin_unlock(&root_mem_cgroup->res.lock);
+			put_online_cpus();
+		}
+	}
+
 	mutex_unlock(&memcg_create_mutex);
 	if (error) {
 		/*
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 6d757e3..a5bd322 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -68,7 +68,7 @@ void __init page_cgroup_init_flatmem(void)
 
 	int nid, fail;
 
-	if (mem_cgroup_disabled())
+	if (mem_cgroup_subsys_disabled())
 		return;
 
 	for_each_online_node(nid)  {
@@ -271,7 +271,7 @@ void __init page_cgroup_init(void)
 	unsigned long pfn;
 	int nid;
 
-	if (mem_cgroup_disabled())
+	if (mem_cgroup_subsys_disabled())
 		return;
 
 	for_each_node_state(nid, N_MEMORY) {
-- 
1.8.1.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v2 3/5] memcg: make it suck faster
@ 2013-03-05 13:10   ` Glauber Costa
  0 siblings, 0 replies; 72+ messages in thread
From: Glauber Costa @ 2013-03-05 13:10 UTC (permalink / raw)
  To: linux-mm-Bw31MaZKKs3YtjvyW6yDsg
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, Andrew Morton,
	Michal Hocko, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	handai.szj-Re5JQEeQqe8AvxtiuMwx3w,
	anton.vorontsov-QSEj5FYQhm4dnm+yROfE0A, Glauber Costa,
	Johannes Weiner, Mel Gorman

It is an accepted fact that memcg sucks. But can it suck faster?  Or in
a more fair statement, can it at least stop draining everyone's
performance when it is not in use?

This experimental and slightly crude patch demonstrates that we can do
that by using static branches to patch it out until the first memcg
comes to life. There are edges to be trimmed, and I appreciate comments
for direction. In particular, the events in the root are not fired, but
I believe this can be done without further problems by calling a
specialized event check from mem_cgroup_newpage_charge().

My goal was to have enough numbers to demonstrate the performance gain
that can come from it. I tested it in a 24-way 2-socket Intel box, 24 Gb
mem. I used Mel Gorman's pft test, that he used to demonstrate this
problem back in the Kernel Summit. There are three kernels:

nomemcg  : memcg compile disabled.
base     : memcg enabled, patch not applied.
bypassed : memcg enabled, with patch applied.

                base    bypassed
User          109.12      105.64
System       1646.84     1597.98
Elapsed       229.56      215.76

             nomemcg    bypassed
User          104.35      105.64
System       1578.19     1597.98
Elapsed       212.33      215.76

So as one can see, the difference between base and nomemcg in terms
of both system time and elapsed time is quite drastic, and consistent
with the figures shown by Mel Gorman in the Kernel summit. This is a
~ 7 % drop in performance, just by having memcg enabled. memcg functions
appear heavily in the profiles, even if all tasks lives in the root
memcg.

With bypassed kernel, we drop this down to 1.5 %, which starts to fall
in the acceptable range. More investigation is needed to see if we can
claim that last percent back, but I believe at last part of it should
be.

Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
CC: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
CC: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
CC: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
---
 include/linux/memcontrol.h |  72 ++++++++++++++++----
 mm/memcontrol.c            | 166 +++++++++++++++++++++++++++++++++++++++++----
 mm/page_cgroup.c           |   4 +-
 3 files changed, 216 insertions(+), 26 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d6183f0..009f925 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -42,6 +42,26 @@ struct mem_cgroup_reclaim_cookie {
 };
 
 #ifdef CONFIG_MEMCG
+extern struct static_key memcg_in_use_key;
+
+static inline bool mem_cgroup_subsys_disabled(void)
+{
+	return !!mem_cgroup_subsys.disabled;
+}
+
+static inline bool mem_cgroup_disabled(void)
+{
+	/*
+	 * Will always be false if subsys is disabled, because we have no one
+	 * to bump it up. So the test suffices and we don't have to test the
+	 * subsystem as well
+	 */
+	if (!static_key_false(&memcg_in_use_key))
+		return true;
+	return false;
+}
+
+
 /*
  * All "charge" functions with gfp_mask should use GFP_KERNEL or
  * (gfp_mask & GFP_RECLAIM_MASK). In current implementatin, memcg doesn't
@@ -53,8 +73,18 @@ struct mem_cgroup_reclaim_cookie {
  * (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
  */
 
-extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
+extern int __mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask);
+
+static inline int
+mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
+			  gfp_t gfp_mask)
+{
+	if (mem_cgroup_disabled())
+		return 0;
+	return __mem_cgroup_newpage_charge(page, mm, gfp_mask);
+}
+
 /* for swap handling */
 extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 		struct page *page, gfp_t mask, struct mem_cgroup **memcgp);
@@ -62,8 +92,17 @@ extern void mem_cgroup_commit_charge_swapin(struct page *page,
 					struct mem_cgroup *memcg);
 extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);
 
-extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
-					gfp_t gfp_mask);
+
+extern int __mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
+				     gfp_t gfp_mask);
+static inline int
+mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
+{
+	if (mem_cgroup_disabled())
+		return 0;
+
+	return __mem_cgroup_cache_charge(page, mm, gfp_mask);
+}
 
 struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
@@ -72,8 +111,24 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
 extern void mem_cgroup_uncharge_start(void);
 extern void mem_cgroup_uncharge_end(void);
 
-extern void mem_cgroup_uncharge_page(struct page *page);
-extern void mem_cgroup_uncharge_cache_page(struct page *page);
+extern void __mem_cgroup_uncharge_page(struct page *page);
+extern void __mem_cgroup_uncharge_cache_page(struct page *page);
+
+static inline void mem_cgroup_uncharge_page(struct page *page)
+{
+	if (mem_cgroup_disabled())
+		return;
+
+	__mem_cgroup_uncharge_page(page);
+}
+
+static inline void mem_cgroup_uncharge_cache_page(struct page *page)
+{
+	if (mem_cgroup_disabled())
+		return;
+
+	__mem_cgroup_uncharge_cache_page(page);
+}
 
 bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
 				  struct mem_cgroup *memcg);
@@ -128,13 +183,6 @@ extern void mem_cgroup_replace_page_cache(struct page *oldpage,
 extern int do_swap_account;
 #endif
 
-static inline bool mem_cgroup_disabled(void)
-{
-	if (mem_cgroup_subsys.disabled)
-		return true;
-	return false;
-}
-
 void __mem_cgroup_begin_update_page_stat(struct page *page, bool *locked,
 					 unsigned long *flags);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bfbf1c2..45c1886 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -575,6 +575,9 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
 	return (memcg == root_mem_cgroup);
 }
 
+static bool memcg_charges_allowed = false;
+struct static_key memcg_in_use_key;
+
 /* Writing them here to avoid exposing memcg's inner layout */
 #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
 
@@ -710,6 +713,7 @@ static void disarm_static_keys(struct mem_cgroup *memcg)
 {
 	disarm_sock_keys(memcg);
 	disarm_kmem_keys(memcg);
+	static_key_slow_dec(&memcg_in_use_key);
 }
 
 static void drain_all_stock_async(struct mem_cgroup *memcg);
@@ -1109,6 +1113,9 @@ struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
 	if (unlikely(!p))
 		return NULL;
 
+	if (mem_cgroup_disabled())
+		return root_mem_cgroup;
+
 	return mem_cgroup_from_css(task_subsys_state(p, mem_cgroup_subsys_id));
 }
 
@@ -1157,9 +1164,12 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	struct mem_cgroup *memcg = NULL;
 	int id = 0;
 
-	if (mem_cgroup_disabled())
+	if (mem_cgroup_subsys_disabled())
 		return NULL;
 
+	if (mem_cgroup_disabled())
+		return root_mem_cgroup;
+
 	if (!root)
 		root = root_mem_cgroup;
 
@@ -1335,6 +1345,20 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
 	memcg = pc->mem_cgroup;
 
 	/*
+	 * Because we lazily enable memcg only after first child group is
+	 * created, we can have memcg == 0. Because page cgroup is created with
+	 * GFP_ZERO, and after charging, all page cgroups will have a non-zero
+	 * cgroup attached (even if root), we can be sure that this is a
+	 * used-but-not-accounted page. (due to lazyness). We could get around
+	 * that by scanning all pages on cgroup init is too expensive. We can
+	 * ultimately pay, but prefer to just to defer the update until we get
+	 * here. We could take the opportunity to set PageCgroupUsed, but it
+	 * won't be that important for the root cgroup.
+	 */
+	if (!memcg && PageLRU(page))
+		pc->mem_cgroup = memcg = root_mem_cgroup;
+
+	/*
 	 * Surreptitiously switch any uncharged offlist page to root:
 	 * an uncharged page off lru does nothing to secure
 	 * its former mem_cgroup from sudden removal.
@@ -3845,11 +3869,18 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
 	return 0;
 }
 
-int mem_cgroup_newpage_charge(struct page *page,
+int __mem_cgroup_newpage_charge(struct page *page,
 			      struct mm_struct *mm, gfp_t gfp_mask)
 {
-	if (mem_cgroup_disabled())
+	/*
+	 * The branch is actually very likely before the first memcg comes in.
+	 * But since the code is patched out, we'll never reach it. It is only
+	 * reachable when the code is patched in, and in that case it is
+	 * unlikely.  It will only happen during initial charges move.
+	 */
+	if (unlikely(!memcg_charges_allowed))
 		return 0;
+
 	VM_BUG_ON(page_mapped(page));
 	VM_BUG_ON(page->mapping && !PageAnon(page));
 	VM_BUG_ON(!mm);
@@ -3962,15 +3993,13 @@ void mem_cgroup_commit_charge_swapin(struct page *page,
 					  MEM_CGROUP_CHARGE_TYPE_ANON);
 }
 
-int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
-				gfp_t gfp_mask)
+int __mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
+			      gfp_t gfp_mask)
 {
 	struct mem_cgroup *memcg = NULL;
 	enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE;
 	int ret;
 
-	if (mem_cgroup_disabled())
-		return 0;
 	if (PageCompound(page))
 		return 0;
 
@@ -4050,9 +4079,6 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype,
 	struct page_cgroup *pc;
 	bool anon;
 
-	if (mem_cgroup_disabled())
-		return NULL;
-
 	VM_BUG_ON(PageSwapCache(page));
 
 	if (PageTransHuge(page)) {
@@ -4144,7 +4170,7 @@ unlock_out:
 	return NULL;
 }
 
-void mem_cgroup_uncharge_page(struct page *page)
+void __mem_cgroup_uncharge_page(struct page *page)
 {
 	/* early check. */
 	if (page_mapped(page))
@@ -4155,7 +4181,7 @@ void mem_cgroup_uncharge_page(struct page *page)
 	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_ANON, false);
 }
 
-void mem_cgroup_uncharge_cache_page(struct page *page)
+void __mem_cgroup_uncharge_cache_page(struct page *page)
 {
 	VM_BUG_ON(page_mapped(page));
 	VM_BUG_ON(page->mapping);
@@ -4220,6 +4246,9 @@ mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
 	struct mem_cgroup *memcg;
 	int ctype = MEM_CGROUP_CHARGE_TYPE_SWAPOUT;
 
+	if (mem_cgroup_disabled())
+		return;
+
 	if (!swapout) /* this was a swap cache but the swap is unused ! */
 		ctype = MEM_CGROUP_CHARGE_TYPE_DROP;
 
@@ -6364,6 +6393,59 @@ free_out:
 	return ERR_PTR(error);
 }
 
+static void memcg_update_root_statistics(void)
+{
+	int cpu;
+	u64 pgin, pgout, faults, mjfaults;
+
+	pgin = pgout = faults = mjfaults = 0;
+	for_each_online_cpu(cpu) {
+		struct vm_event_state *ev = &per_cpu(vm_event_states, cpu);
+		struct mem_cgroup_stat_cpu *memcg_stat;
+
+		memcg_stat = per_cpu_ptr(root_mem_cgroup->stat, cpu);
+
+		memcg_stat->events[MEM_CGROUP_EVENTS_PGPGIN] =
+							ev->event[PGPGIN];
+		memcg_stat->events[MEM_CGROUP_EVENTS_PGPGOUT] =
+							ev->event[PGPGOUT];
+		memcg_stat->events[MEM_CGROUP_EVENTS_PGFAULT] =
+							ev->event[PGFAULT];
+		memcg_stat->events[MEM_CGROUP_EVENTS_PGMAJFAULT] =
+							ev->event[PGMAJFAULT];
+
+		memcg_stat->nr_page_events = ev->event[PGPGIN] +
+					     ev->event[PGPGOUT];
+	}
+
+	root_mem_cgroup->nocpu_base.count[MEM_CGROUP_STAT_RSS] =
+				memcg_read_root_rss();
+	root_mem_cgroup->nocpu_base.count[MEM_CGROUP_STAT_CACHE] =
+				atomic_long_read(&vm_stat[NR_FILE_PAGES]);
+	root_mem_cgroup->nocpu_base.count[MEM_CGROUP_STAT_FILE_MAPPED] =
+				atomic_long_read(&vm_stat[NR_FILE_MAPPED]);
+}
+
+static void memcg_update_root_lru(void)
+{
+	struct zone *zone;
+	struct lruvec *lruvec;
+	struct mem_cgroup_per_zone *mz;
+	enum lru_list lru;
+
+	for_each_populated_zone(zone) {
+		spin_lock_irq(&zone->lru_lock);
+		lruvec = &zone->lruvec;
+		mz = mem_cgroup_zoneinfo(root_mem_cgroup,
+				zone_to_nid(zone), zone_idx(zone));
+
+		for (lru = LRU_BASE; lru < NR_LRU_LISTS; lru++)
+			mz->lru_size[lru] =
+				zone_page_state(zone, NR_LRU_BASE + lru);
+		spin_unlock_irq(&zone->lru_lock);
+	}
+}
+
 static int
 mem_cgroup_css_online(struct cgroup *cont)
 {
@@ -6407,6 +6489,66 @@ mem_cgroup_css_online(struct cgroup *cont)
 	}
 
 	error = memcg_init_kmem(memcg, &mem_cgroup_subsys);
+
+	if (!error) {
+		static_key_slow_inc(&memcg_in_use_key);
+		/*
+		 * The strategy to avoid races here is to let the charges just
+		 * be globally made until we lock the res counter. Since we are
+		 * copying charges from global statistics, it doesn't really
+		 * matter when we do it, as long as we are consistent. So even
+		 * after the code is patched in, they will continue being
+		 * globally charged due to memcg_charges_allowed being set to
+		 * false.
+		 *
+		 * Once we hold the res counter lock, though, we can already
+		 * safely flip it: We will go through with the charging to the
+		 * root memcg, but won't be able to actually charge it: we have
+		 * the lock.
+		 *
+		 * This works because the mm stats are only updated after the
+		 * memcg charging suceeds. If we block the charge by holding
+		 * the res_counter lock, no other charges will happen in the
+		 * system until we release it.
+		 *
+		 * manipulation always safe because the write side is always
+		 * under the memcg_mutex.
+		 */
+		if (!memcg_charges_allowed) {
+			struct zone *zone;
+
+			get_online_cpus();
+			spin_lock(&root_mem_cgroup->res.lock);
+
+			memcg_charges_allowed = true;
+
+			root_mem_cgroup->res.usage =
+				mem_cgroup_read_root(RES_USAGE, _MEM);
+			root_mem_cgroup->memsw.usage =
+				mem_cgroup_read_root(RES_USAGE, _MEMSWAP);
+			/*
+			 * The max usage figure is not entirely accurate. The
+			 * memory may have been higher in the past. But since
+			 * we don't track that globally, this is the best we
+			 * can do.
+			 */
+			root_mem_cgroup->res.max_usage =
+					root_mem_cgroup->res.usage;
+			root_mem_cgroup->memsw.max_usage =
+					root_mem_cgroup->memsw.usage;
+
+			memcg_update_root_statistics();
+			memcg_update_root_lru();
+			/*
+			 * We are now 100 % consistent and all charges are
+			 * transfered.  New charges should reach the
+			 * res_counter directly.
+			 */
+			spin_unlock(&root_mem_cgroup->res.lock);
+			put_online_cpus();
+		}
+	}
+
 	mutex_unlock(&memcg_create_mutex);
 	if (error) {
 		/*
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 6d757e3..a5bd322 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -68,7 +68,7 @@ void __init page_cgroup_init_flatmem(void)
 
 	int nid, fail;
 
-	if (mem_cgroup_disabled())
+	if (mem_cgroup_subsys_disabled())
 		return;
 
 	for_each_online_node(nid)  {
@@ -271,7 +271,7 @@ void __init page_cgroup_init(void)
 	unsigned long pfn;
 	int nid;
 
-	if (mem_cgroup_disabled())
+	if (mem_cgroup_subsys_disabled())
 		return;
 
 	for_each_node_state(nid, N_MEMORY) {
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v2 4/5] memcg: do not call page_cgroup_init at system_boot
@ 2013-03-05 13:10   ` Glauber Costa
  0 siblings, 0 replies; 72+ messages in thread
From: Glauber Costa @ 2013-03-05 13:10 UTC (permalink / raw)
  To: linux-mm
  Cc: cgroups, Tejun Heo, Andrew Morton, Michal Hocko, kamezawa.hiroyu,
	handai.szj, anton.vorontsov, Glauber Costa, Johannes Weiner,
	Mel Gorman

If we are not using memcg, there is no reason why we should allocate
this structure, that will be a memory waste at best. We can do better
at least in the sparsemem case, and allocate it when the first cgroup
is requested. It should now not panic on failure, and we have to handle
this right.

flatmem case is a bit more complicated, so that one is left out for
the moment.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Michal Hocko <mhocko@suse.cz>
CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: Johannes Weiner <hannes@cmpxchg.org>
CC: Mel Gorman <mgorman@suse.de>
CC: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/page_cgroup.h |  28 +++++----
 init/main.c                 |   2 -
 mm/memcontrol.c             |   3 +-
 mm/page_cgroup.c            | 150 ++++++++++++++++++++++++--------------------
 4 files changed, 99 insertions(+), 84 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 777a524..ec9fb05 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -14,6 +14,7 @@ enum {
 
 #ifdef CONFIG_MEMCG
 #include <linux/bit_spinlock.h>
+#include <linux/mmzone.h>
 
 /*
  * Page Cgroup can be considered as an extended mem_map.
@@ -27,19 +28,17 @@ struct page_cgroup {
 	struct mem_cgroup *mem_cgroup;
 };
 
-void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
-
-#ifdef CONFIG_SPARSEMEM
-static inline void __init page_cgroup_init_flatmem(void)
+static inline size_t page_cgroup_table_size(int nid)
 {
-}
-extern void __init page_cgroup_init(void);
+#ifdef CONFIG_SPARSEMEM
+	return sizeof(struct page_cgroup) * PAGES_PER_SECTION;
 #else
-void __init page_cgroup_init_flatmem(void);
-static inline void __init page_cgroup_init(void)
-{
-}
+	return sizeof(struct page_cgroup) * NODE_DATA(nid)->node_spanned_pages;
 #endif
+}
+void pgdat_page_cgroup_init(struct pglist_data *pgdat);
+
+extern int page_cgroup_init(void);
 
 struct page_cgroup *lookup_page_cgroup(struct page *page);
 struct page *lookup_cgroup_page(struct page_cgroup *pc);
@@ -85,7 +84,7 @@ static inline void unlock_page_cgroup(struct page_cgroup *pc)
 #else /* CONFIG_MEMCG */
 struct page_cgroup;
 
-static inline void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
+static inline void pgdat_page_cgroup_init(struct pglist_data *pgdat)
 {
 }
 
@@ -94,7 +93,12 @@ static inline struct page_cgroup *lookup_page_cgroup(struct page *page)
 	return NULL;
 }
 
-static inline void page_cgroup_init(void)
+static inline int page_cgroup_init(void)
+{
+	return 0;
+}
+
+static inline void page_cgroup_destroy(void)
 {
 }
 
diff --git a/init/main.c b/init/main.c
index cee4b5c..1fb3ec0 100644
--- a/init/main.c
+++ b/init/main.c
@@ -457,7 +457,6 @@ static void __init mm_init(void)
 	 * page_cgroup requires contiguous pages,
 	 * bigger than MAX_ORDER unless SPARSEMEM.
 	 */
-	page_cgroup_init_flatmem();
 	mem_init();
 	kmem_cache_init();
 	percpu_init_late();
@@ -592,7 +591,6 @@ asmlinkage void __init start_kernel(void)
 		initrd_start = 0;
 	}
 #endif
-	page_cgroup_init();
 	debug_objects_mem_init();
 	kmemleak_init();
 	setup_per_cpu_pageset();
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 45c1886..6019a32 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6377,7 +6377,8 @@ mem_cgroup_css_alloc(struct cgroup *cont)
 		res_counter_init(&memcg->res, NULL);
 		res_counter_init(&memcg->memsw, NULL);
 		res_counter_init(&memcg->kmem, NULL);
-	}
+	} else if (page_cgroup_init())
+		goto free_out;
 
 	memcg->last_scanned_node = MAX_NUMNODES;
 	INIT_LIST_HEAD(&memcg->oom_notify);
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index a5bd322..6d04c28 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -12,11 +12,50 @@
 #include <linux/kmemleak.h>
 
 static unsigned long total_usage;
+static unsigned long page_cgroup_initialized;
 
-#if !defined(CONFIG_SPARSEMEM)
+static void *alloc_page_cgroup(size_t size, int nid)
+{
+	gfp_t flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN;
+	void *addr = NULL;
+
+	addr = alloc_pages_exact_nid(nid, size, flags);
+	if (addr) {
+		kmemleak_alloc(addr, size, 1, flags);
+		return addr;
+	}
+
+	if (node_state(nid, N_HIGH_MEMORY))
+		addr = vzalloc_node(size, nid);
+	else
+		addr = vzalloc(size);
 
+	return addr;
+}
 
-void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
+static void free_page_cgroup(void *addr)
+{
+	if (is_vmalloc_addr(addr)) {
+		vfree(addr);
+	} else {
+		struct page *page = virt_to_page(addr);
+		int nid = page_to_nid(page);
+		BUG_ON(PageReserved(page));
+		free_pages_exact(addr, page_cgroup_table_size(nid));
+	}
+}
+
+static void page_cgroup_msg(void)
+{
+	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
+	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you "
+			 "don't want memory cgroups.\nAlternatively, consider "
+			 "deferring your memory cgroups creation.\n");
+}
+
+#if !defined(CONFIG_SPARSEMEM)
+
+void pgdat_page_cgroup_init(struct pglist_data *pgdat)
 {
 	pgdat->node_page_cgroup = NULL;
 }
@@ -42,20 +81,16 @@ struct page_cgroup *lookup_page_cgroup(struct page *page)
 	return base + offset;
 }
 
-static int __init alloc_node_page_cgroup(int nid)
+static int alloc_node_page_cgroup(int nid)
 {
 	struct page_cgroup *base;
 	unsigned long table_size;
-	unsigned long nr_pages;
 
-	nr_pages = NODE_DATA(nid)->node_spanned_pages;
-	if (!nr_pages)
+	table_size = page_cgroup_table_size(nid);
+	if (!table_size)
 		return 0;
 
-	table_size = sizeof(struct page_cgroup) * nr_pages;
-
-	base = __alloc_bootmem_node_nopanic(NODE_DATA(nid),
-			table_size, PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
+	base = alloc_page_cgroup(table_size, nid);
 	if (!base)
 		return -ENOMEM;
 	NODE_DATA(nid)->node_page_cgroup = base;
@@ -63,27 +98,29 @@ static int __init alloc_node_page_cgroup(int nid)
 	return 0;
 }
 
-void __init page_cgroup_init_flatmem(void)
+int page_cgroup_init(void)
 {
+	int nid, fail, tmpnid;
 
-	int nid, fail;
-
-	if (mem_cgroup_subsys_disabled())
-		return;
+	/* only initialize it once */
+	if (test_and_set_bit(0, &page_cgroup_initialized))
+		return 0;
 
 	for_each_online_node(nid)  {
 		fail = alloc_node_page_cgroup(nid);
 		if (fail)
 			goto fail;
 	}
-	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
-	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you"
-	" don't want memory cgroups\n");
-	return;
+	page_cgroup_msg();
+	return 0;
 fail:
-	printk(KERN_CRIT "allocation of page_cgroup failed.\n");
-	printk(KERN_CRIT "please try 'cgroup_disable=memory' boot option\n");
-	panic("Out of memory");
+	for_each_online_node(tmpnid)  {
+		if (tmpnid >= nid)
+			break;
+		free_page_cgroup(NODE_DATA(tmpnid)->node_page_cgroup);
+	}
+
+	return -ENOMEM;
 }
 
 #else /* CONFIG_FLAT_NODE_MEM_MAP */
@@ -105,26 +142,7 @@ struct page_cgroup *lookup_page_cgroup(struct page *page)
 	return section->page_cgroup + pfn;
 }
 
-static void *__meminit alloc_page_cgroup(size_t size, int nid)
-{
-	gfp_t flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN;
-	void *addr = NULL;
-
-	addr = alloc_pages_exact_nid(nid, size, flags);
-	if (addr) {
-		kmemleak_alloc(addr, size, 1, flags);
-		return addr;
-	}
-
-	if (node_state(nid, N_HIGH_MEMORY))
-		addr = vzalloc_node(size, nid);
-	else
-		addr = vzalloc(size);
-
-	return addr;
-}
-
-static int __meminit init_section_page_cgroup(unsigned long pfn, int nid)
+static int init_section_page_cgroup(unsigned long pfn, int nid)
 {
 	struct mem_section *section;
 	struct page_cgroup *base;
@@ -135,7 +153,7 @@ static int __meminit init_section_page_cgroup(unsigned long pfn, int nid)
 	if (section->page_cgroup)
 		return 0;
 
-	table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
+	table_size = page_cgroup_table_size(nid);
 	base = alloc_page_cgroup(table_size, nid);
 
 	/*
@@ -159,20 +177,6 @@ static int __meminit init_section_page_cgroup(unsigned long pfn, int nid)
 	total_usage += table_size;
 	return 0;
 }
-#ifdef CONFIG_MEMORY_HOTPLUG
-static void free_page_cgroup(void *addr)
-{
-	if (is_vmalloc_addr(addr)) {
-		vfree(addr);
-	} else {
-		struct page *page = virt_to_page(addr);
-		size_t table_size =
-			sizeof(struct page_cgroup) * PAGES_PER_SECTION;
-
-		BUG_ON(PageReserved(page));
-		free_pages_exact(addr, table_size);
-	}
-}
 
 void __free_page_cgroup(unsigned long pfn)
 {
@@ -187,6 +191,7 @@ void __free_page_cgroup(unsigned long pfn)
 	ms->page_cgroup = NULL;
 }
 
+#ifdef CONFIG_MEMORY_HOTPLUG
 int __meminit online_page_cgroup(unsigned long start_pfn,
 			unsigned long nr_pages,
 			int nid)
@@ -266,16 +271,16 @@ static int __meminit page_cgroup_callback(struct notifier_block *self,
 
 #endif
 
-void __init page_cgroup_init(void)
+int page_cgroup_init(void)
 {
 	unsigned long pfn;
-	int nid;
+	unsigned long start_pfn, end_pfn;
+	int nid, tmpnid;
 
-	if (mem_cgroup_subsys_disabled())
-		return;
+	if (test_and_set_bit(0, &page_cgroup_initialized))
+		return 0;
 
 	for_each_node_state(nid, N_MEMORY) {
-		unsigned long start_pfn, end_pfn;
 
 		start_pfn = node_start_pfn(nid);
 		end_pfn = node_end_pfn(nid);
@@ -303,16 +308,23 @@ void __init page_cgroup_init(void)
 		}
 	}
 	hotplug_memory_notifier(page_cgroup_callback, 0);
-	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
-	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you "
-			 "don't want memory cgroups\n");
-	return;
+	page_cgroup_msg();
+	return 0;
 oom:
-	printk(KERN_CRIT "try 'cgroup_disable=memory' boot option\n");
-	panic("Out of memory");
+	for_each_node_state(tmpnid, N_MEMORY) {
+		if (tmpnid >= nid)
+			break;
+
+		start_pfn = node_start_pfn(tmpnid);
+		end_pfn = node_end_pfn(tmpnid);
+
+		for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION)
+			__free_page_cgroup(pfn);
+	}
+	return -ENOMEM;
 }
 
-void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
+void pgdat_page_cgroup_init(struct pglist_data *pgdat)
 {
 	return;
 }
-- 
1.8.1.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v2 4/5] memcg: do not call page_cgroup_init at system_boot
@ 2013-03-05 13:10   ` Glauber Costa
  0 siblings, 0 replies; 72+ messages in thread
From: Glauber Costa @ 2013-03-05 13:10 UTC (permalink / raw)
  To: linux-mm-Bw31MaZKKs3YtjvyW6yDsg
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, Andrew Morton,
	Michal Hocko, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	handai.szj-Re5JQEeQqe8AvxtiuMwx3w,
	anton.vorontsov-QSEj5FYQhm4dnm+yROfE0A, Glauber Costa,
	Johannes Weiner, Mel Gorman

If we are not using memcg, there is no reason why we should allocate
this structure, that will be a memory waste at best. We can do better
at least in the sparsemem case, and allocate it when the first cgroup
is requested. It should now not panic on failure, and we have to handle
this right.

flatmem case is a bit more complicated, so that one is left out for
the moment.

Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
CC: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
CC: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
CC: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
---
 include/linux/page_cgroup.h |  28 +++++----
 init/main.c                 |   2 -
 mm/memcontrol.c             |   3 +-
 mm/page_cgroup.c            | 150 ++++++++++++++++++++++++--------------------
 4 files changed, 99 insertions(+), 84 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 777a524..ec9fb05 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -14,6 +14,7 @@ enum {
 
 #ifdef CONFIG_MEMCG
 #include <linux/bit_spinlock.h>
+#include <linux/mmzone.h>
 
 /*
  * Page Cgroup can be considered as an extended mem_map.
@@ -27,19 +28,17 @@ struct page_cgroup {
 	struct mem_cgroup *mem_cgroup;
 };
 
-void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
-
-#ifdef CONFIG_SPARSEMEM
-static inline void __init page_cgroup_init_flatmem(void)
+static inline size_t page_cgroup_table_size(int nid)
 {
-}
-extern void __init page_cgroup_init(void);
+#ifdef CONFIG_SPARSEMEM
+	return sizeof(struct page_cgroup) * PAGES_PER_SECTION;
 #else
-void __init page_cgroup_init_flatmem(void);
-static inline void __init page_cgroup_init(void)
-{
-}
+	return sizeof(struct page_cgroup) * NODE_DATA(nid)->node_spanned_pages;
 #endif
+}
+void pgdat_page_cgroup_init(struct pglist_data *pgdat);
+
+extern int page_cgroup_init(void);
 
 struct page_cgroup *lookup_page_cgroup(struct page *page);
 struct page *lookup_cgroup_page(struct page_cgroup *pc);
@@ -85,7 +84,7 @@ static inline void unlock_page_cgroup(struct page_cgroup *pc)
 #else /* CONFIG_MEMCG */
 struct page_cgroup;
 
-static inline void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
+static inline void pgdat_page_cgroup_init(struct pglist_data *pgdat)
 {
 }
 
@@ -94,7 +93,12 @@ static inline struct page_cgroup *lookup_page_cgroup(struct page *page)
 	return NULL;
 }
 
-static inline void page_cgroup_init(void)
+static inline int page_cgroup_init(void)
+{
+	return 0;
+}
+
+static inline void page_cgroup_destroy(void)
 {
 }
 
diff --git a/init/main.c b/init/main.c
index cee4b5c..1fb3ec0 100644
--- a/init/main.c
+++ b/init/main.c
@@ -457,7 +457,6 @@ static void __init mm_init(void)
 	 * page_cgroup requires contiguous pages,
 	 * bigger than MAX_ORDER unless SPARSEMEM.
 	 */
-	page_cgroup_init_flatmem();
 	mem_init();
 	kmem_cache_init();
 	percpu_init_late();
@@ -592,7 +591,6 @@ asmlinkage void __init start_kernel(void)
 		initrd_start = 0;
 	}
 #endif
-	page_cgroup_init();
 	debug_objects_mem_init();
 	kmemleak_init();
 	setup_per_cpu_pageset();
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 45c1886..6019a32 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6377,7 +6377,8 @@ mem_cgroup_css_alloc(struct cgroup *cont)
 		res_counter_init(&memcg->res, NULL);
 		res_counter_init(&memcg->memsw, NULL);
 		res_counter_init(&memcg->kmem, NULL);
-	}
+	} else if (page_cgroup_init())
+		goto free_out;
 
 	memcg->last_scanned_node = MAX_NUMNODES;
 	INIT_LIST_HEAD(&memcg->oom_notify);
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index a5bd322..6d04c28 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -12,11 +12,50 @@
 #include <linux/kmemleak.h>
 
 static unsigned long total_usage;
+static unsigned long page_cgroup_initialized;
 
-#if !defined(CONFIG_SPARSEMEM)
+static void *alloc_page_cgroup(size_t size, int nid)
+{
+	gfp_t flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN;
+	void *addr = NULL;
+
+	addr = alloc_pages_exact_nid(nid, size, flags);
+	if (addr) {
+		kmemleak_alloc(addr, size, 1, flags);
+		return addr;
+	}
+
+	if (node_state(nid, N_HIGH_MEMORY))
+		addr = vzalloc_node(size, nid);
+	else
+		addr = vzalloc(size);
 
+	return addr;
+}
 
-void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
+static void free_page_cgroup(void *addr)
+{
+	if (is_vmalloc_addr(addr)) {
+		vfree(addr);
+	} else {
+		struct page *page = virt_to_page(addr);
+		int nid = page_to_nid(page);
+		BUG_ON(PageReserved(page));
+		free_pages_exact(addr, page_cgroup_table_size(nid));
+	}
+}
+
+static void page_cgroup_msg(void)
+{
+	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
+	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you "
+			 "don't want memory cgroups.\nAlternatively, consider "
+			 "deferring your memory cgroups creation.\n");
+}
+
+#if !defined(CONFIG_SPARSEMEM)
+
+void pgdat_page_cgroup_init(struct pglist_data *pgdat)
 {
 	pgdat->node_page_cgroup = NULL;
 }
@@ -42,20 +81,16 @@ struct page_cgroup *lookup_page_cgroup(struct page *page)
 	return base + offset;
 }
 
-static int __init alloc_node_page_cgroup(int nid)
+static int alloc_node_page_cgroup(int nid)
 {
 	struct page_cgroup *base;
 	unsigned long table_size;
-	unsigned long nr_pages;
 
-	nr_pages = NODE_DATA(nid)->node_spanned_pages;
-	if (!nr_pages)
+	table_size = page_cgroup_table_size(nid);
+	if (!table_size)
 		return 0;
 
-	table_size = sizeof(struct page_cgroup) * nr_pages;
-
-	base = __alloc_bootmem_node_nopanic(NODE_DATA(nid),
-			table_size, PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
+	base = alloc_page_cgroup(table_size, nid);
 	if (!base)
 		return -ENOMEM;
 	NODE_DATA(nid)->node_page_cgroup = base;
@@ -63,27 +98,29 @@ static int __init alloc_node_page_cgroup(int nid)
 	return 0;
 }
 
-void __init page_cgroup_init_flatmem(void)
+int page_cgroup_init(void)
 {
+	int nid, fail, tmpnid;
 
-	int nid, fail;
-
-	if (mem_cgroup_subsys_disabled())
-		return;
+	/* only initialize it once */
+	if (test_and_set_bit(0, &page_cgroup_initialized))
+		return 0;
 
 	for_each_online_node(nid)  {
 		fail = alloc_node_page_cgroup(nid);
 		if (fail)
 			goto fail;
 	}
-	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
-	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you"
-	" don't want memory cgroups\n");
-	return;
+	page_cgroup_msg();
+	return 0;
 fail:
-	printk(KERN_CRIT "allocation of page_cgroup failed.\n");
-	printk(KERN_CRIT "please try 'cgroup_disable=memory' boot option\n");
-	panic("Out of memory");
+	for_each_online_node(tmpnid)  {
+		if (tmpnid >= nid)
+			break;
+		free_page_cgroup(NODE_DATA(tmpnid)->node_page_cgroup);
+	}
+
+	return -ENOMEM;
 }
 
 #else /* CONFIG_FLAT_NODE_MEM_MAP */
@@ -105,26 +142,7 @@ struct page_cgroup *lookup_page_cgroup(struct page *page)
 	return section->page_cgroup + pfn;
 }
 
-static void *__meminit alloc_page_cgroup(size_t size, int nid)
-{
-	gfp_t flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN;
-	void *addr = NULL;
-
-	addr = alloc_pages_exact_nid(nid, size, flags);
-	if (addr) {
-		kmemleak_alloc(addr, size, 1, flags);
-		return addr;
-	}
-
-	if (node_state(nid, N_HIGH_MEMORY))
-		addr = vzalloc_node(size, nid);
-	else
-		addr = vzalloc(size);
-
-	return addr;
-}
-
-static int __meminit init_section_page_cgroup(unsigned long pfn, int nid)
+static int init_section_page_cgroup(unsigned long pfn, int nid)
 {
 	struct mem_section *section;
 	struct page_cgroup *base;
@@ -135,7 +153,7 @@ static int __meminit init_section_page_cgroup(unsigned long pfn, int nid)
 	if (section->page_cgroup)
 		return 0;
 
-	table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
+	table_size = page_cgroup_table_size(nid);
 	base = alloc_page_cgroup(table_size, nid);
 
 	/*
@@ -159,20 +177,6 @@ static int __meminit init_section_page_cgroup(unsigned long pfn, int nid)
 	total_usage += table_size;
 	return 0;
 }
-#ifdef CONFIG_MEMORY_HOTPLUG
-static void free_page_cgroup(void *addr)
-{
-	if (is_vmalloc_addr(addr)) {
-		vfree(addr);
-	} else {
-		struct page *page = virt_to_page(addr);
-		size_t table_size =
-			sizeof(struct page_cgroup) * PAGES_PER_SECTION;
-
-		BUG_ON(PageReserved(page));
-		free_pages_exact(addr, table_size);
-	}
-}
 
 void __free_page_cgroup(unsigned long pfn)
 {
@@ -187,6 +191,7 @@ void __free_page_cgroup(unsigned long pfn)
 	ms->page_cgroup = NULL;
 }
 
+#ifdef CONFIG_MEMORY_HOTPLUG
 int __meminit online_page_cgroup(unsigned long start_pfn,
 			unsigned long nr_pages,
 			int nid)
@@ -266,16 +271,16 @@ static int __meminit page_cgroup_callback(struct notifier_block *self,
 
 #endif
 
-void __init page_cgroup_init(void)
+int page_cgroup_init(void)
 {
 	unsigned long pfn;
-	int nid;
+	unsigned long start_pfn, end_pfn;
+	int nid, tmpnid;
 
-	if (mem_cgroup_subsys_disabled())
-		return;
+	if (test_and_set_bit(0, &page_cgroup_initialized))
+		return 0;
 
 	for_each_node_state(nid, N_MEMORY) {
-		unsigned long start_pfn, end_pfn;
 
 		start_pfn = node_start_pfn(nid);
 		end_pfn = node_end_pfn(nid);
@@ -303,16 +308,23 @@ void __init page_cgroup_init(void)
 		}
 	}
 	hotplug_memory_notifier(page_cgroup_callback, 0);
-	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
-	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you "
-			 "don't want memory cgroups\n");
-	return;
+	page_cgroup_msg();
+	return 0;
 oom:
-	printk(KERN_CRIT "try 'cgroup_disable=memory' boot option\n");
-	panic("Out of memory");
+	for_each_node_state(tmpnid, N_MEMORY) {
+		if (tmpnid >= nid)
+			break;
+
+		start_pfn = node_start_pfn(tmpnid);
+		end_pfn = node_end_pfn(tmpnid);
+
+		for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION)
+			__free_page_cgroup(pfn);
+	}
+	return -ENOMEM;
 }
 
-void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
+void pgdat_page_cgroup_init(struct pglist_data *pgdat)
 {
 	return;
 }
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v2 5/5] memcg: do not walk all the way to the root for memcg
@ 2013-03-05 13:10   ` Glauber Costa
  0 siblings, 0 replies; 72+ messages in thread
From: Glauber Costa @ 2013-03-05 13:10 UTC (permalink / raw)
  To: linux-mm
  Cc: cgroups, Tejun Heo, Andrew Morton, Michal Hocko, kamezawa.hiroyu,
	handai.szj, anton.vorontsov, Glauber Costa, Johannes Weiner,
	Mel Gorman

Since the root is special anyway, and we always get its figures from
global counters anyway, there is no make all cgroups its descendants,
wrt res_counters. The sad effect of doing that is that we need to lock
the root for all allocations, since it is a common ancestor of
everybody.

Not having the root as a common ancestor should lead to better
scalability for not-uncommon case of tasks in the cgroup being
node-bound to different nodes in NUMA systems.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Michal Hocko <mhocko@suse.cz>
CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: Johannes Weiner <hannes@cmpxchg.org>
CC: Mel Gorman <mgorman@suse.de>
CC: Andrew Morton <akpm@linux-foundation.org>
---
 mm/memcontrol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6019a32..252dc00 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6464,7 +6464,7 @@ mem_cgroup_css_online(struct cgroup *cont)
 	memcg->oom_kill_disable = parent->oom_kill_disable;
 	memcg->swappiness = mem_cgroup_swappiness(parent);
 
-	if (parent->use_hierarchy) {
+	if (parent && !mem_cgroup_is_root(parent) && parent->use_hierarchy) {
 		res_counter_init(&memcg->res, &parent->res);
 		res_counter_init(&memcg->memsw, &parent->memsw);
 		res_counter_init(&memcg->kmem, &parent->kmem);
-- 
1.8.1.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v2 5/5] memcg: do not walk all the way to the root for memcg
@ 2013-03-05 13:10   ` Glauber Costa
  0 siblings, 0 replies; 72+ messages in thread
From: Glauber Costa @ 2013-03-05 13:10 UTC (permalink / raw)
  To: linux-mm-Bw31MaZKKs3YtjvyW6yDsg
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, Andrew Morton,
	Michal Hocko, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	handai.szj-Re5JQEeQqe8AvxtiuMwx3w,
	anton.vorontsov-QSEj5FYQhm4dnm+yROfE0A, Glauber Costa,
	Johannes Weiner, Mel Gorman

Since the root is special anyway, and we always get its figures from
global counters anyway, there is no make all cgroups its descendants,
wrt res_counters. The sad effect of doing that is that we need to lock
the root for all allocations, since it is a common ancestor of
everybody.

Not having the root as a common ancestor should lead to better
scalability for not-uncommon case of tasks in the cgroup being
node-bound to different nodes in NUMA systems.

Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
CC: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
CC: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
CC: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
---
 mm/memcontrol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6019a32..252dc00 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6464,7 +6464,7 @@ mem_cgroup_css_online(struct cgroup *cont)
 	memcg->oom_kill_disable = parent->oom_kill_disable;
 	memcg->swappiness = mem_cgroup_swappiness(parent);
 
-	if (parent->use_hierarchy) {
+	if (parent && !mem_cgroup_is_root(parent) && parent->use_hierarchy) {
 		res_counter_init(&memcg->res, &parent->res);
 		res_counter_init(&memcg->memsw, &parent->memsw);
 		res_counter_init(&memcg->kmem, &parent->kmem);
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 1/5] memcg: make nocpu_base available for non hotplug
@ 2013-03-06  0:04     ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 72+ messages in thread
From: Kamezawa Hiroyuki @ 2013-03-06  0:04 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, Tejun Heo, Andrew Morton, Michal Hocko,
	handai.szj, anton.vorontsov, Johannes Weiner

(2013/03/05 22:10), Glauber Costa wrote:
> We are using nocpu_base to accumulate charges on the main counters
> during cpu hotplug. I have a similar need, which is transferring charges
> to the root cgroup when lazily enabling memcg. Because system wide
> information is not kept per-cpu, it is hard to distribute it. This field
> works well for this. So we need to make it available for all usages, not
> only hotplug cases.
> 
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tejun Heo <tj@kernel.org>

Acked-by: KAMEZAWA Hiroyuki<kamezawa.hiroyu@jp.fujitsu.com>

Hmm..comments on nocpu_base definition will be updated in later patch ?

> ---
>   mm/memcontrol.c | 8 ++++----
>   1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 669d16a..b8b363f 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -921,11 +921,11 @@ static long mem_cgroup_read_stat(struct mem_cgroup *memcg,
>   	get_online_cpus();
>   	for_each_online_cpu(cpu)
>   		val += per_cpu(memcg->stat->count[idx], cpu);
> -#ifdef CONFIG_HOTPLUG_CPU
> +
>   	spin_lock(&memcg->pcp_counter_lock);
>   	val += memcg->nocpu_base.count[idx];
>   	spin_unlock(&memcg->pcp_counter_lock);
> -#endif
> +
>   	put_online_cpus();
>   	return val;
>   }
> @@ -945,11 +945,11 @@ static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
>   
>   	for_each_online_cpu(cpu)
>   		val += per_cpu(memcg->stat->events[idx], cpu);
> -#ifdef CONFIG_HOTPLUG_CPU
> +
>   	spin_lock(&memcg->pcp_counter_lock);
>   	val += memcg->nocpu_base.events[idx];
>   	spin_unlock(&memcg->pcp_counter_lock);
> -#endif
> +
>   	return val;
>   }
>   
> 


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 1/5] memcg: make nocpu_base available for non hotplug
@ 2013-03-06  0:04     ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 72+ messages in thread
From: Kamezawa Hiroyuki @ 2013-03-06  0:04 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Tejun Heo, Andrew Morton, Michal Hocko,
	handai.szj-Re5JQEeQqe8AvxtiuMwx3w,
	anton.vorontsov-QSEj5FYQhm4dnm+yROfE0A, Johannes Weiner

(2013/03/05 22:10), Glauber Costa wrote:
> We are using nocpu_base to accumulate charges on the main counters
> during cpu hotplug. I have a similar need, which is transferring charges
> to the root cgroup when lazily enabling memcg. Because system wide
> information is not kept per-cpu, it is hard to distribute it. This field
> works well for this. So we need to make it available for all usages, not
> only hotplug cases.
> 
> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> Cc: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

Acked-by: KAMEZAWA Hiroyuki<kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>

Hmm..comments on nocpu_base definition will be updated in later patch ?

> ---
>   mm/memcontrol.c | 8 ++++----
>   1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 669d16a..b8b363f 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -921,11 +921,11 @@ static long mem_cgroup_read_stat(struct mem_cgroup *memcg,
>   	get_online_cpus();
>   	for_each_online_cpu(cpu)
>   		val += per_cpu(memcg->stat->count[idx], cpu);
> -#ifdef CONFIG_HOTPLUG_CPU
> +
>   	spin_lock(&memcg->pcp_counter_lock);
>   	val += memcg->nocpu_base.count[idx];
>   	spin_unlock(&memcg->pcp_counter_lock);
> -#endif
> +
>   	put_online_cpus();
>   	return val;
>   }
> @@ -945,11 +945,11 @@ static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
>   
>   	for_each_online_cpu(cpu)
>   		val += per_cpu(memcg->stat->events[idx], cpu);
> -#ifdef CONFIG_HOTPLUG_CPU
> +
>   	spin_lock(&memcg->pcp_counter_lock);
>   	val += memcg->nocpu_base.events[idx];
>   	spin_unlock(&memcg->pcp_counter_lock);
> -#endif
> +
>   	return val;
>   }
>   
> 


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/5] memcg: provide root figures from system totals
  2013-03-05 13:10   ` Glauber Costa
  (?)
@ 2013-03-06  0:27   ` Kamezawa Hiroyuki
  2013-03-06  8:30       ` Glauber Costa
  -1 siblings, 1 reply; 72+ messages in thread
From: Kamezawa Hiroyuki @ 2013-03-06  0:27 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, Tejun Heo, Andrew Morton, Michal Hocko,
	handai.szj, anton.vorontsov, Johannes Weiner, Mel Gorman

(2013/03/05 22:10), Glauber Costa wrote:
> For the root memcg, there is no need to rely on the res_counters if hierarchy
> is enabled The sum of all mem cgroups plus the tasks in root itself, is
> necessarily the amount of memory used for the whole system. Since those figures
> are already kept somewhere anyway, we can just return them here, without too
> much hassle.
> 
> Limit and soft limit can't be set for the root cgroup, so they are left at
> RESOURCE_MAX. Failcnt is left at 0, because its actual meaning is how many
> times we failed allocations due to the limit being hit. We will fail
> allocations in the root cgroup, but the limit will never the reason.
> 
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: Michal Hocko <mhocko@suse.cz>
> CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Johannes Weiner <hannes@cmpxchg.org>
> CC: Mel Gorman <mgorman@suse.de>
> CC: Andrew Morton <akpm@linux-foundation.org>

I think this patch's calculation is wrong.

> ---
>   mm/memcontrol.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 64 insertions(+)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b8b363f..bfbf1c2 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4996,6 +4996,56 @@ static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
>   	return val << PAGE_SHIFT;
>   }
>   
> +static u64 memcg_read_root_rss(void)
> +{
> +	struct task_struct *p;
> +
> +	u64 rss = 0;
> +	read_lock(&tasklist_lock);
> +	for_each_process(p) {
> +		if (!p->mm)
> +			continue;
> +		task_lock(p);
> +		rss += get_mm_rss(p->mm);
> +		task_unlock(p);
> +	}
> +	read_unlock(&tasklist_lock);
> +	return rss;
> +}

I think you can use rcu_read_lock() instead of tasklist_lock.
Isn't it enough to use NR_ANON_LRU rather than this ?

> +
> +static u64 mem_cgroup_read_root(enum res_type type, int name)
> +{
> +	if (name == RES_LIMIT)
> +		return RESOURCE_MAX;
> +	if (name == RES_SOFT_LIMIT)
> +		return RESOURCE_MAX;
> +	if (name == RES_FAILCNT)
> +		return 0;
> +	if (name == RES_MAX_USAGE)
> +		return 0;
> +
> +	if (WARN_ON_ONCE(name != RES_USAGE))
> +		return 0;
> +
> +	switch (type) {
> +	case _MEM:
> +		return (memcg_read_root_rss() +
> +		atomic_long_read(&vm_stat[NR_FILE_PAGES])) << PAGE_SHIFT;
> +	case _MEMSWAP: {
> +		struct sysinfo i;
> +		si_swapinfo(&i);
> +
> +		return ((memcg_read_root_rss() +
> +		atomic_long_read(&vm_stat[NR_FILE_PAGES])) << PAGE_SHIFT) +
> +		i.totalswap - i.freeswap;

How swapcache is handled ? ...and How kmem works with this calc ?

Thanks,
-Kame

> +	}
> +	case _KMEM:
> +		return 0;
> +	default:
> +		BUG();
> +	};
> +}
> +
>   static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
>   			       struct file *file, char __user *buf,
>   			       size_t nbytes, loff_t *ppos)
> @@ -5012,6 +5062,19 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
>   	if (!do_swap_account && type == _MEMSWAP)
>   		return -EOPNOTSUPP;
>   
> +	/*
> +	 * If we have root-level hierarchy, we can be certain that the charges
> +	 * in root are always global. We can then bypass the root cgroup
> +	 * entirely in this case, hopefuly leading to less contention in the
> +	 * root res_counters. The charges presented after reading it will
> +	 * always be the global charges.
> +	 */
> +	if (mem_cgroup_disabled() ||
> +		(mem_cgroup_is_root(memcg) && memcg->use_hierarchy)) {
> +		val = mem_cgroup_read_root(type, name);
> +		goto root_bypass;
> +	}
> +
>   	switch (type) {
>   	case _MEM:
>   		if (name == RES_USAGE)
> @@ -5032,6 +5095,7 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
>   		BUG();
>   	}
>   
> +root_bypass:
>   	len = scnprintf(str, sizeof(str), "%llu\n", (unsigned long long)val);
>   	return simple_read_from_buffer(buf, nbytes, ppos, str, len);
>   }
> 


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 3/5] memcg: make it suck faster
@ 2013-03-06  0:46     ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 72+ messages in thread
From: Kamezawa Hiroyuki @ 2013-03-06  0:46 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, Tejun Heo, Andrew Morton, Michal Hocko,
	handai.szj, anton.vorontsov, Johannes Weiner, Mel Gorman

(2013/03/05 22:10), Glauber Costa wrote:
> It is an accepted fact that memcg sucks. But can it suck faster?  Or in
> a more fair statement, can it at least stop draining everyone's
> performance when it is not in use?
> 
> This experimental and slightly crude patch demonstrates that we can do
> that by using static branches to patch it out until the first memcg
> comes to life. There are edges to be trimmed, and I appreciate comments
> for direction. In particular, the events in the root are not fired, but
> I believe this can be done without further problems by calling a
> specialized event check from mem_cgroup_newpage_charge().
> 
> My goal was to have enough numbers to demonstrate the performance gain
> that can come from it. I tested it in a 24-way 2-socket Intel box, 24 Gb
> mem. I used Mel Gorman's pft test, that he used to demonstrate this
> problem back in the Kernel Summit. There are three kernels:
> 
> nomemcg  : memcg compile disabled.
> base     : memcg enabled, patch not applied.
> bypassed : memcg enabled, with patch applied.
> 
>                  base    bypassed
> User          109.12      105.64
> System       1646.84     1597.98
> Elapsed       229.56      215.76
> 
>               nomemcg    bypassed
> User          104.35      105.64
> System       1578.19     1597.98
> Elapsed       212.33      215.76
> 
> So as one can see, the difference between base and nomemcg in terms
> of both system time and elapsed time is quite drastic, and consistent
> with the figures shown by Mel Gorman in the Kernel summit. This is a
> ~ 7 % drop in performance, just by having memcg enabled. memcg functions
> appear heavily in the profiles, even if all tasks lives in the root
> memcg.
> 
> With bypassed kernel, we drop this down to 1.5 %, which starts to fall
> in the acceptable range. More investigation is needed to see if we can
> claim that last percent back, but I believe at last part of it should
> be.
> 
seems nice.

> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: Michal Hocko <mhocko@suse.cz>
> CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Johannes Weiner <hannes@cmpxchg.org>
> CC: Mel Gorman <mgorman@suse.de>
> CC: Andrew Morton <akpm@linux-foundation.org>

After quick look, it seems most parts are good. But I have a concern.

At memcg enablement, you move the numbers from vm_stat[] to res_counters.

Why you need it ? It's not explained.
And if it's necessary, uncharge will leak because page_cgroup is not marked
as PCG_USED, pc->mem_cgroup == NULL. So, res.usage will not be decreased.

Could you fix it if you need to move numbers to res_counter ?

Thanks,
-Kame





> ---
>   include/linux/memcontrol.h |  72 ++++++++++++++++----
>   mm/memcontrol.c            | 166 +++++++++++++++++++++++++++++++++++++++++----
>   mm/page_cgroup.c           |   4 +-
>   3 files changed, 216 insertions(+), 26 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index d6183f0..009f925 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -42,6 +42,26 @@ struct mem_cgroup_reclaim_cookie {
>   };
>   
>   #ifdef CONFIG_MEMCG
> +extern struct static_key memcg_in_use_key;
> +
> +static inline bool mem_cgroup_subsys_disabled(void)
> +{
> +	return !!mem_cgroup_subsys.disabled;
> +}
> +
> +static inline bool mem_cgroup_disabled(void)
> +{
> +	/*
> +	 * Will always be false if subsys is disabled, because we have no one
> +	 * to bump it up. So the test suffices and we don't have to test the
> +	 * subsystem as well
> +	 */
> +	if (!static_key_false(&memcg_in_use_key))
> +		return true;
> +	return false;
> +}
> +
> +
>   /*
>    * All "charge" functions with gfp_mask should use GFP_KERNEL or
>    * (gfp_mask & GFP_RECLAIM_MASK). In current implementatin, memcg doesn't
> @@ -53,8 +73,18 @@ struct mem_cgroup_reclaim_cookie {
>    * (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
>    */
>   
> -extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
> +extern int __mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
>   				gfp_t gfp_mask);
> +
> +static inline int
> +mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
> +			  gfp_t gfp_mask)
> +{
> +	if (mem_cgroup_disabled())
> +		return 0;
> +	return __mem_cgroup_newpage_charge(page, mm, gfp_mask);
> +}
> +
>   /* for swap handling */
>   extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
>   		struct page *page, gfp_t mask, struct mem_cgroup **memcgp);
> @@ -62,8 +92,17 @@ extern void mem_cgroup_commit_charge_swapin(struct page *page,
>   					struct mem_cgroup *memcg);
>   extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);
>   
> -extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> -					gfp_t gfp_mask);
> +
> +extern int __mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> +				     gfp_t gfp_mask);
> +static inline int
> +mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
> +{
> +	if (mem_cgroup_disabled())
> +		return 0;
> +
> +	return __mem_cgroup_cache_charge(page, mm, gfp_mask);
> +}
>   
>   struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
>   struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
> @@ -72,8 +111,24 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
>   extern void mem_cgroup_uncharge_start(void);
>   extern void mem_cgroup_uncharge_end(void);
>   
> -extern void mem_cgroup_uncharge_page(struct page *page);
> -extern void mem_cgroup_uncharge_cache_page(struct page *page);
> +extern void __mem_cgroup_uncharge_page(struct page *page);
> +extern void __mem_cgroup_uncharge_cache_page(struct page *page);
> +
> +static inline void mem_cgroup_uncharge_page(struct page *page)
> +{
> +	if (mem_cgroup_disabled())
> +		return;
> +
> +	__mem_cgroup_uncharge_page(page);
> +}
> +
> +static inline void mem_cgroup_uncharge_cache_page(struct page *page)
> +{
> +	if (mem_cgroup_disabled())
> +		return;
> +
> +	__mem_cgroup_uncharge_cache_page(page);
> +}
>   
>   bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
>   				  struct mem_cgroup *memcg);
> @@ -128,13 +183,6 @@ extern void mem_cgroup_replace_page_cache(struct page *oldpage,
>   extern int do_swap_account;
>   #endif
>   
> -static inline bool mem_cgroup_disabled(void)
> -{
> -	if (mem_cgroup_subsys.disabled)
> -		return true;
> -	return false;
> -}
> -
>   void __mem_cgroup_begin_update_page_stat(struct page *page, bool *locked,
>   					 unsigned long *flags);
>   
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index bfbf1c2..45c1886 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -575,6 +575,9 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
>   	return (memcg == root_mem_cgroup);
>   }
>   
> +static bool memcg_charges_allowed = false;
> +struct static_key memcg_in_use_key;
> +
>   /* Writing them here to avoid exposing memcg's inner layout */
>   #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
>   
> @@ -710,6 +713,7 @@ static void disarm_static_keys(struct mem_cgroup *memcg)
>   {
>   	disarm_sock_keys(memcg);
>   	disarm_kmem_keys(memcg);
> +	static_key_slow_dec(&memcg_in_use_key);
>   }
>   
>   static void drain_all_stock_async(struct mem_cgroup *memcg);
> @@ -1109,6 +1113,9 @@ struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
>   	if (unlikely(!p))
>   		return NULL;
>   
> +	if (mem_cgroup_disabled())
> +		return root_mem_cgroup;
> +
>   	return mem_cgroup_from_css(task_subsys_state(p, mem_cgroup_subsys_id));
>   }
>   
> @@ -1157,9 +1164,12 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>   	struct mem_cgroup *memcg = NULL;
>   	int id = 0;
>   
> -	if (mem_cgroup_disabled())
> +	if (mem_cgroup_subsys_disabled())
>   		return NULL;
>   
> +	if (mem_cgroup_disabled())
> +		return root_mem_cgroup;
> +
>   	if (!root)
>   		root = root_mem_cgroup;
>   
> @@ -1335,6 +1345,20 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
>   	memcg = pc->mem_cgroup;
>   
>   	/*
> +	 * Because we lazily enable memcg only after first child group is
> +	 * created, we can have memcg == 0. Because page cgroup is created with
> +	 * GFP_ZERO, and after charging, all page cgroups will have a non-zero
> +	 * cgroup attached (even if root), we can be sure that this is a
> +	 * used-but-not-accounted page. (due to lazyness). We could get around
> +	 * that by scanning all pages on cgroup init is too expensive. We can
> +	 * ultimately pay, but prefer to just to defer the update until we get
> +	 * here. We could take the opportunity to set PageCgroupUsed, but it
> +	 * won't be that important for the root cgroup.
> +	 */
> +	if (!memcg && PageLRU(page))
> +		pc->mem_cgroup = memcg = root_mem_cgroup;
> +

Hmm, more problems with memcg==NULL may happen ;)


> +	/*
>   	 * Surreptitiously switch any uncharged offlist page to root:
>   	 * an uncharged page off lru does nothing to secure
>   	 * its former mem_cgroup from sudden removal.
> @@ -3845,11 +3869,18 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
>   	return 0;
>   }
>   
> -int mem_cgroup_newpage_charge(struct page *page,
> +int __mem_cgroup_newpage_charge(struct page *page,
>   			      struct mm_struct *mm, gfp_t gfp_mask)
>   {
> -	if (mem_cgroup_disabled())
> +	/*
> +	 * The branch is actually very likely before the first memcg comes in.
> +	 * But since the code is patched out, we'll never reach it. It is only
> +	 * reachable when the code is patched in, and in that case it is
> +	 * unlikely.  It will only happen during initial charges move.
> +	 */
> +	if (unlikely(!memcg_charges_allowed))
>   		return 0;
> +
>   	VM_BUG_ON(page_mapped(page));
>   	VM_BUG_ON(page->mapping && !PageAnon(page));
>   	VM_BUG_ON(!mm);
> @@ -3962,15 +3993,13 @@ void mem_cgroup_commit_charge_swapin(struct page *page,
>   					  MEM_CGROUP_CHARGE_TYPE_ANON);
>   }
>   
> -int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> -				gfp_t gfp_mask)
> +int __mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> +			      gfp_t gfp_mask)
>   {
>   	struct mem_cgroup *memcg = NULL;
>   	enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE;
>   	int ret;
>   
> -	if (mem_cgroup_disabled())
> -		return 0;
>   	if (PageCompound(page))
>   		return 0;
>   
> @@ -4050,9 +4079,6 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype,
>   	struct page_cgroup *pc;
>   	bool anon;
>   
> -	if (mem_cgroup_disabled())
> -		return NULL;
> -
>   	VM_BUG_ON(PageSwapCache(page));
>   
>   	if (PageTransHuge(page)) {
> @@ -4144,7 +4170,7 @@ unlock_out:
>   	return NULL;
>   }
>   
> -void mem_cgroup_uncharge_page(struct page *page)
> +void __mem_cgroup_uncharge_page(struct page *page)
>   {
>   	/* early check. */
>   	if (page_mapped(page))
> @@ -4155,7 +4181,7 @@ void mem_cgroup_uncharge_page(struct page *page)
>   	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_ANON, false);
>   }
>   
> -void mem_cgroup_uncharge_cache_page(struct page *page)
> +void __mem_cgroup_uncharge_cache_page(struct page *page)
>   {
>   	VM_BUG_ON(page_mapped(page));
>   	VM_BUG_ON(page->mapping);
> @@ -4220,6 +4246,9 @@ mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
>   	struct mem_cgroup *memcg;
>   	int ctype = MEM_CGROUP_CHARGE_TYPE_SWAPOUT;
>   
> +	if (mem_cgroup_disabled())
> +		return;
> +
>   	if (!swapout) /* this was a swap cache but the swap is unused ! */
>   		ctype = MEM_CGROUP_CHARGE_TYPE_DROP;
>   
> @@ -6364,6 +6393,59 @@ free_out:
>   	return ERR_PTR(error);
>   }
>   
> +static void memcg_update_root_statistics(void)
> +{
> +	int cpu;
> +	u64 pgin, pgout, faults, mjfaults;
> +
> +	pgin = pgout = faults = mjfaults = 0;
> +	for_each_online_cpu(cpu) {
> +		struct vm_event_state *ev = &per_cpu(vm_event_states, cpu);
> +		struct mem_cgroup_stat_cpu *memcg_stat;
> +
> +		memcg_stat = per_cpu_ptr(root_mem_cgroup->stat, cpu);
> +
> +		memcg_stat->events[MEM_CGROUP_EVENTS_PGPGIN] =
> +							ev->event[PGPGIN];
> +		memcg_stat->events[MEM_CGROUP_EVENTS_PGPGOUT] =
> +							ev->event[PGPGOUT];
> +		memcg_stat->events[MEM_CGROUP_EVENTS_PGFAULT] =
> +							ev->event[PGFAULT];
> +		memcg_stat->events[MEM_CGROUP_EVENTS_PGMAJFAULT] =
> +							ev->event[PGMAJFAULT];
> +
> +		memcg_stat->nr_page_events = ev->event[PGPGIN] +
> +					     ev->event[PGPGOUT];
> +	}
> +
> +	root_mem_cgroup->nocpu_base.count[MEM_CGROUP_STAT_RSS] =
> +				memcg_read_root_rss();
> +	root_mem_cgroup->nocpu_base.count[MEM_CGROUP_STAT_CACHE] =
> +				atomic_long_read(&vm_stat[NR_FILE_PAGES]);
> +	root_mem_cgroup->nocpu_base.count[MEM_CGROUP_STAT_FILE_MAPPED] =
> +				atomic_long_read(&vm_stat[NR_FILE_MAPPED]);
> +}
> +
> +static void memcg_update_root_lru(void)
> +{
> +	struct zone *zone;
> +	struct lruvec *lruvec;
> +	struct mem_cgroup_per_zone *mz;
> +	enum lru_list lru;
> +
> +	for_each_populated_zone(zone) {
> +		spin_lock_irq(&zone->lru_lock);
> +		lruvec = &zone->lruvec;
> +		mz = mem_cgroup_zoneinfo(root_mem_cgroup,
> +				zone_to_nid(zone), zone_idx(zone));
> +
> +		for (lru = LRU_BASE; lru < NR_LRU_LISTS; lru++)
> +			mz->lru_size[lru] =
> +				zone_page_state(zone, NR_LRU_BASE + lru);
> +		spin_unlock_irq(&zone->lru_lock);
> +	}
> +}
> +
>   static int
>   mem_cgroup_css_online(struct cgroup *cont)
>   {
> @@ -6407,6 +6489,66 @@ mem_cgroup_css_online(struct cgroup *cont)
>   	}
>   
>   	error = memcg_init_kmem(memcg, &mem_cgroup_subsys);
> +
> +	if (!error) {
> +		static_key_slow_inc(&memcg_in_use_key);
> +		/*
> +		 * The strategy to avoid races here is to let the charges just
> +		 * be globally made until we lock the res counter. Since we are
> +		 * copying charges from global statistics, it doesn't really
> +		 * matter when we do it, as long as we are consistent. So even
> +		 * after the code is patched in, they will continue being
> +		 * globally charged due to memcg_charges_allowed being set to
> +		 * false.
> +		 *
> +		 * Once we hold the res counter lock, though, we can already
> +		 * safely flip it: We will go through with the charging to the
> +		 * root memcg, but won't be able to actually charge it: we have
> +		 * the lock.
> +		 *
> +		 * This works because the mm stats are only updated after the
> +		 * memcg charging suceeds. If we block the charge by holding
> +		 * the res_counter lock, no other charges will happen in the
> +		 * system until we release it.
> +		 *
> +		 * manipulation always safe because the write side is always
> +		 * under the memcg_mutex.
> +		 */
> +		if (!memcg_charges_allowed) {
> +			struct zone *zone;
> +
> +			get_online_cpus();
> +			spin_lock(&root_mem_cgroup->res.lock);
> +
> +			memcg_charges_allowed = true;
> +
> +			root_mem_cgroup->res.usage =
> +				mem_cgroup_read_root(RES_USAGE, _MEM);
> +			root_mem_cgroup->memsw.usage =
> +				mem_cgroup_read_root(RES_USAGE, _MEMSWAP);
> +			/*
> +			 * The max usage figure is not entirely accurate. The
> +			 * memory may have been higher in the past. But since
> +			 * we don't track that globally, this is the best we
> +			 * can do.
> +			 */
> +			root_mem_cgroup->res.max_usage =
> +					root_mem_cgroup->res.usage;
> +			root_mem_cgroup->memsw.max_usage =
> +					root_mem_cgroup->memsw.usage;
> +
> +			memcg_update_root_statistics();
> +			memcg_update_root_lru();
> +			/*
> +			 * We are now 100 % consistent and all charges are
> +			 * transfered.  New charges should reach the
> +			 * res_counter directly.
> +			 */
> +			spin_unlock(&root_mem_cgroup->res.lock);
> +			put_online_cpus();
> +		}
> +	}
> +
>   	mutex_unlock(&memcg_create_mutex);
>   	if (error) {
>   		/*
> diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
> index 6d757e3..a5bd322 100644
> --- a/mm/page_cgroup.c
> +++ b/mm/page_cgroup.c
> @@ -68,7 +68,7 @@ void __init page_cgroup_init_flatmem(void)
>   
>   	int nid, fail;
>   
> -	if (mem_cgroup_disabled())
> +	if (mem_cgroup_subsys_disabled())
>   		return;
>   
>   	for_each_online_node(nid)  {
> @@ -271,7 +271,7 @@ void __init page_cgroup_init(void)
>   	unsigned long pfn;
>   	int nid;
>   
> -	if (mem_cgroup_disabled())
> +	if (mem_cgroup_subsys_disabled())
>   		return;
>   
>   	for_each_node_state(nid, N_MEMORY) {
> 


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 3/5] memcg: make it suck faster
@ 2013-03-06  0:46     ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 72+ messages in thread
From: Kamezawa Hiroyuki @ 2013-03-06  0:46 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Tejun Heo, Andrew Morton, Michal Hocko,
	handai.szj-Re5JQEeQqe8AvxtiuMwx3w,
	anton.vorontsov-QSEj5FYQhm4dnm+yROfE0A, Johannes Weiner,
	Mel Gorman

(2013/03/05 22:10), Glauber Costa wrote:
> It is an accepted fact that memcg sucks. But can it suck faster?  Or in
> a more fair statement, can it at least stop draining everyone's
> performance when it is not in use?
> 
> This experimental and slightly crude patch demonstrates that we can do
> that by using static branches to patch it out until the first memcg
> comes to life. There are edges to be trimmed, and I appreciate comments
> for direction. In particular, the events in the root are not fired, but
> I believe this can be done without further problems by calling a
> specialized event check from mem_cgroup_newpage_charge().
> 
> My goal was to have enough numbers to demonstrate the performance gain
> that can come from it. I tested it in a 24-way 2-socket Intel box, 24 Gb
> mem. I used Mel Gorman's pft test, that he used to demonstrate this
> problem back in the Kernel Summit. There are three kernels:
> 
> nomemcg  : memcg compile disabled.
> base     : memcg enabled, patch not applied.
> bypassed : memcg enabled, with patch applied.
> 
>                  base    bypassed
> User          109.12      105.64
> System       1646.84     1597.98
> Elapsed       229.56      215.76
> 
>               nomemcg    bypassed
> User          104.35      105.64
> System       1578.19     1597.98
> Elapsed       212.33      215.76
> 
> So as one can see, the difference between base and nomemcg in terms
> of both system time and elapsed time is quite drastic, and consistent
> with the figures shown by Mel Gorman in the Kernel summit. This is a
> ~ 7 % drop in performance, just by having memcg enabled. memcg functions
> appear heavily in the profiles, even if all tasks lives in the root
> memcg.
> 
> With bypassed kernel, we drop this down to 1.5 %, which starts to fall
> in the acceptable range. More investigation is needed to see if we can
> claim that last percent back, but I believe at last part of it should
> be.
> 
seems nice.

> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> CC: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> CC: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
> CC: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>

After quick look, it seems most parts are good. But I have a concern.

At memcg enablement, you move the numbers from vm_stat[] to res_counters.

Why you need it ? It's not explained.
And if it's necessary, uncharge will leak because page_cgroup is not marked
as PCG_USED, pc->mem_cgroup == NULL. So, res.usage will not be decreased.

Could you fix it if you need to move numbers to res_counter ?

Thanks,
-Kame





> ---
>   include/linux/memcontrol.h |  72 ++++++++++++++++----
>   mm/memcontrol.c            | 166 +++++++++++++++++++++++++++++++++++++++++----
>   mm/page_cgroup.c           |   4 +-
>   3 files changed, 216 insertions(+), 26 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index d6183f0..009f925 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -42,6 +42,26 @@ struct mem_cgroup_reclaim_cookie {
>   };
>   
>   #ifdef CONFIG_MEMCG
> +extern struct static_key memcg_in_use_key;
> +
> +static inline bool mem_cgroup_subsys_disabled(void)
> +{
> +	return !!mem_cgroup_subsys.disabled;
> +}
> +
> +static inline bool mem_cgroup_disabled(void)
> +{
> +	/*
> +	 * Will always be false if subsys is disabled, because we have no one
> +	 * to bump it up. So the test suffices and we don't have to test the
> +	 * subsystem as well
> +	 */
> +	if (!static_key_false(&memcg_in_use_key))
> +		return true;
> +	return false;
> +}
> +
> +
>   /*
>    * All "charge" functions with gfp_mask should use GFP_KERNEL or
>    * (gfp_mask & GFP_RECLAIM_MASK). In current implementatin, memcg doesn't
> @@ -53,8 +73,18 @@ struct mem_cgroup_reclaim_cookie {
>    * (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
>    */
>   
> -extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
> +extern int __mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
>   				gfp_t gfp_mask);
> +
> +static inline int
> +mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
> +			  gfp_t gfp_mask)
> +{
> +	if (mem_cgroup_disabled())
> +		return 0;
> +	return __mem_cgroup_newpage_charge(page, mm, gfp_mask);
> +}
> +
>   /* for swap handling */
>   extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
>   		struct page *page, gfp_t mask, struct mem_cgroup **memcgp);
> @@ -62,8 +92,17 @@ extern void mem_cgroup_commit_charge_swapin(struct page *page,
>   					struct mem_cgroup *memcg);
>   extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);
>   
> -extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> -					gfp_t gfp_mask);
> +
> +extern int __mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> +				     gfp_t gfp_mask);
> +static inline int
> +mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
> +{
> +	if (mem_cgroup_disabled())
> +		return 0;
> +
> +	return __mem_cgroup_cache_charge(page, mm, gfp_mask);
> +}
>   
>   struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
>   struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
> @@ -72,8 +111,24 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
>   extern void mem_cgroup_uncharge_start(void);
>   extern void mem_cgroup_uncharge_end(void);
>   
> -extern void mem_cgroup_uncharge_page(struct page *page);
> -extern void mem_cgroup_uncharge_cache_page(struct page *page);
> +extern void __mem_cgroup_uncharge_page(struct page *page);
> +extern void __mem_cgroup_uncharge_cache_page(struct page *page);
> +
> +static inline void mem_cgroup_uncharge_page(struct page *page)
> +{
> +	if (mem_cgroup_disabled())
> +		return;
> +
> +	__mem_cgroup_uncharge_page(page);
> +}
> +
> +static inline void mem_cgroup_uncharge_cache_page(struct page *page)
> +{
> +	if (mem_cgroup_disabled())
> +		return;
> +
> +	__mem_cgroup_uncharge_cache_page(page);
> +}
>   
>   bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
>   				  struct mem_cgroup *memcg);
> @@ -128,13 +183,6 @@ extern void mem_cgroup_replace_page_cache(struct page *oldpage,
>   extern int do_swap_account;
>   #endif
>   
> -static inline bool mem_cgroup_disabled(void)
> -{
> -	if (mem_cgroup_subsys.disabled)
> -		return true;
> -	return false;
> -}
> -
>   void __mem_cgroup_begin_update_page_stat(struct page *page, bool *locked,
>   					 unsigned long *flags);
>   
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index bfbf1c2..45c1886 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -575,6 +575,9 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
>   	return (memcg == root_mem_cgroup);
>   }
>   
> +static bool memcg_charges_allowed = false;
> +struct static_key memcg_in_use_key;
> +
>   /* Writing them here to avoid exposing memcg's inner layout */
>   #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
>   
> @@ -710,6 +713,7 @@ static void disarm_static_keys(struct mem_cgroup *memcg)
>   {
>   	disarm_sock_keys(memcg);
>   	disarm_kmem_keys(memcg);
> +	static_key_slow_dec(&memcg_in_use_key);
>   }
>   
>   static void drain_all_stock_async(struct mem_cgroup *memcg);
> @@ -1109,6 +1113,9 @@ struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
>   	if (unlikely(!p))
>   		return NULL;
>   
> +	if (mem_cgroup_disabled())
> +		return root_mem_cgroup;
> +
>   	return mem_cgroup_from_css(task_subsys_state(p, mem_cgroup_subsys_id));
>   }
>   
> @@ -1157,9 +1164,12 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>   	struct mem_cgroup *memcg = NULL;
>   	int id = 0;
>   
> -	if (mem_cgroup_disabled())
> +	if (mem_cgroup_subsys_disabled())
>   		return NULL;
>   
> +	if (mem_cgroup_disabled())
> +		return root_mem_cgroup;
> +
>   	if (!root)
>   		root = root_mem_cgroup;
>   
> @@ -1335,6 +1345,20 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
>   	memcg = pc->mem_cgroup;
>   
>   	/*
> +	 * Because we lazily enable memcg only after first child group is
> +	 * created, we can have memcg == 0. Because page cgroup is created with
> +	 * GFP_ZERO, and after charging, all page cgroups will have a non-zero
> +	 * cgroup attached (even if root), we can be sure that this is a
> +	 * used-but-not-accounted page. (due to lazyness). We could get around
> +	 * that by scanning all pages on cgroup init is too expensive. We can
> +	 * ultimately pay, but prefer to just to defer the update until we get
> +	 * here. We could take the opportunity to set PageCgroupUsed, but it
> +	 * won't be that important for the root cgroup.
> +	 */
> +	if (!memcg && PageLRU(page))
> +		pc->mem_cgroup = memcg = root_mem_cgroup;
> +

Hmm, more problems with memcg==NULL may happen ;)


> +	/*
>   	 * Surreptitiously switch any uncharged offlist page to root:
>   	 * an uncharged page off lru does nothing to secure
>   	 * its former mem_cgroup from sudden removal.
> @@ -3845,11 +3869,18 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
>   	return 0;
>   }
>   
> -int mem_cgroup_newpage_charge(struct page *page,
> +int __mem_cgroup_newpage_charge(struct page *page,
>   			      struct mm_struct *mm, gfp_t gfp_mask)
>   {
> -	if (mem_cgroup_disabled())
> +	/*
> +	 * The branch is actually very likely before the first memcg comes in.
> +	 * But since the code is patched out, we'll never reach it. It is only
> +	 * reachable when the code is patched in, and in that case it is
> +	 * unlikely.  It will only happen during initial charges move.
> +	 */
> +	if (unlikely(!memcg_charges_allowed))
>   		return 0;
> +
>   	VM_BUG_ON(page_mapped(page));
>   	VM_BUG_ON(page->mapping && !PageAnon(page));
>   	VM_BUG_ON(!mm);
> @@ -3962,15 +3993,13 @@ void mem_cgroup_commit_charge_swapin(struct page *page,
>   					  MEM_CGROUP_CHARGE_TYPE_ANON);
>   }
>   
> -int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> -				gfp_t gfp_mask)
> +int __mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> +			      gfp_t gfp_mask)
>   {
>   	struct mem_cgroup *memcg = NULL;
>   	enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE;
>   	int ret;
>   
> -	if (mem_cgroup_disabled())
> -		return 0;
>   	if (PageCompound(page))
>   		return 0;
>   
> @@ -4050,9 +4079,6 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype,
>   	struct page_cgroup *pc;
>   	bool anon;
>   
> -	if (mem_cgroup_disabled())
> -		return NULL;
> -
>   	VM_BUG_ON(PageSwapCache(page));
>   
>   	if (PageTransHuge(page)) {
> @@ -4144,7 +4170,7 @@ unlock_out:
>   	return NULL;
>   }
>   
> -void mem_cgroup_uncharge_page(struct page *page)
> +void __mem_cgroup_uncharge_page(struct page *page)
>   {
>   	/* early check. */
>   	if (page_mapped(page))
> @@ -4155,7 +4181,7 @@ void mem_cgroup_uncharge_page(struct page *page)
>   	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_ANON, false);
>   }
>   
> -void mem_cgroup_uncharge_cache_page(struct page *page)
> +void __mem_cgroup_uncharge_cache_page(struct page *page)
>   {
>   	VM_BUG_ON(page_mapped(page));
>   	VM_BUG_ON(page->mapping);
> @@ -4220,6 +4246,9 @@ mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
>   	struct mem_cgroup *memcg;
>   	int ctype = MEM_CGROUP_CHARGE_TYPE_SWAPOUT;
>   
> +	if (mem_cgroup_disabled())
> +		return;
> +
>   	if (!swapout) /* this was a swap cache but the swap is unused ! */
>   		ctype = MEM_CGROUP_CHARGE_TYPE_DROP;
>   
> @@ -6364,6 +6393,59 @@ free_out:
>   	return ERR_PTR(error);
>   }
>   
> +static void memcg_update_root_statistics(void)
> +{
> +	int cpu;
> +	u64 pgin, pgout, faults, mjfaults;
> +
> +	pgin = pgout = faults = mjfaults = 0;
> +	for_each_online_cpu(cpu) {
> +		struct vm_event_state *ev = &per_cpu(vm_event_states, cpu);
> +		struct mem_cgroup_stat_cpu *memcg_stat;
> +
> +		memcg_stat = per_cpu_ptr(root_mem_cgroup->stat, cpu);
> +
> +		memcg_stat->events[MEM_CGROUP_EVENTS_PGPGIN] =
> +							ev->event[PGPGIN];
> +		memcg_stat->events[MEM_CGROUP_EVENTS_PGPGOUT] =
> +							ev->event[PGPGOUT];
> +		memcg_stat->events[MEM_CGROUP_EVENTS_PGFAULT] =
> +							ev->event[PGFAULT];
> +		memcg_stat->events[MEM_CGROUP_EVENTS_PGMAJFAULT] =
> +							ev->event[PGMAJFAULT];
> +
> +		memcg_stat->nr_page_events = ev->event[PGPGIN] +
> +					     ev->event[PGPGOUT];
> +	}
> +
> +	root_mem_cgroup->nocpu_base.count[MEM_CGROUP_STAT_RSS] =
> +				memcg_read_root_rss();
> +	root_mem_cgroup->nocpu_base.count[MEM_CGROUP_STAT_CACHE] =
> +				atomic_long_read(&vm_stat[NR_FILE_PAGES]);
> +	root_mem_cgroup->nocpu_base.count[MEM_CGROUP_STAT_FILE_MAPPED] =
> +				atomic_long_read(&vm_stat[NR_FILE_MAPPED]);
> +}
> +
> +static void memcg_update_root_lru(void)
> +{
> +	struct zone *zone;
> +	struct lruvec *lruvec;
> +	struct mem_cgroup_per_zone *mz;
> +	enum lru_list lru;
> +
> +	for_each_populated_zone(zone) {
> +		spin_lock_irq(&zone->lru_lock);
> +		lruvec = &zone->lruvec;
> +		mz = mem_cgroup_zoneinfo(root_mem_cgroup,
> +				zone_to_nid(zone), zone_idx(zone));
> +
> +		for (lru = LRU_BASE; lru < NR_LRU_LISTS; lru++)
> +			mz->lru_size[lru] =
> +				zone_page_state(zone, NR_LRU_BASE + lru);
> +		spin_unlock_irq(&zone->lru_lock);
> +	}
> +}
> +
>   static int
>   mem_cgroup_css_online(struct cgroup *cont)
>   {
> @@ -6407,6 +6489,66 @@ mem_cgroup_css_online(struct cgroup *cont)
>   	}
>   
>   	error = memcg_init_kmem(memcg, &mem_cgroup_subsys);
> +
> +	if (!error) {
> +		static_key_slow_inc(&memcg_in_use_key);
> +		/*
> +		 * The strategy to avoid races here is to let the charges just
> +		 * be globally made until we lock the res counter. Since we are
> +		 * copying charges from global statistics, it doesn't really
> +		 * matter when we do it, as long as we are consistent. So even
> +		 * after the code is patched in, they will continue being
> +		 * globally charged due to memcg_charges_allowed being set to
> +		 * false.
> +		 *
> +		 * Once we hold the res counter lock, though, we can already
> +		 * safely flip it: We will go through with the charging to the
> +		 * root memcg, but won't be able to actually charge it: we have
> +		 * the lock.
> +		 *
> +		 * This works because the mm stats are only updated after the
> +		 * memcg charging suceeds. If we block the charge by holding
> +		 * the res_counter lock, no other charges will happen in the
> +		 * system until we release it.
> +		 *
> +		 * manipulation always safe because the write side is always
> +		 * under the memcg_mutex.
> +		 */
> +		if (!memcg_charges_allowed) {
> +			struct zone *zone;
> +
> +			get_online_cpus();
> +			spin_lock(&root_mem_cgroup->res.lock);
> +
> +			memcg_charges_allowed = true;
> +
> +			root_mem_cgroup->res.usage =
> +				mem_cgroup_read_root(RES_USAGE, _MEM);
> +			root_mem_cgroup->memsw.usage =
> +				mem_cgroup_read_root(RES_USAGE, _MEMSWAP);
> +			/*
> +			 * The max usage figure is not entirely accurate. The
> +			 * memory may have been higher in the past. But since
> +			 * we don't track that globally, this is the best we
> +			 * can do.
> +			 */
> +			root_mem_cgroup->res.max_usage =
> +					root_mem_cgroup->res.usage;
> +			root_mem_cgroup->memsw.max_usage =
> +					root_mem_cgroup->memsw.usage;
> +
> +			memcg_update_root_statistics();
> +			memcg_update_root_lru();
> +			/*
> +			 * We are now 100 % consistent and all charges are
> +			 * transfered.  New charges should reach the
> +			 * res_counter directly.
> +			 */
> +			spin_unlock(&root_mem_cgroup->res.lock);
> +			put_online_cpus();
> +		}
> +	}
> +
>   	mutex_unlock(&memcg_create_mutex);
>   	if (error) {
>   		/*
> diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
> index 6d757e3..a5bd322 100644
> --- a/mm/page_cgroup.c
> +++ b/mm/page_cgroup.c
> @@ -68,7 +68,7 @@ void __init page_cgroup_init_flatmem(void)
>   
>   	int nid, fail;
>   
> -	if (mem_cgroup_disabled())
> +	if (mem_cgroup_subsys_disabled())
>   		return;
>   
>   	for_each_online_node(nid)  {
> @@ -271,7 +271,7 @@ void __init page_cgroup_init(void)
>   	unsigned long pfn;
>   	int nid;
>   
> -	if (mem_cgroup_disabled())
> +	if (mem_cgroup_subsys_disabled())
>   		return;
>   
>   	for_each_node_state(nid, N_MEMORY) {
> 


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 4/5] memcg: do not call page_cgroup_init at system_boot
@ 2013-03-06  1:07     ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 72+ messages in thread
From: Kamezawa Hiroyuki @ 2013-03-06  1:07 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, Tejun Heo, Andrew Morton, Michal Hocko,
	handai.szj, anton.vorontsov, Johannes Weiner, Mel Gorman

(2013/03/05 22:10), Glauber Costa wrote:
> If we are not using memcg, there is no reason why we should allocate
> this structure, that will be a memory waste at best. We can do better
> at least in the sparsemem case, and allocate it when the first cgroup
> is requested. It should now not panic on failure, and we have to handle
> this right.
> 
> flatmem case is a bit more complicated, so that one is left out for
> the moment.
> 
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: Michal Hocko <mhocko@suse.cz>
> CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Johannes Weiner <hannes@cmpxchg.org>
> CC: Mel Gorman <mgorman@suse.de>
> CC: Andrew Morton <akpm@linux-foundation.org>
> ---
>   include/linux/page_cgroup.h |  28 +++++----
>   init/main.c                 |   2 -
>   mm/memcontrol.c             |   3 +-
>   mm/page_cgroup.c            | 150 ++++++++++++++++++++++++--------------------
>   4 files changed, 99 insertions(+), 84 deletions(-)

This patch seems a complicated mixture of clean-up and what-you-really-want.

> 
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index 777a524..ec9fb05 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -14,6 +14,7 @@ enum {
>   
>   #ifdef CONFIG_MEMCG
>   #include <linux/bit_spinlock.h>
> +#include <linux/mmzone.h>
>   
>   /*
>    * Page Cgroup can be considered as an extended mem_map.
> @@ -27,19 +28,17 @@ struct page_cgroup {
>   	struct mem_cgroup *mem_cgroup;
>   };
>   
> -void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
> -
> -#ifdef CONFIG_SPARSEMEM
> -static inline void __init page_cgroup_init_flatmem(void)
> +static inline size_t page_cgroup_table_size(int nid)
>   {
> -}
> -extern void __init page_cgroup_init(void);
> +#ifdef CONFIG_SPARSEMEM
> +	return sizeof(struct page_cgroup) * PAGES_PER_SECTION;
>   #else
> -void __init page_cgroup_init_flatmem(void);
> -static inline void __init page_cgroup_init(void)
> -{
> -}
> +	return sizeof(struct page_cgroup) * NODE_DATA(nid)->node_spanned_pages;
>   #endif
> +}
> +void pgdat_page_cgroup_init(struct pglist_data *pgdat);
> +
> +extern int page_cgroup_init(void);
>   
>   struct page_cgroup *lookup_page_cgroup(struct page *page);
>   struct page *lookup_cgroup_page(struct page_cgroup *pc);
> @@ -85,7 +84,7 @@ static inline void unlock_page_cgroup(struct page_cgroup *pc)
>   #else /* CONFIG_MEMCG */
>   struct page_cgroup;
>   
> -static inline void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
> +static inline void pgdat_page_cgroup_init(struct pglist_data *pgdat)
>   {
>   }
>   
> @@ -94,7 +93,12 @@ static inline struct page_cgroup *lookup_page_cgroup(struct page *page)
>   	return NULL;
>   }
>   
> -static inline void page_cgroup_init(void)
> +static inline int page_cgroup_init(void)
> +{
> +	return 0;
> +}
> +
> +static inline void page_cgroup_destroy(void)
>   {
>   }
>   
> diff --git a/init/main.c b/init/main.c
> index cee4b5c..1fb3ec0 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -457,7 +457,6 @@ static void __init mm_init(void)
>   	 * page_cgroup requires contiguous pages,
>   	 * bigger than MAX_ORDER unless SPARSEMEM.
>   	 */
> -	page_cgroup_init_flatmem();
>   	mem_init();
>   	kmem_cache_init();
>   	percpu_init_late();
> @@ -592,7 +591,6 @@ asmlinkage void __init start_kernel(void)
>   		initrd_start = 0;
>   	}
>   #endif
> -	page_cgroup_init();
>   	debug_objects_mem_init();
>   	kmemleak_init();
>   	setup_per_cpu_pageset();
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 45c1886..6019a32 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6377,7 +6377,8 @@ mem_cgroup_css_alloc(struct cgroup *cont)
>   		res_counter_init(&memcg->res, NULL);
>   		res_counter_init(&memcg->memsw, NULL);
>   		res_counter_init(&memcg->kmem, NULL);
> -	}
> +	} else if (page_cgroup_init())
> +		goto free_out;
>   
>   	memcg->last_scanned_node = MAX_NUMNODES;
>   	INIT_LIST_HEAD(&memcg->oom_notify);
> diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
> index a5bd322..6d04c28 100644
> --- a/mm/page_cgroup.c
> +++ b/mm/page_cgroup.c
> @@ -12,11 +12,50 @@
>   #include <linux/kmemleak.h>
>   
>   static unsigned long total_usage;
> +static unsigned long page_cgroup_initialized;
>   
> -#if !defined(CONFIG_SPARSEMEM)
> +static void *alloc_page_cgroup(size_t size, int nid)
> +{
> +	gfp_t flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN;
> +	void *addr = NULL;
> +
> +	addr = alloc_pages_exact_nid(nid, size, flags);
> +	if (addr) {
> +		kmemleak_alloc(addr, size, 1, flags);
> +		return addr;
> +	}

As far as I remember, this function was written for SPARSEMEM.

How big this "size" will be with FLATMEM/DISCONTIGMEM ?
if 16GB, 16 * 1024 * 1024 * 1024 / 4096 * 16 = 64MB. 

What happens if order > MAX_ORDER is passed to alloc_pages()...no warning ?

How about using vmalloc always if not SPARSEMEM ?

> +
> +	if (node_state(nid, N_HIGH_MEMORY))
> +		addr = vzalloc_node(size, nid);
> +	else
> +		addr = vzalloc(size);
>   
> +	return addr;
> +}


>   
> -void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
> +static void free_page_cgroup(void *addr)
> +{
> +	if (is_vmalloc_addr(addr)) {
> +		vfree(addr);
> +	} else {
> +		struct page *page = virt_to_page(addr);
> +		int nid = page_to_nid(page);
> +		BUG_ON(PageReserved(page));

This BUG_ON() can be removed.

> +		free_pages_exact(addr, page_cgroup_table_size(nid));
> +	}
> +}
> +
> +static void page_cgroup_msg(void)
> +{
> +	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
> +	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you "
> +			 "don't want memory cgroups.\nAlternatively, consider "
> +			 "deferring your memory cgroups creation.\n");
> +}

I think this warning can be removed because it's not boot option problem
after this patch. I guess the boot option can be obsolete....

> +
> +#if !defined(CONFIG_SPARSEMEM)
> +
> +void pgdat_page_cgroup_init(struct pglist_data *pgdat)
>   {
>   	pgdat->node_page_cgroup = NULL;
>   }
> @@ -42,20 +81,16 @@ struct page_cgroup *lookup_page_cgroup(struct page *page)
>   	return base + offset;
>   }
>   
> -static int __init alloc_node_page_cgroup(int nid)
> +static int alloc_node_page_cgroup(int nid)
>   {
>   	struct page_cgroup *base;
>   	unsigned long table_size;
> -	unsigned long nr_pages;
>   
> -	nr_pages = NODE_DATA(nid)->node_spanned_pages;
> -	if (!nr_pages)
> +	table_size = page_cgroup_table_size(nid);
> +	if (!table_size)
>   		return 0;
>   
> -	table_size = sizeof(struct page_cgroup) * nr_pages;
> -
> -	base = __alloc_bootmem_node_nopanic(NODE_DATA(nid),
> -			table_size, PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
> +	base = alloc_page_cgroup(table_size, nid);
>   	if (!base)
>   		return -ENOMEM;
>   	NODE_DATA(nid)->node_page_cgroup = base;
> @@ -63,27 +98,29 @@ static int __init alloc_node_page_cgroup(int nid)
>   	return 0;
>   }
>   
> -void __init page_cgroup_init_flatmem(void)
> +int page_cgroup_init(void)
>   {
> +	int nid, fail, tmpnid;
>   
> -	int nid, fail;
> -
> -	if (mem_cgroup_subsys_disabled())
> -		return;
> +	/* only initialize it once */
> +	if (test_and_set_bit(0, &page_cgroup_initialized))
> +		return 0;
>   
>   	for_each_online_node(nid)  {
>   		fail = alloc_node_page_cgroup(nid);
>   		if (fail)
>   			goto fail;
>   	}
> -	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
> -	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you"
> -	" don't want memory cgroups\n");
> -	return;
> +	page_cgroup_msg();
> +	return 0;
>   fail:
> -	printk(KERN_CRIT "allocation of page_cgroup failed.\n");
> -	printk(KERN_CRIT "please try 'cgroup_disable=memory' boot option\n");
> -	panic("Out of memory");
> +	for_each_online_node(tmpnid)  {
> +		if (tmpnid >= nid)
> +			break;
> +		free_page_cgroup(NODE_DATA(tmpnid)->node_page_cgroup);
> +	}
> +
> +	return -ENOMEM;
>   }
>   
>   #else /* CONFIG_FLAT_NODE_MEM_MAP */
> @@ -105,26 +142,7 @@ struct page_cgroup *lookup_page_cgroup(struct page *page)
>   	return section->page_cgroup + pfn;
>   }
>   
> -static void *__meminit alloc_page_cgroup(size_t size, int nid)
> -{
> -	gfp_t flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN;
> -	void *addr = NULL;
> -
> -	addr = alloc_pages_exact_nid(nid, size, flags);
> -	if (addr) {
> -		kmemleak_alloc(addr, size, 1, flags);
> -		return addr;
> -	}
> -
> -	if (node_state(nid, N_HIGH_MEMORY))
> -		addr = vzalloc_node(size, nid);
> -	else
> -		addr = vzalloc(size);
> -
> -	return addr;
> -}
> -
> -static int __meminit init_section_page_cgroup(unsigned long pfn, int nid)
> +static int init_section_page_cgroup(unsigned long pfn, int nid)
>   {
>   	struct mem_section *section;
>   	struct page_cgroup *base;
> @@ -135,7 +153,7 @@ static int __meminit init_section_page_cgroup(unsigned long pfn, int nid)
>   	if (section->page_cgroup)
>   		return 0;
>   
> -	table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
> +	table_size = page_cgroup_table_size(nid);
>   	base = alloc_page_cgroup(table_size, nid);
>   
>   	/*
> @@ -159,20 +177,6 @@ static int __meminit init_section_page_cgroup(unsigned long pfn, int nid)
>   	total_usage += table_size;
>   	return 0;
>   }
> -#ifdef CONFIG_MEMORY_HOTPLUG
> -static void free_page_cgroup(void *addr)
> -{
> -	if (is_vmalloc_addr(addr)) {
> -		vfree(addr);
> -	} else {
> -		struct page *page = virt_to_page(addr);
> -		size_t table_size =
> -			sizeof(struct page_cgroup) * PAGES_PER_SECTION;
> -
> -		BUG_ON(PageReserved(page));
> -		free_pages_exact(addr, table_size);
> -	}
> -}
>   
>   void __free_page_cgroup(unsigned long pfn)
>   {
> @@ -187,6 +191,7 @@ void __free_page_cgroup(unsigned long pfn)
>   	ms->page_cgroup = NULL;
>   }
>   
> +#ifdef CONFIG_MEMORY_HOTPLUG
>   int __meminit online_page_cgroup(unsigned long start_pfn,
>   			unsigned long nr_pages,
>   			int nid)
> @@ -266,16 +271,16 @@ static int __meminit page_cgroup_callback(struct notifier_block *self,
>   
>   #endif
>   
> -void __init page_cgroup_init(void)
> +int page_cgroup_init(void)
>   {
>   	unsigned long pfn;
> -	int nid;
> +	unsigned long start_pfn, end_pfn;
> +	int nid, tmpnid;
>   
> -	if (mem_cgroup_subsys_disabled())
> -		return;
> +	if (test_and_set_bit(0, &page_cgroup_initialized))
> +		return 0;
>   
>   	for_each_node_state(nid, N_MEMORY) {
> -		unsigned long start_pfn, end_pfn;
>   
>   		start_pfn = node_start_pfn(nid);
>   		end_pfn = node_end_pfn(nid);
> @@ -303,16 +308,23 @@ void __init page_cgroup_init(void)
>   		}
>   	}
>   	hotplug_memory_notifier(page_cgroup_callback, 0);
> -	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
> -	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you "
> -			 "don't want memory cgroups\n");
> -	return;
> +	page_cgroup_msg();
> +	return 0;
>   oom:
> -	printk(KERN_CRIT "try 'cgroup_disable=memory' boot option\n");
> -	panic("Out of memory");
> +	for_each_node_state(tmpnid, N_MEMORY) {
> +		if (tmpnid >= nid)
> +			break;
> +
> +		start_pfn = node_start_pfn(tmpnid);
> +		end_pfn = node_end_pfn(tmpnid);
> +
> +		for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION)
> +			__free_page_cgroup(pfn);
> +	}
> +	return -ENOMEM;
>   }
>   
> -void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
> +void pgdat_page_cgroup_init(struct pglist_data *pgdat)
>   {
>   	return;
>   }
> 


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 4/5] memcg: do not call page_cgroup_init at system_boot
@ 2013-03-06  1:07     ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 72+ messages in thread
From: Kamezawa Hiroyuki @ 2013-03-06  1:07 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Tejun Heo, Andrew Morton, Michal Hocko,
	handai.szj-Re5JQEeQqe8AvxtiuMwx3w,
	anton.vorontsov-QSEj5FYQhm4dnm+yROfE0A, Johannes Weiner,
	Mel Gorman

(2013/03/05 22:10), Glauber Costa wrote:
> If we are not using memcg, there is no reason why we should allocate
> this structure, that will be a memory waste at best. We can do better
> at least in the sparsemem case, and allocate it when the first cgroup
> is requested. It should now not panic on failure, and we have to handle
> this right.
> 
> flatmem case is a bit more complicated, so that one is left out for
> the moment.
> 
> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> CC: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> CC: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
> CC: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
> ---
>   include/linux/page_cgroup.h |  28 +++++----
>   init/main.c                 |   2 -
>   mm/memcontrol.c             |   3 +-
>   mm/page_cgroup.c            | 150 ++++++++++++++++++++++++--------------------
>   4 files changed, 99 insertions(+), 84 deletions(-)

This patch seems a complicated mixture of clean-up and what-you-really-want.

> 
> diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
> index 777a524..ec9fb05 100644
> --- a/include/linux/page_cgroup.h
> +++ b/include/linux/page_cgroup.h
> @@ -14,6 +14,7 @@ enum {
>   
>   #ifdef CONFIG_MEMCG
>   #include <linux/bit_spinlock.h>
> +#include <linux/mmzone.h>
>   
>   /*
>    * Page Cgroup can be considered as an extended mem_map.
> @@ -27,19 +28,17 @@ struct page_cgroup {
>   	struct mem_cgroup *mem_cgroup;
>   };
>   
> -void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat);
> -
> -#ifdef CONFIG_SPARSEMEM
> -static inline void __init page_cgroup_init_flatmem(void)
> +static inline size_t page_cgroup_table_size(int nid)
>   {
> -}
> -extern void __init page_cgroup_init(void);
> +#ifdef CONFIG_SPARSEMEM
> +	return sizeof(struct page_cgroup) * PAGES_PER_SECTION;
>   #else
> -void __init page_cgroup_init_flatmem(void);
> -static inline void __init page_cgroup_init(void)
> -{
> -}
> +	return sizeof(struct page_cgroup) * NODE_DATA(nid)->node_spanned_pages;
>   #endif
> +}
> +void pgdat_page_cgroup_init(struct pglist_data *pgdat);
> +
> +extern int page_cgroup_init(void);
>   
>   struct page_cgroup *lookup_page_cgroup(struct page *page);
>   struct page *lookup_cgroup_page(struct page_cgroup *pc);
> @@ -85,7 +84,7 @@ static inline void unlock_page_cgroup(struct page_cgroup *pc)
>   #else /* CONFIG_MEMCG */
>   struct page_cgroup;
>   
> -static inline void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
> +static inline void pgdat_page_cgroup_init(struct pglist_data *pgdat)
>   {
>   }
>   
> @@ -94,7 +93,12 @@ static inline struct page_cgroup *lookup_page_cgroup(struct page *page)
>   	return NULL;
>   }
>   
> -static inline void page_cgroup_init(void)
> +static inline int page_cgroup_init(void)
> +{
> +	return 0;
> +}
> +
> +static inline void page_cgroup_destroy(void)
>   {
>   }
>   
> diff --git a/init/main.c b/init/main.c
> index cee4b5c..1fb3ec0 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -457,7 +457,6 @@ static void __init mm_init(void)
>   	 * page_cgroup requires contiguous pages,
>   	 * bigger than MAX_ORDER unless SPARSEMEM.
>   	 */
> -	page_cgroup_init_flatmem();
>   	mem_init();
>   	kmem_cache_init();
>   	percpu_init_late();
> @@ -592,7 +591,6 @@ asmlinkage void __init start_kernel(void)
>   		initrd_start = 0;
>   	}
>   #endif
> -	page_cgroup_init();
>   	debug_objects_mem_init();
>   	kmemleak_init();
>   	setup_per_cpu_pageset();
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 45c1886..6019a32 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6377,7 +6377,8 @@ mem_cgroup_css_alloc(struct cgroup *cont)
>   		res_counter_init(&memcg->res, NULL);
>   		res_counter_init(&memcg->memsw, NULL);
>   		res_counter_init(&memcg->kmem, NULL);
> -	}
> +	} else if (page_cgroup_init())
> +		goto free_out;
>   
>   	memcg->last_scanned_node = MAX_NUMNODES;
>   	INIT_LIST_HEAD(&memcg->oom_notify);
> diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
> index a5bd322..6d04c28 100644
> --- a/mm/page_cgroup.c
> +++ b/mm/page_cgroup.c
> @@ -12,11 +12,50 @@
>   #include <linux/kmemleak.h>
>   
>   static unsigned long total_usage;
> +static unsigned long page_cgroup_initialized;
>   
> -#if !defined(CONFIG_SPARSEMEM)
> +static void *alloc_page_cgroup(size_t size, int nid)
> +{
> +	gfp_t flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN;
> +	void *addr = NULL;
> +
> +	addr = alloc_pages_exact_nid(nid, size, flags);
> +	if (addr) {
> +		kmemleak_alloc(addr, size, 1, flags);
> +		return addr;
> +	}

As far as I remember, this function was written for SPARSEMEM.

How big this "size" will be with FLATMEM/DISCONTIGMEM ?
if 16GB, 16 * 1024 * 1024 * 1024 / 4096 * 16 = 64MB. 

What happens if order > MAX_ORDER is passed to alloc_pages()...no warning ?

How about using vmalloc always if not SPARSEMEM ?

> +
> +	if (node_state(nid, N_HIGH_MEMORY))
> +		addr = vzalloc_node(size, nid);
> +	else
> +		addr = vzalloc(size);
>   
> +	return addr;
> +}


>   
> -void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
> +static void free_page_cgroup(void *addr)
> +{
> +	if (is_vmalloc_addr(addr)) {
> +		vfree(addr);
> +	} else {
> +		struct page *page = virt_to_page(addr);
> +		int nid = page_to_nid(page);
> +		BUG_ON(PageReserved(page));

This BUG_ON() can be removed.

> +		free_pages_exact(addr, page_cgroup_table_size(nid));
> +	}
> +}
> +
> +static void page_cgroup_msg(void)
> +{
> +	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
> +	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you "
> +			 "don't want memory cgroups.\nAlternatively, consider "
> +			 "deferring your memory cgroups creation.\n");
> +}

I think this warning can be removed because it's not boot option problem
after this patch. I guess the boot option can be obsolete....

> +
> +#if !defined(CONFIG_SPARSEMEM)
> +
> +void pgdat_page_cgroup_init(struct pglist_data *pgdat)
>   {
>   	pgdat->node_page_cgroup = NULL;
>   }
> @@ -42,20 +81,16 @@ struct page_cgroup *lookup_page_cgroup(struct page *page)
>   	return base + offset;
>   }
>   
> -static int __init alloc_node_page_cgroup(int nid)
> +static int alloc_node_page_cgroup(int nid)
>   {
>   	struct page_cgroup *base;
>   	unsigned long table_size;
> -	unsigned long nr_pages;
>   
> -	nr_pages = NODE_DATA(nid)->node_spanned_pages;
> -	if (!nr_pages)
> +	table_size = page_cgroup_table_size(nid);
> +	if (!table_size)
>   		return 0;
>   
> -	table_size = sizeof(struct page_cgroup) * nr_pages;
> -
> -	base = __alloc_bootmem_node_nopanic(NODE_DATA(nid),
> -			table_size, PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
> +	base = alloc_page_cgroup(table_size, nid);
>   	if (!base)
>   		return -ENOMEM;
>   	NODE_DATA(nid)->node_page_cgroup = base;
> @@ -63,27 +98,29 @@ static int __init alloc_node_page_cgroup(int nid)
>   	return 0;
>   }
>   
> -void __init page_cgroup_init_flatmem(void)
> +int page_cgroup_init(void)
>   {
> +	int nid, fail, tmpnid;
>   
> -	int nid, fail;
> -
> -	if (mem_cgroup_subsys_disabled())
> -		return;
> +	/* only initialize it once */
> +	if (test_and_set_bit(0, &page_cgroup_initialized))
> +		return 0;
>   
>   	for_each_online_node(nid)  {
>   		fail = alloc_node_page_cgroup(nid);
>   		if (fail)
>   			goto fail;
>   	}
> -	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
> -	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you"
> -	" don't want memory cgroups\n");
> -	return;
> +	page_cgroup_msg();
> +	return 0;
>   fail:
> -	printk(KERN_CRIT "allocation of page_cgroup failed.\n");
> -	printk(KERN_CRIT "please try 'cgroup_disable=memory' boot option\n");
> -	panic("Out of memory");
> +	for_each_online_node(tmpnid)  {
> +		if (tmpnid >= nid)
> +			break;
> +		free_page_cgroup(NODE_DATA(tmpnid)->node_page_cgroup);
> +	}
> +
> +	return -ENOMEM;
>   }
>   
>   #else /* CONFIG_FLAT_NODE_MEM_MAP */
> @@ -105,26 +142,7 @@ struct page_cgroup *lookup_page_cgroup(struct page *page)
>   	return section->page_cgroup + pfn;
>   }
>   
> -static void *__meminit alloc_page_cgroup(size_t size, int nid)
> -{
> -	gfp_t flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN;
> -	void *addr = NULL;
> -
> -	addr = alloc_pages_exact_nid(nid, size, flags);
> -	if (addr) {
> -		kmemleak_alloc(addr, size, 1, flags);
> -		return addr;
> -	}
> -
> -	if (node_state(nid, N_HIGH_MEMORY))
> -		addr = vzalloc_node(size, nid);
> -	else
> -		addr = vzalloc(size);
> -
> -	return addr;
> -}
> -
> -static int __meminit init_section_page_cgroup(unsigned long pfn, int nid)
> +static int init_section_page_cgroup(unsigned long pfn, int nid)
>   {
>   	struct mem_section *section;
>   	struct page_cgroup *base;
> @@ -135,7 +153,7 @@ static int __meminit init_section_page_cgroup(unsigned long pfn, int nid)
>   	if (section->page_cgroup)
>   		return 0;
>   
> -	table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
> +	table_size = page_cgroup_table_size(nid);
>   	base = alloc_page_cgroup(table_size, nid);
>   
>   	/*
> @@ -159,20 +177,6 @@ static int __meminit init_section_page_cgroup(unsigned long pfn, int nid)
>   	total_usage += table_size;
>   	return 0;
>   }
> -#ifdef CONFIG_MEMORY_HOTPLUG
> -static void free_page_cgroup(void *addr)
> -{
> -	if (is_vmalloc_addr(addr)) {
> -		vfree(addr);
> -	} else {
> -		struct page *page = virt_to_page(addr);
> -		size_t table_size =
> -			sizeof(struct page_cgroup) * PAGES_PER_SECTION;
> -
> -		BUG_ON(PageReserved(page));
> -		free_pages_exact(addr, table_size);
> -	}
> -}
>   
>   void __free_page_cgroup(unsigned long pfn)
>   {
> @@ -187,6 +191,7 @@ void __free_page_cgroup(unsigned long pfn)
>   	ms->page_cgroup = NULL;
>   }
>   
> +#ifdef CONFIG_MEMORY_HOTPLUG
>   int __meminit online_page_cgroup(unsigned long start_pfn,
>   			unsigned long nr_pages,
>   			int nid)
> @@ -266,16 +271,16 @@ static int __meminit page_cgroup_callback(struct notifier_block *self,
>   
>   #endif
>   
> -void __init page_cgroup_init(void)
> +int page_cgroup_init(void)
>   {
>   	unsigned long pfn;
> -	int nid;
> +	unsigned long start_pfn, end_pfn;
> +	int nid, tmpnid;
>   
> -	if (mem_cgroup_subsys_disabled())
> -		return;
> +	if (test_and_set_bit(0, &page_cgroup_initialized))
> +		return 0;
>   
>   	for_each_node_state(nid, N_MEMORY) {
> -		unsigned long start_pfn, end_pfn;
>   
>   		start_pfn = node_start_pfn(nid);
>   		end_pfn = node_end_pfn(nid);
> @@ -303,16 +308,23 @@ void __init page_cgroup_init(void)
>   		}
>   	}
>   	hotplug_memory_notifier(page_cgroup_callback, 0);
> -	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
> -	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you "
> -			 "don't want memory cgroups\n");
> -	return;
> +	page_cgroup_msg();
> +	return 0;
>   oom:
> -	printk(KERN_CRIT "try 'cgroup_disable=memory' boot option\n");
> -	panic("Out of memory");
> +	for_each_node_state(tmpnid, N_MEMORY) {
> +		if (tmpnid >= nid)
> +			break;
> +
> +		start_pfn = node_start_pfn(tmpnid);
> +		end_pfn = node_end_pfn(tmpnid);
> +
> +		for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION)
> +			__free_page_cgroup(pfn);
> +	}
> +	return -ENOMEM;
>   }
>   
> -void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
> +void pgdat_page_cgroup_init(struct pglist_data *pgdat)
>   {
>   	return;
>   }
> 


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 5/5] memcg: do not walk all the way to the root for memcg
@ 2013-03-06  1:08     ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 72+ messages in thread
From: Kamezawa Hiroyuki @ 2013-03-06  1:08 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, Tejun Heo, Andrew Morton, Michal Hocko,
	handai.szj, anton.vorontsov, Johannes Weiner, Mel Gorman

(2013/03/05 22:10), Glauber Costa wrote:
> Since the root is special anyway, and we always get its figures from
> global counters anyway, there is no make all cgroups its descendants,
> wrt res_counters. The sad effect of doing that is that we need to lock
> the root for all allocations, since it is a common ancestor of
> everybody.
> 
> Not having the root as a common ancestor should lead to better
> scalability for not-uncommon case of tasks in the cgroup being
> node-bound to different nodes in NUMA systems.
> 
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: Michal Hocko <mhocko@suse.cz>
> CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Johannes Weiner <hannes@cmpxchg.org>
> CC: Mel Gorman <mgorman@suse.de>
> CC: Andrew Morton <akpm@linux-foundation.org>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

> ---
>   mm/memcontrol.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6019a32..252dc00 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6464,7 +6464,7 @@ mem_cgroup_css_online(struct cgroup *cont)
>   	memcg->oom_kill_disable = parent->oom_kill_disable;
>   	memcg->swappiness = mem_cgroup_swappiness(parent);
>   
> -	if (parent->use_hierarchy) {
> +	if (parent && !mem_cgroup_is_root(parent) && parent->use_hierarchy) {
>   		res_counter_init(&memcg->res, &parent->res);
>   		res_counter_init(&memcg->memsw, &parent->memsw);
>   		res_counter_init(&memcg->kmem, &parent->kmem);
> 


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 5/5] memcg: do not walk all the way to the root for memcg
@ 2013-03-06  1:08     ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 72+ messages in thread
From: Kamezawa Hiroyuki @ 2013-03-06  1:08 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Tejun Heo, Andrew Morton, Michal Hocko,
	handai.szj-Re5JQEeQqe8AvxtiuMwx3w,
	anton.vorontsov-QSEj5FYQhm4dnm+yROfE0A, Johannes Weiner,
	Mel Gorman

(2013/03/05 22:10), Glauber Costa wrote:
> Since the root is special anyway, and we always get its figures from
> global counters anyway, there is no make all cgroups its descendants,
> wrt res_counters. The sad effect of doing that is that we need to lock
> the root for all allocations, since it is a common ancestor of
> everybody.
> 
> Not having the root as a common ancestor should lead to better
> scalability for not-uncommon case of tasks in the cgroup being
> node-bound to different nodes in NUMA systems.
> 
> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> CC: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> CC: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
> CC: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>

> ---
>   mm/memcontrol.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6019a32..252dc00 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6464,7 +6464,7 @@ mem_cgroup_css_online(struct cgroup *cont)
>   	memcg->oom_kill_disable = parent->oom_kill_disable;
>   	memcg->swappiness = mem_cgroup_swappiness(parent);
>   
> -	if (parent->use_hierarchy) {
> +	if (parent && !mem_cgroup_is_root(parent) && parent->use_hierarchy) {
>   		res_counter_init(&memcg->res, &parent->res);
>   		res_counter_init(&memcg->memsw, &parent->memsw);
>   		res_counter_init(&memcg->kmem, &parent->kmem);
> 


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 4/5] memcg: do not call page_cgroup_init at system_boot
@ 2013-03-06  8:22       ` Glauber Costa
  0 siblings, 0 replies; 72+ messages in thread
From: Glauber Costa @ 2013-03-06  8:22 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-mm, cgroups, Tejun Heo, Andrew Morton, Michal Hocko,
	handai.szj, anton.vorontsov, Johannes Weiner, Mel Gorman

On 03/06/2013 05:07 AM, Kamezawa Hiroyuki wrote:
> (2013/03/05 22:10), Glauber Costa wrote:
>> If we are not using memcg, there is no reason why we should allocate
>> this structure, that will be a memory waste at best. We can do better
>> at least in the sparsemem case, and allocate it when the first cgroup
>> is requested. It should now not panic on failure, and we have to handle
>> this right.
>>
>> flatmem case is a bit more complicated, so that one is left out for
>> the moment.
>>
>> Signed-off-by: Glauber Costa <glommer@parallels.com>
>> CC: Michal Hocko <mhocko@suse.cz>
>> CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> CC: Johannes Weiner <hannes@cmpxchg.org>
>> CC: Mel Gorman <mgorman@suse.de>
>> CC: Andrew Morton <akpm@linux-foundation.org>
>> ---
>>   include/linux/page_cgroup.h |  28 +++++----
>>   init/main.c                 |   2 -
>>   mm/memcontrol.c             |   3 +-
>>   mm/page_cgroup.c            | 150 ++++++++++++++++++++++++--------------------
>>   4 files changed, 99 insertions(+), 84 deletions(-)
> 
> This patch seems a complicated mixture of clean-up and what-you-really-want.
> 
I swear it is all what-I-really-want, any cleanups are non-intentional!

>> -#if !defined(CONFIG_SPARSEMEM)
>> +static void *alloc_page_cgroup(size_t size, int nid)
>> +{
>> +	gfp_t flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN;
>> +	void *addr = NULL;
>> +
>> +	addr = alloc_pages_exact_nid(nid, size, flags);
>> +	if (addr) {
>> +		kmemleak_alloc(addr, size, 1, flags);
>> +		return addr;
>> +	}
> 
> As far as I remember, this function was written for SPARSEMEM.
> 
> How big this "size" will be with FLATMEM/DISCONTIGMEM ?
> if 16GB, 16 * 1024 * 1024 * 1024 / 4096 * 16 = 64MB. 
> 
> What happens if order > MAX_ORDER is passed to alloc_pages()...no warning ?
> 
> How about using vmalloc always if not SPARSEMEM ?

I don't oppose.

>>   
>> -void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
>> +static void free_page_cgroup(void *addr)
>> +{
>> +	if (is_vmalloc_addr(addr)) {
>> +		vfree(addr);
>> +	} else {
>> +		struct page *page = virt_to_page(addr);
>> +		int nid = page_to_nid(page);
>> +		BUG_ON(PageReserved(page));
> 
> This BUG_ON() can be removed.
> 

You are right, although it is still a bug =)

>> +		free_pages_exact(addr, page_cgroup_table_size(nid));
>> +	}
>> +}
>> +
>> +static void page_cgroup_msg(void)
>> +{
>> +	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
>> +	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you "
>> +			 "don't want memory cgroups.\nAlternatively, consider "
>> +			 "deferring your memory cgroups creation.\n");
>> +}
> 
> I think this warning can be removed because it's not boot option problem
> after this patch. I guess the boot option can be obsolete....
> 

I think it is extremely useful, at least during the next couple of
releases. A lot of distributions will create memcgs for no apparent
reasons way before they are used (if used at all), as a placeholder only.

This can at least tell them that there is a way to stop paying a memory
penalty (together with the actual memory footprint)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 4/5] memcg: do not call page_cgroup_init at system_boot
@ 2013-03-06  8:22       ` Glauber Costa
  0 siblings, 0 replies; 72+ messages in thread
From: Glauber Costa @ 2013-03-06  8:22 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Tejun Heo, Andrew Morton, Michal Hocko,
	handai.szj-Re5JQEeQqe8AvxtiuMwx3w,
	anton.vorontsov-QSEj5FYQhm4dnm+yROfE0A, Johannes Weiner,
	Mel Gorman

On 03/06/2013 05:07 AM, Kamezawa Hiroyuki wrote:
> (2013/03/05 22:10), Glauber Costa wrote:
>> If we are not using memcg, there is no reason why we should allocate
>> this structure, that will be a memory waste at best. We can do better
>> at least in the sparsemem case, and allocate it when the first cgroup
>> is requested. It should now not panic on failure, and we have to handle
>> this right.
>>
>> flatmem case is a bit more complicated, so that one is left out for
>> the moment.
>>
>> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
>> CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
>> CC: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
>> CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
>> CC: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
>> CC: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
>> ---
>>   include/linux/page_cgroup.h |  28 +++++----
>>   init/main.c                 |   2 -
>>   mm/memcontrol.c             |   3 +-
>>   mm/page_cgroup.c            | 150 ++++++++++++++++++++++++--------------------
>>   4 files changed, 99 insertions(+), 84 deletions(-)
> 
> This patch seems a complicated mixture of clean-up and what-you-really-want.
> 
I swear it is all what-I-really-want, any cleanups are non-intentional!

>> -#if !defined(CONFIG_SPARSEMEM)
>> +static void *alloc_page_cgroup(size_t size, int nid)
>> +{
>> +	gfp_t flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN;
>> +	void *addr = NULL;
>> +
>> +	addr = alloc_pages_exact_nid(nid, size, flags);
>> +	if (addr) {
>> +		kmemleak_alloc(addr, size, 1, flags);
>> +		return addr;
>> +	}
> 
> As far as I remember, this function was written for SPARSEMEM.
> 
> How big this "size" will be with FLATMEM/DISCONTIGMEM ?
> if 16GB, 16 * 1024 * 1024 * 1024 / 4096 * 16 = 64MB. 
> 
> What happens if order > MAX_ORDER is passed to alloc_pages()...no warning ?
> 
> How about using vmalloc always if not SPARSEMEM ?

I don't oppose.

>>   
>> -void __meminit pgdat_page_cgroup_init(struct pglist_data *pgdat)
>> +static void free_page_cgroup(void *addr)
>> +{
>> +	if (is_vmalloc_addr(addr)) {
>> +		vfree(addr);
>> +	} else {
>> +		struct page *page = virt_to_page(addr);
>> +		int nid = page_to_nid(page);
>> +		BUG_ON(PageReserved(page));
> 
> This BUG_ON() can be removed.
> 

You are right, although it is still a bug =)

>> +		free_pages_exact(addr, page_cgroup_table_size(nid));
>> +	}
>> +}
>> +
>> +static void page_cgroup_msg(void)
>> +{
>> +	printk(KERN_INFO "allocated %ld bytes of page_cgroup\n", total_usage);
>> +	printk(KERN_INFO "please try 'cgroup_disable=memory' option if you "
>> +			 "don't want memory cgroups.\nAlternatively, consider "
>> +			 "deferring your memory cgroups creation.\n");
>> +}
> 
> I think this warning can be removed because it's not boot option problem
> after this patch. I guess the boot option can be obsolete....
> 

I think it is extremely useful, at least during the next couple of
releases. A lot of distributions will create memcgs for no apparent
reasons way before they are used (if used at all), as a placeholder only.

This can at least tell them that there is a way to stop paying a memory
penalty (together with the actual memory footprint)

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/5] memcg: provide root figures from system totals
@ 2013-03-06  8:30       ` Glauber Costa
  0 siblings, 0 replies; 72+ messages in thread
From: Glauber Costa @ 2013-03-06  8:30 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-mm, cgroups, Tejun Heo, Andrew Morton, Michal Hocko,
	handai.szj, anton.vorontsov, Johannes Weiner, Mel Gorman

On 03/06/2013 04:27 AM, Kamezawa Hiroyuki wrote:
> (2013/03/05 22:10), Glauber Costa wrote:
>> For the root memcg, there is no need to rely on the res_counters if hierarchy
>> is enabled The sum of all mem cgroups plus the tasks in root itself, is
>> necessarily the amount of memory used for the whole system. Since those figures
>> are already kept somewhere anyway, we can just return them here, without too
>> much hassle.
>>
>> Limit and soft limit can't be set for the root cgroup, so they are left at
>> RESOURCE_MAX. Failcnt is left at 0, because its actual meaning is how many
>> times we failed allocations due to the limit being hit. We will fail
>> allocations in the root cgroup, but the limit will never the reason.
>>
>> Signed-off-by: Glauber Costa <glommer@parallels.com>
>> CC: Michal Hocko <mhocko@suse.cz>
>> CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> CC: Johannes Weiner <hannes@cmpxchg.org>
>> CC: Mel Gorman <mgorman@suse.de>
>> CC: Andrew Morton <akpm@linux-foundation.org>
> 
> I think this patch's calculation is wrong.
> 
where exactly ?

>> ---
>>   mm/memcontrol.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 64 insertions(+)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index b8b363f..bfbf1c2 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -4996,6 +4996,56 @@ static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
>>   	return val << PAGE_SHIFT;
>>   }
>>   
>> +static u64 memcg_read_root_rss(void)
>> +{
>> +	struct task_struct *p;
>> +
>> +	u64 rss = 0;
>> +	read_lock(&tasklist_lock);
>> +	for_each_process(p) {
>> +		if (!p->mm)
>> +			continue;
>> +		task_lock(p);
>> +		rss += get_mm_rss(p->mm);
>> +		task_unlock(p);
>> +	}
>> +	read_unlock(&tasklist_lock);
>> +	return rss;
>> +}
> 
> I think you can use rcu_read_lock() instead of tasklist_lock.
> Isn't it enough to use NR_ANON_LRU rather than this ?

Is it really just ANON_LRU ? get_mm_rss also include filepages, which
are not in this list.

Maybe if we sum up *all* LRUs we would get the right result ?

About the tasklist lock, if I get values from the LRUs, maybe. Otherwise
it is still necessary, no ?

> 
>> +
>> +static u64 mem_cgroup_read_root(enum res_type type, int name)
>> +{
>> +	if (name == RES_LIMIT)
>> +		return RESOURCE_MAX;
>> +	if (name == RES_SOFT_LIMIT)
>> +		return RESOURCE_MAX;
>> +	if (name == RES_FAILCNT)
>> +		return 0;
>> +	if (name == RES_MAX_USAGE)
>> +		return 0;
>> +
>> +	if (WARN_ON_ONCE(name != RES_USAGE))
>> +		return 0;
>> +
>> +	switch (type) {
>> +	case _MEM:
>> +		return (memcg_read_root_rss() +
>> +		atomic_long_read(&vm_stat[NR_FILE_PAGES])) << PAGE_SHIFT;
>> +	case _MEMSWAP: {
>> +		struct sysinfo i;
>> +		si_swapinfo(&i);
>> +
>> +		return ((memcg_read_root_rss() +
>> +		atomic_long_read(&vm_stat[NR_FILE_PAGES])) << PAGE_SHIFT) +
>> +		i.totalswap - i.freeswap;
> 
> How swapcache is handled ? ...and How kmem works with this calc ?
> 
I am ignoring kmem, because we don't account kmem for the root cgroup
anyway.

Setting the limit is invalid, and we don't account until the limit is
set. Then it will be 0, always.

For swapcache, I am hoping that totalswap - freeswap will cover
everything swap related. If you think I am wrong, please enlighten me.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/5] memcg: provide root figures from system totals
@ 2013-03-06  8:30       ` Glauber Costa
  0 siblings, 0 replies; 72+ messages in thread
From: Glauber Costa @ 2013-03-06  8:30 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Tejun Heo, Andrew Morton, Michal Hocko,
	handai.szj-Re5JQEeQqe8AvxtiuMwx3w,
	anton.vorontsov-QSEj5FYQhm4dnm+yROfE0A, Johannes Weiner,
	Mel Gorman

On 03/06/2013 04:27 AM, Kamezawa Hiroyuki wrote:
> (2013/03/05 22:10), Glauber Costa wrote:
>> For the root memcg, there is no need to rely on the res_counters if hierarchy
>> is enabled The sum of all mem cgroups plus the tasks in root itself, is
>> necessarily the amount of memory used for the whole system. Since those figures
>> are already kept somewhere anyway, we can just return them here, without too
>> much hassle.
>>
>> Limit and soft limit can't be set for the root cgroup, so they are left at
>> RESOURCE_MAX. Failcnt is left at 0, because its actual meaning is how many
>> times we failed allocations due to the limit being hit. We will fail
>> allocations in the root cgroup, but the limit will never the reason.
>>
>> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
>> CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
>> CC: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
>> CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
>> CC: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
>> CC: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
> 
> I think this patch's calculation is wrong.
> 
where exactly ?

>> ---
>>   mm/memcontrol.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 64 insertions(+)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index b8b363f..bfbf1c2 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -4996,6 +4996,56 @@ static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
>>   	return val << PAGE_SHIFT;
>>   }
>>   
>> +static u64 memcg_read_root_rss(void)
>> +{
>> +	struct task_struct *p;
>> +
>> +	u64 rss = 0;
>> +	read_lock(&tasklist_lock);
>> +	for_each_process(p) {
>> +		if (!p->mm)
>> +			continue;
>> +		task_lock(p);
>> +		rss += get_mm_rss(p->mm);
>> +		task_unlock(p);
>> +	}
>> +	read_unlock(&tasklist_lock);
>> +	return rss;
>> +}
> 
> I think you can use rcu_read_lock() instead of tasklist_lock.
> Isn't it enough to use NR_ANON_LRU rather than this ?

Is it really just ANON_LRU ? get_mm_rss also include filepages, which
are not in this list.

Maybe if we sum up *all* LRUs we would get the right result ?

About the tasklist lock, if I get values from the LRUs, maybe. Otherwise
it is still necessary, no ?

> 
>> +
>> +static u64 mem_cgroup_read_root(enum res_type type, int name)
>> +{
>> +	if (name == RES_LIMIT)
>> +		return RESOURCE_MAX;
>> +	if (name == RES_SOFT_LIMIT)
>> +		return RESOURCE_MAX;
>> +	if (name == RES_FAILCNT)
>> +		return 0;
>> +	if (name == RES_MAX_USAGE)
>> +		return 0;
>> +
>> +	if (WARN_ON_ONCE(name != RES_USAGE))
>> +		return 0;
>> +
>> +	switch (type) {
>> +	case _MEM:
>> +		return (memcg_read_root_rss() +
>> +		atomic_long_read(&vm_stat[NR_FILE_PAGES])) << PAGE_SHIFT;
>> +	case _MEMSWAP: {
>> +		struct sysinfo i;
>> +		si_swapinfo(&i);
>> +
>> +		return ((memcg_read_root_rss() +
>> +		atomic_long_read(&vm_stat[NR_FILE_PAGES])) << PAGE_SHIFT) +
>> +		i.totalswap - i.freeswap;
> 
> How swapcache is handled ? ...and How kmem works with this calc ?
> 
I am ignoring kmem, because we don't account kmem for the root cgroup
anyway.

Setting the limit is invalid, and we don't account until the limit is
set. Then it will be 0, always.

For swapcache, I am hoping that totalswap - freeswap will cover
everything swap related. If you think I am wrong, please enlighten me.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 3/5] memcg: make it suck faster
  2013-03-06  0:46     ` Kamezawa Hiroyuki
  (?)
@ 2013-03-06  8:38     ` Glauber Costa
  2013-03-06 10:54         ` Kamezawa Hiroyuki
  -1 siblings, 1 reply; 72+ messages in thread
From: Glauber Costa @ 2013-03-06  8:38 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-mm, cgroups, Tejun Heo, Andrew Morton, Michal Hocko,
	handai.szj, anton.vorontsov, Johannes Weiner, Mel Gorman


> 
>> Signed-off-by: Glauber Costa <glommer@parallels.com>
>> CC: Michal Hocko <mhocko@suse.cz>
>> CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> CC: Johannes Weiner <hannes@cmpxchg.org>
>> CC: Mel Gorman <mgorman@suse.de>
>> CC: Andrew Morton <akpm@linux-foundation.org>
> 
> After quick look, it seems most parts are good. But I have a concern.
> 
> At memcg enablement, you move the numbers from vm_stat[] to res_counters.
> 
Not only to res_counters. Mostly to mem_cgroup_stat_cpu, but I do move
to res_counters as well.

> Why you need it ? It's not explained.

Because at this point, the bypass will no longer be in effect and we
need accurate figures in root cgroup about what happened so far.

If we always have root-level hierarchy, then the bypass could go on
forever. But if we have not, we'll need to rely on whatever was in there.

> And if it's necessary, uncharge will leak because page_cgroup is not marked
> as PCG_USED, pc->mem_cgroup == NULL. So, res.usage will not be decreased.
> 

The same problem happen when deriving an mz from a page. Since
pc->mem_cgroup will be NULL. I am interpreting that as "root mem cgroup".

Maybe even better would be to scan page cgroup writing a magic. Then if
we see that magic we are sure it is an uninitialized pc.

> Could you fix it if you need to move numbers to res_counter ?
> 

At least for the pages in LRUs, I can scan them all, and update their
page information. I am just wondering if this isn't a *very* expensive
operation. Fine that we do it once, but still, is potentially scanning
*all* pages in the system.

So I've basically decided it is better to interpret pc->mem_cgroup =
NULL as this uninitialized state. (and can change to a magic)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/5] memcg: provide root figures from system totals
@ 2013-03-06 10:45         ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 72+ messages in thread
From: Kamezawa Hiroyuki @ 2013-03-06 10:45 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, Tejun Heo, Andrew Morton, Michal Hocko,
	handai.szj, anton.vorontsov, Johannes Weiner, Mel Gorman

(2013/03/06 17:30), Glauber Costa wrote:
> On 03/06/2013 04:27 AM, Kamezawa Hiroyuki wrote:
>> (2013/03/05 22:10), Glauber Costa wrote:
>>> +	case _MEMSWAP: {
>>> +		struct sysinfo i;
>>> +		si_swapinfo(&i);
>>> +
>>> +		return ((memcg_read_root_rss() +
>>> +		atomic_long_read(&vm_stat[NR_FILE_PAGES])) << PAGE_SHIFT) +
>>> +		i.totalswap - i.freeswap;
>>
>> How swapcache is handled ? ...and How kmem works with this calc ?
>>
> I am ignoring kmem, because we don't account kmem for the root cgroup
> anyway.
> 
> Setting the limit is invalid, and we don't account until the limit is
> set. Then it will be 0, always.
> 
> For swapcache, I am hoping that totalswap - freeswap will cover
> everything swap related. If you think I am wrong, please enlighten me.
> 

i.totalswap - i.freeswap = # of used swap entries.

SwapCache can be rss and used swap entry at the same time. 


Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/5] memcg: provide root figures from system totals
@ 2013-03-06 10:45         ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 72+ messages in thread
From: Kamezawa Hiroyuki @ 2013-03-06 10:45 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Tejun Heo, Andrew Morton, Michal Hocko,
	handai.szj-Re5JQEeQqe8AvxtiuMwx3w,
	anton.vorontsov-QSEj5FYQhm4dnm+yROfE0A, Johannes Weiner,
	Mel Gorman

(2013/03/06 17:30), Glauber Costa wrote:
> On 03/06/2013 04:27 AM, Kamezawa Hiroyuki wrote:
>> (2013/03/05 22:10), Glauber Costa wrote:
>>> +	case _MEMSWAP: {
>>> +		struct sysinfo i;
>>> +		si_swapinfo(&i);
>>> +
>>> +		return ((memcg_read_root_rss() +
>>> +		atomic_long_read(&vm_stat[NR_FILE_PAGES])) << PAGE_SHIFT) +
>>> +		i.totalswap - i.freeswap;
>>
>> How swapcache is handled ? ...and How kmem works with this calc ?
>>
> I am ignoring kmem, because we don't account kmem for the root cgroup
> anyway.
> 
> Setting the limit is invalid, and we don't account until the limit is
> set. Then it will be 0, always.
> 
> For swapcache, I am hoping that totalswap - freeswap will cover
> everything swap related. If you think I am wrong, please enlighten me.
> 

i.totalswap - i.freeswap = # of used swap entries.

SwapCache can be rss and used swap entry at the same time. 


Thanks,
-Kame

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/5] memcg: provide root figures from system totals
@ 2013-03-06 10:50         ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 72+ messages in thread
From: Kamezawa Hiroyuki @ 2013-03-06 10:50 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, Tejun Heo, Andrew Morton, Michal Hocko,
	handai.szj, anton.vorontsov, Johannes Weiner, Mel Gorman

(2013/03/06 17:30), Glauber Costa wrote:
> On 03/06/2013 04:27 AM, Kamezawa Hiroyuki wrote:
>> (2013/03/05 22:10), Glauber Costa wrote:
>>> For the root memcg, there is no need to rely on the res_counters if hierarchy
>>> is enabled The sum of all mem cgroups plus the tasks in root itself, is
>>> necessarily the amount of memory used for the whole system. Since those figures
>>> are already kept somewhere anyway, we can just return them here, without too
>>> much hassle.
>>>
>>> Limit and soft limit can't be set for the root cgroup, so they are left at
>>> RESOURCE_MAX. Failcnt is left at 0, because its actual meaning is how many
>>> times we failed allocations due to the limit being hit. We will fail
>>> allocations in the root cgroup, but the limit will never the reason.
>>>
>>> Signed-off-by: Glauber Costa <glommer@parallels.com>
>>> CC: Michal Hocko <mhocko@suse.cz>
>>> CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>> CC: Johannes Weiner <hannes@cmpxchg.org>
>>> CC: Mel Gorman <mgorman@suse.de>
>>> CC: Andrew Morton <akpm@linux-foundation.org>
>>
>> I think this patch's calculation is wrong.
>>
> where exactly ?
> 
>>> ---
>>>    mm/memcontrol.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>    1 file changed, 64 insertions(+)
>>>
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index b8b363f..bfbf1c2 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -4996,6 +4996,56 @@ static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
>>>    	return val << PAGE_SHIFT;
>>>    }
>>>    
>>> +static u64 memcg_read_root_rss(void)
>>> +{
>>> +	struct task_struct *p;
>>> +
>>> +	u64 rss = 0;
>>> +	read_lock(&tasklist_lock);
>>> +	for_each_process(p) {
>>> +		if (!p->mm)
>>> +			continue;
>>> +		task_lock(p);
>>> +		rss += get_mm_rss(p->mm);
>>> +		task_unlock(p);
>>> +	}
>>> +	read_unlock(&tasklist_lock);
>>> +	return rss;
>>> +}
>>
>> I think you can use rcu_read_lock() instead of tasklist_lock.
>> Isn't it enough to use NR_ANON_LRU rather than this ?
> 
> Is it really just ANON_LRU ? get_mm_rss also include filepages, which
> are not in this list.

And mlocked ones counted as Unevictable
> 
> Maybe if we sum up *all* LRUs we would get the right result ?
> 
_MEM...i.e. ...usage_in_bytes is the sum of all LRUs.

> About the tasklist lock, if I get values from the LRUs, maybe. Otherwise
> it is still necessary, no ?

tasklist is RCU list and we don't need locking at reading values, I think.

Thanks,
-Kame




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/5] memcg: provide root figures from system totals
@ 2013-03-06 10:50         ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 72+ messages in thread
From: Kamezawa Hiroyuki @ 2013-03-06 10:50 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Tejun Heo, Andrew Morton, Michal Hocko,
	handai.szj-Re5JQEeQqe8AvxtiuMwx3w,
	anton.vorontsov-QSEj5FYQhm4dnm+yROfE0A, Johannes Weiner,
	Mel Gorman

(2013/03/06 17:30), Glauber Costa wrote:
> On 03/06/2013 04:27 AM, Kamezawa Hiroyuki wrote:
>> (2013/03/05 22:10), Glauber Costa wrote:
>>> For the root memcg, there is no need to rely on the res_counters if hierarchy
>>> is enabled The sum of all mem cgroups plus the tasks in root itself, is
>>> necessarily the amount of memory used for the whole system. Since those figures
>>> are already kept somewhere anyway, we can just return them here, without too
>>> much hassle.
>>>
>>> Limit and soft limit can't be set for the root cgroup, so they are left at
>>> RESOURCE_MAX. Failcnt is left at 0, because its actual meaning is how many
>>> times we failed allocations due to the limit being hit. We will fail
>>> allocations in the root cgroup, but the limit will never the reason.
>>>
>>> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
>>> CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
>>> CC: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
>>> CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
>>> CC: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
>>> CC: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
>>
>> I think this patch's calculation is wrong.
>>
> where exactly ?
> 
>>> ---
>>>    mm/memcontrol.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>    1 file changed, 64 insertions(+)
>>>
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index b8b363f..bfbf1c2 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -4996,6 +4996,56 @@ static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
>>>    	return val << PAGE_SHIFT;
>>>    }
>>>    
>>> +static u64 memcg_read_root_rss(void)
>>> +{
>>> +	struct task_struct *p;
>>> +
>>> +	u64 rss = 0;
>>> +	read_lock(&tasklist_lock);
>>> +	for_each_process(p) {
>>> +		if (!p->mm)
>>> +			continue;
>>> +		task_lock(p);
>>> +		rss += get_mm_rss(p->mm);
>>> +		task_unlock(p);
>>> +	}
>>> +	read_unlock(&tasklist_lock);
>>> +	return rss;
>>> +}
>>
>> I think you can use rcu_read_lock() instead of tasklist_lock.
>> Isn't it enough to use NR_ANON_LRU rather than this ?
> 
> Is it really just ANON_LRU ? get_mm_rss also include filepages, which
> are not in this list.

And mlocked ones counted as Unevictable
> 
> Maybe if we sum up *all* LRUs we would get the right result ?
> 
_MEM...i.e. ...usage_in_bytes is the sum of all LRUs.

> About the tasklist lock, if I get values from the LRUs, maybe. Otherwise
> it is still necessary, no ?

tasklist is RCU list and we don't need locking at reading values, I think.

Thanks,
-Kame




^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/5] memcg: provide root figures from system totals
@ 2013-03-06 10:52           ` Glauber Costa
  0 siblings, 0 replies; 72+ messages in thread
From: Glauber Costa @ 2013-03-06 10:52 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-mm, cgroups, Tejun Heo, Andrew Morton, Michal Hocko,
	handai.szj, anton.vorontsov, Johannes Weiner, Mel Gorman

On 03/06/2013 02:45 PM, Kamezawa Hiroyuki wrote:
> (2013/03/06 17:30), Glauber Costa wrote:
>> On 03/06/2013 04:27 AM, Kamezawa Hiroyuki wrote:
>>> (2013/03/05 22:10), Glauber Costa wrote:
>>>> +	case _MEMSWAP: {
>>>> +		struct sysinfo i;
>>>> +		si_swapinfo(&i);
>>>> +
>>>> +		return ((memcg_read_root_rss() +
>>>> +		atomic_long_read(&vm_stat[NR_FILE_PAGES])) << PAGE_SHIFT) +
>>>> +		i.totalswap - i.freeswap;
>>>
>>> How swapcache is handled ? ...and How kmem works with this calc ?
>>>
>> I am ignoring kmem, because we don't account kmem for the root cgroup
>> anyway.
>>
>> Setting the limit is invalid, and we don't account until the limit is
>> set. Then it will be 0, always.
>>
>> For swapcache, I am hoping that totalswap - freeswap will cover
>> everything swap related. If you think I am wrong, please enlighten me.
>>
> 
> i.totalswap - i.freeswap = # of used swap entries.
> 
> SwapCache can be rss and used swap entry at the same time. 
> 

Well, yes, but the rss entries would be accounted for in get_mm_rss(),
won't they ?

What am I missing ?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/5] memcg: provide root figures from system totals
@ 2013-03-06 10:52           ` Glauber Costa
  0 siblings, 0 replies; 72+ messages in thread
From: Glauber Costa @ 2013-03-06 10:52 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Tejun Heo, Andrew Morton, Michal Hocko,
	handai.szj-Re5JQEeQqe8AvxtiuMwx3w,
	anton.vorontsov-QSEj5FYQhm4dnm+yROfE0A, Johannes Weiner,
	Mel Gorman

On 03/06/2013 02:45 PM, Kamezawa Hiroyuki wrote:
> (2013/03/06 17:30), Glauber Costa wrote:
>> On 03/06/2013 04:27 AM, Kamezawa Hiroyuki wrote:
>>> (2013/03/05 22:10), Glauber Costa wrote:
>>>> +	case _MEMSWAP: {
>>>> +		struct sysinfo i;
>>>> +		si_swapinfo(&i);
>>>> +
>>>> +		return ((memcg_read_root_rss() +
>>>> +		atomic_long_read(&vm_stat[NR_FILE_PAGES])) << PAGE_SHIFT) +
>>>> +		i.totalswap - i.freeswap;
>>>
>>> How swapcache is handled ? ...and How kmem works with this calc ?
>>>
>> I am ignoring kmem, because we don't account kmem for the root cgroup
>> anyway.
>>
>> Setting the limit is invalid, and we don't account until the limit is
>> set. Then it will be 0, always.
>>
>> For swapcache, I am hoping that totalswap - freeswap will cover
>> everything swap related. If you think I am wrong, please enlighten me.
>>
> 
> i.totalswap - i.freeswap = # of used swap entries.
> 
> SwapCache can be rss and used swap entry at the same time. 
> 

Well, yes, but the rss entries would be accounted for in get_mm_rss(),
won't they ?

What am I missing ?

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 3/5] memcg: make it suck faster
@ 2013-03-06 10:54         ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 72+ messages in thread
From: Kamezawa Hiroyuki @ 2013-03-06 10:54 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, Tejun Heo, Andrew Morton, Michal Hocko,
	handai.szj, anton.vorontsov, Johannes Weiner, Mel Gorman

(2013/03/06 17:38), Glauber Costa wrote:
> 
>>
>>> Signed-off-by: Glauber Costa <glommer@parallels.com>
>>> CC: Michal Hocko <mhocko@suse.cz>
>>> CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>> CC: Johannes Weiner <hannes@cmpxchg.org>
>>> CC: Mel Gorman <mgorman@suse.de>
>>> CC: Andrew Morton <akpm@linux-foundation.org>
>>
>> After quick look, it seems most parts are good. But I have a concern.
>>
>> At memcg enablement, you move the numbers from vm_stat[] to res_counters.
>>
> Not only to res_counters. Mostly to mem_cgroup_stat_cpu, but I do move
> to res_counters as well.
> 
>> Why you need it ? It's not explained.
> 
> Because at this point, the bypass will no longer be in effect and we
> need accurate figures in root cgroup about what happened so far.
> 
> If we always have root-level hierarchy, then the bypass could go on
> forever. But if we have not, we'll need to rely on whatever was in there.
> 
>> And if it's necessary, uncharge will leak because page_cgroup is not marked
>> as PCG_USED, pc->mem_cgroup == NULL. So, res.usage will not be decreased.
>>
> 
> The same problem happen when deriving an mz from a page. Since
> pc->mem_cgroup will be NULL. I am interpreting that as "root mem cgroup".
> 
yes.

> Maybe even better would be to scan page cgroup writing a magic. Then if
> we see that magic we are sure it is an uninitialized pc.
> 
>> Could you fix it if you need to move numbers to res_counter ?
>>
> 
> At least for the pages in LRUs, I can scan them all, and update their
> page information. I am just wondering if this isn't a *very* expensive
> operation. Fine that we do it once, but still, is potentially scanning
> *all* pages in the system.
> 
> So I've basically decided it is better to interpret pc->mem_cgroup =
> NULL as this uninitialized state. (and can change to a magic)
> 

I think it can work. 

Thanks,
-Kame




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 3/5] memcg: make it suck faster
@ 2013-03-06 10:54         ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 72+ messages in thread
From: Kamezawa Hiroyuki @ 2013-03-06 10:54 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Tejun Heo, Andrew Morton, Michal Hocko,
	handai.szj-Re5JQEeQqe8AvxtiuMwx3w,
	anton.vorontsov-QSEj5FYQhm4dnm+yROfE0A, Johannes Weiner,
	Mel Gorman

(2013/03/06 17:38), Glauber Costa wrote:
> 
>>
>>> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
>>> CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
>>> CC: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
>>> CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
>>> CC: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
>>> CC: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
>>
>> After quick look, it seems most parts are good. But I have a concern.
>>
>> At memcg enablement, you move the numbers from vm_stat[] to res_counters.
>>
> Not only to res_counters. Mostly to mem_cgroup_stat_cpu, but I do move
> to res_counters as well.
> 
>> Why you need it ? It's not explained.
> 
> Because at this point, the bypass will no longer be in effect and we
> need accurate figures in root cgroup about what happened so far.
> 
> If we always have root-level hierarchy, then the bypass could go on
> forever. But if we have not, we'll need to rely on whatever was in there.
> 
>> And if it's necessary, uncharge will leak because page_cgroup is not marked
>> as PCG_USED, pc->mem_cgroup == NULL. So, res.usage will not be decreased.
>>
> 
> The same problem happen when deriving an mz from a page. Since
> pc->mem_cgroup will be NULL. I am interpreting that as "root mem cgroup".
> 
yes.

> Maybe even better would be to scan page cgroup writing a magic. Then if
> we see that magic we are sure it is an uninitialized pc.
> 
>> Could you fix it if you need to move numbers to res_counter ?
>>
> 
> At least for the pages in LRUs, I can scan them all, and update their
> page information. I am just wondering if this isn't a *very* expensive
> operation. Fine that we do it once, but still, is potentially scanning
> *all* pages in the system.
> 
> So I've basically decided it is better to interpret pc->mem_cgroup =
> NULL as this uninitialized state. (and can change to a magic)
> 

I think it can work. 

Thanks,
-Kame




^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/5] memcg: provide root figures from system totals
@ 2013-03-06 10:59             ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 72+ messages in thread
From: Kamezawa Hiroyuki @ 2013-03-06 10:59 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, Tejun Heo, Andrew Morton, Michal Hocko,
	handai.szj, anton.vorontsov, Johannes Weiner, Mel Gorman

(2013/03/06 19:52), Glauber Costa wrote:
> On 03/06/2013 02:45 PM, Kamezawa Hiroyuki wrote:
>> (2013/03/06 17:30), Glauber Costa wrote:
>>> On 03/06/2013 04:27 AM, Kamezawa Hiroyuki wrote:
>>>> (2013/03/05 22:10), Glauber Costa wrote:
>>>>> +	case _MEMSWAP: {
>>>>> +		struct sysinfo i;
>>>>> +		si_swapinfo(&i);
>>>>> +
>>>>> +		return ((memcg_read_root_rss() +
>>>>> +		atomic_long_read(&vm_stat[NR_FILE_PAGES])) << PAGE_SHIFT) +
>>>>> +		i.totalswap - i.freeswap;
>>>>
>>>> How swapcache is handled ? ...and How kmem works with this calc ?
>>>>
>>> I am ignoring kmem, because we don't account kmem for the root cgroup
>>> anyway.
>>>
>>> Setting the limit is invalid, and we don't account until the limit is
>>> set. Then it will be 0, always.
>>>
>>> For swapcache, I am hoping that totalswap - freeswap will cover
>>> everything swap related. If you think I am wrong, please enlighten me.
>>>
>>
>> i.totalswap - i.freeswap = # of used swap entries.
>>
>> SwapCache can be rss and used swap entry at the same time.
>>
> 
> Well, yes, but the rss entries would be accounted for in get_mm_rss(),
> won't they ?
> 
> What am I missing ?


I think the correct caluculation is

  Sum of all RSS + All file caches + (i.total_swap - i.freeswap - # of mapped SwapCache)


In the patch, mapped SwapCache is counted as both of rss and swap.

BTW, how about

  Sum of all LRU + (i.total_swap - i.freeswap - # of all SwapCache)
?

Thanks,
-Kame









--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/5] memcg: provide root figures from system totals
@ 2013-03-06 10:59             ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 72+ messages in thread
From: Kamezawa Hiroyuki @ 2013-03-06 10:59 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Tejun Heo, Andrew Morton, Michal Hocko,
	handai.szj-Re5JQEeQqe8AvxtiuMwx3w,
	anton.vorontsov-QSEj5FYQhm4dnm+yROfE0A, Johannes Weiner,
	Mel Gorman

(2013/03/06 19:52), Glauber Costa wrote:
> On 03/06/2013 02:45 PM, Kamezawa Hiroyuki wrote:
>> (2013/03/06 17:30), Glauber Costa wrote:
>>> On 03/06/2013 04:27 AM, Kamezawa Hiroyuki wrote:
>>>> (2013/03/05 22:10), Glauber Costa wrote:
>>>>> +	case _MEMSWAP: {
>>>>> +		struct sysinfo i;
>>>>> +		si_swapinfo(&i);
>>>>> +
>>>>> +		return ((memcg_read_root_rss() +
>>>>> +		atomic_long_read(&vm_stat[NR_FILE_PAGES])) << PAGE_SHIFT) +
>>>>> +		i.totalswap - i.freeswap;
>>>>
>>>> How swapcache is handled ? ...and How kmem works with this calc ?
>>>>
>>> I am ignoring kmem, because we don't account kmem for the root cgroup
>>> anyway.
>>>
>>> Setting the limit is invalid, and we don't account until the limit is
>>> set. Then it will be 0, always.
>>>
>>> For swapcache, I am hoping that totalswap - freeswap will cover
>>> everything swap related. If you think I am wrong, please enlighten me.
>>>
>>
>> i.totalswap - i.freeswap = # of used swap entries.
>>
>> SwapCache can be rss and used swap entry at the same time.
>>
> 
> Well, yes, but the rss entries would be accounted for in get_mm_rss(),
> won't they ?
> 
> What am I missing ?


I think the correct caluculation is

  Sum of all RSS + All file caches + (i.total_swap - i.freeswap - # of mapped SwapCache)


In the patch, mapped SwapCache is counted as both of rss and swap.

BTW, how about

  Sum of all LRU + (i.total_swap - i.freeswap - # of all SwapCache)
?

Thanks,
-Kame









^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/5] memcg: provide root figures from system totals
@ 2013-03-13  6:58               ` Sha Zhengju
  0 siblings, 0 replies; 72+ messages in thread
From: Sha Zhengju @ 2013-03-13  6:58 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Glauber Costa, linux-mm, cgroups, Tejun Heo, Andrew Morton,
	Michal Hocko, anton.vorontsov, Johannes Weiner, Mel Gorman

On Wed, Mar 6, 2013 at 6:59 PM, Kamezawa Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> (2013/03/06 19:52), Glauber Costa wrote:
>> On 03/06/2013 02:45 PM, Kamezawa Hiroyuki wrote:
>>> (2013/03/06 17:30), Glauber Costa wrote:
>>>> On 03/06/2013 04:27 AM, Kamezawa Hiroyuki wrote:
>>>>> (2013/03/05 22:10), Glauber Costa wrote:
>>>>>> + case _MEMSWAP: {
>>>>>> +         struct sysinfo i;
>>>>>> +         si_swapinfo(&i);
>>>>>> +
>>>>>> +         return ((memcg_read_root_rss() +
>>>>>> +         atomic_long_read(&vm_stat[NR_FILE_PAGES])) << PAGE_SHIFT) +
>>>>>> +         i.totalswap - i.freeswap;
>>>>>
>>>>> How swapcache is handled ? ...and How kmem works with this calc ?
>>>>>
>>>> I am ignoring kmem, because we don't account kmem for the root cgroup
>>>> anyway.
>>>>
>>>> Setting the limit is invalid, and we don't account until the limit is
>>>> set. Then it will be 0, always.
>>>>
>>>> For swapcache, I am hoping that totalswap - freeswap will cover
>>>> everything swap related. If you think I am wrong, please enlighten me.
>>>>
>>>
>>> i.totalswap - i.freeswap = # of used swap entries.
>>>
>>> SwapCache can be rss and used swap entry at the same time.
>>>
>>
>> Well, yes, but the rss entries would be accounted for in get_mm_rss(),
>> won't they ?
>>
>> What am I missing ?
>
>
> I think the correct caluculation is
>
>   Sum of all RSS + All file caches + (i.total_swap - i.freeswap - # of mapped SwapCache)
>
>
> In the patch, mapped SwapCache is counted as both of rss and swap.
>

After a quick look, swapcache is counted as file pages and meanwhile
use a swap entry at the same time(__add_to{delete_from}_swap_cache()).
Even though, I think we still do not need to exclude swapcache out,
because it indeed uses two copy of resource: one is swap entry, one is
cache, so the usage should count both of them in.

What I think it matters is that swapcache may be counted as both file
pages and rss(if it's a process's anonymous page), which we need to
subtract # of swapcache to avoid double-counting. But it isn't always
so: a shmem/tmpfs page may use swapcache and be counted as file pages
but not a rss, then we can not subtract swapcache... Is there anything
I lost?

> BTW, how about
>
>   Sum of all LRU + (i.total_swap - i.freeswap - # of all SwapCache)
> ?
>
> Thanks,
> -Kame
>
>
>
>
>
>
>
>
>



-- 
Thanks,
Sha

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/5] memcg: provide root figures from system totals
@ 2013-03-13  6:58               ` Sha Zhengju
  0 siblings, 0 replies; 72+ messages in thread
From: Sha Zhengju @ 2013-03-13  6:58 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Glauber Costa, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, Andrew Morton,
	Michal Hocko, anton.vorontsov-QSEj5FYQhm4dnm+yROfE0A,
	Johannes Weiner, Mel Gorman

On Wed, Mar 6, 2013 at 6:59 PM, Kamezawa Hiroyuki
<kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> wrote:
> (2013/03/06 19:52), Glauber Costa wrote:
>> On 03/06/2013 02:45 PM, Kamezawa Hiroyuki wrote:
>>> (2013/03/06 17:30), Glauber Costa wrote:
>>>> On 03/06/2013 04:27 AM, Kamezawa Hiroyuki wrote:
>>>>> (2013/03/05 22:10), Glauber Costa wrote:
>>>>>> + case _MEMSWAP: {
>>>>>> +         struct sysinfo i;
>>>>>> +         si_swapinfo(&i);
>>>>>> +
>>>>>> +         return ((memcg_read_root_rss() +
>>>>>> +         atomic_long_read(&vm_stat[NR_FILE_PAGES])) << PAGE_SHIFT) +
>>>>>> +         i.totalswap - i.freeswap;
>>>>>
>>>>> How swapcache is handled ? ...and How kmem works with this calc ?
>>>>>
>>>> I am ignoring kmem, because we don't account kmem for the root cgroup
>>>> anyway.
>>>>
>>>> Setting the limit is invalid, and we don't account until the limit is
>>>> set. Then it will be 0, always.
>>>>
>>>> For swapcache, I am hoping that totalswap - freeswap will cover
>>>> everything swap related. If you think I am wrong, please enlighten me.
>>>>
>>>
>>> i.totalswap - i.freeswap = # of used swap entries.
>>>
>>> SwapCache can be rss and used swap entry at the same time.
>>>
>>
>> Well, yes, but the rss entries would be accounted for in get_mm_rss(),
>> won't they ?
>>
>> What am I missing ?
>
>
> I think the correct caluculation is
>
>   Sum of all RSS + All file caches + (i.total_swap - i.freeswap - # of mapped SwapCache)
>
>
> In the patch, mapped SwapCache is counted as both of rss and swap.
>

After a quick look, swapcache is counted as file pages and meanwhile
use a swap entry at the same time(__add_to{delete_from}_swap_cache()).
Even though, I think we still do not need to exclude swapcache out,
because it indeed uses two copy of resource: one is swap entry, one is
cache, so the usage should count both of them in.

What I think it matters is that swapcache may be counted as both file
pages and rss(if it's a process's anonymous page), which we need to
subtract # of swapcache to avoid double-counting. But it isn't always
so: a shmem/tmpfs page may use swapcache and be counted as file pages
but not a rss, then we can not subtract swapcache... Is there anything
I lost?

> BTW, how about
>
>   Sum of all LRU + (i.total_swap - i.freeswap - # of all SwapCache)
> ?
>
> Thanks,
> -Kame
>
>
>
>
>
>
>
>
>



-- 
Thanks,
Sha

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 3/5] memcg: make it suck faster
@ 2013-03-13  8:08     ` Sha Zhengju
  0 siblings, 0 replies; 72+ messages in thread
From: Sha Zhengju @ 2013-03-13  8:08 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, Tejun Heo, Andrew Morton, Michal Hocko,
	kamezawa.hiroyu, anton.vorontsov, Johannes Weiner, Mel Gorman

On Tue, Mar 5, 2013 at 9:10 PM, Glauber Costa <glommer@parallels.com> wrote:
> It is an accepted fact that memcg sucks. But can it suck faster?  Or in
> a more fair statement, can it at least stop draining everyone's
> performance when it is not in use?
>
> This experimental and slightly crude patch demonstrates that we can do
> that by using static branches to patch it out until the first memcg
> comes to life. There are edges to be trimmed, and I appreciate comments
> for direction. In particular, the events in the root are not fired, but
> I believe this can be done without further problems by calling a
> specialized event check from mem_cgroup_newpage_charge().
>
> My goal was to have enough numbers to demonstrate the performance gain
> that can come from it. I tested it in a 24-way 2-socket Intel box, 24 Gb
> mem. I used Mel Gorman's pft test, that he used to demonstrate this
> problem back in the Kernel Summit. There are three kernels:
>
> nomemcg  : memcg compile disabled.
> base     : memcg enabled, patch not applied.
> bypassed : memcg enabled, with patch applied.
>
>                 base    bypassed
> User          109.12      105.64
> System       1646.84     1597.98
> Elapsed       229.56      215.76
>
>              nomemcg    bypassed
> User          104.35      105.64
> System       1578.19     1597.98
> Elapsed       212.33      215.76
>
> So as one can see, the difference between base and nomemcg in terms
> of both system time and elapsed time is quite drastic, and consistent
> with the figures shown by Mel Gorman in the Kernel summit. This is a
> ~ 7 % drop in performance, just by having memcg enabled. memcg functions
> appear heavily in the profiles, even if all tasks lives in the root
> memcg.
>
> With bypassed kernel, we drop this down to 1.5 %, which starts to fall
> in the acceptable range. More investigation is needed to see if we can
> claim that last percent back, but I believe at last part of it should
> be.
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: Michal Hocko <mhocko@suse.cz>
> CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Johannes Weiner <hannes@cmpxchg.org>
> CC: Mel Gorman <mgorman@suse.de>
> CC: Andrew Morton <akpm@linux-foundation.org>
> ---
>  include/linux/memcontrol.h |  72 ++++++++++++++++----
>  mm/memcontrol.c            | 166 +++++++++++++++++++++++++++++++++++++++++----
>  mm/page_cgroup.c           |   4 +-
>  3 files changed, 216 insertions(+), 26 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index d6183f0..009f925 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -42,6 +42,26 @@ struct mem_cgroup_reclaim_cookie {
>  };
>
>  #ifdef CONFIG_MEMCG
> +extern struct static_key memcg_in_use_key;
> +
> +static inline bool mem_cgroup_subsys_disabled(void)
> +{
> +       return !!mem_cgroup_subsys.disabled;
> +}
> +
> +static inline bool mem_cgroup_disabled(void)
> +{
> +       /*
> +        * Will always be false if subsys is disabled, because we have no one
> +        * to bump it up. So the test suffices and we don't have to test the
> +        * subsystem as well
> +        */
> +       if (!static_key_false(&memcg_in_use_key))
> +               return true;
> +       return false;
> +}
> +
> +
>  /*
>   * All "charge" functions with gfp_mask should use GFP_KERNEL or
>   * (gfp_mask & GFP_RECLAIM_MASK). In current implementatin, memcg doesn't
> @@ -53,8 +73,18 @@ struct mem_cgroup_reclaim_cookie {
>   * (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
>   */
>
> -extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
> +extern int __mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
>                                 gfp_t gfp_mask);
> +
> +static inline int
> +mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
> +                         gfp_t gfp_mask)
> +{
> +       if (mem_cgroup_disabled())
> +               return 0;
> +       return __mem_cgroup_newpage_charge(page, mm, gfp_mask);
> +}
> +
>  /* for swap handling */
>  extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
>                 struct page *page, gfp_t mask, struct mem_cgroup **memcgp);
> @@ -62,8 +92,17 @@ extern void mem_cgroup_commit_charge_swapin(struct page *page,
>                                         struct mem_cgroup *memcg);
>  extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);
>
> -extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> -                                       gfp_t gfp_mask);
> +
> +extern int __mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> +                                    gfp_t gfp_mask);
> +static inline int
> +mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
> +{
> +       if (mem_cgroup_disabled())
> +               return 0;
> +
> +       return __mem_cgroup_cache_charge(page, mm, gfp_mask);
> +}
>
>  struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
>  struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
> @@ -72,8 +111,24 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
>  extern void mem_cgroup_uncharge_start(void);
>  extern void mem_cgroup_uncharge_end(void);
>
> -extern void mem_cgroup_uncharge_page(struct page *page);
> -extern void mem_cgroup_uncharge_cache_page(struct page *page);
> +extern void __mem_cgroup_uncharge_page(struct page *page);
> +extern void __mem_cgroup_uncharge_cache_page(struct page *page);
> +
> +static inline void mem_cgroup_uncharge_page(struct page *page)
> +{
> +       if (mem_cgroup_disabled())
> +               return;
> +
> +       __mem_cgroup_uncharge_page(page);
> +}
> +
> +static inline void mem_cgroup_uncharge_cache_page(struct page *page)
> +{
> +       if (mem_cgroup_disabled())
> +               return;
> +
> +       __mem_cgroup_uncharge_cache_page(page);
> +}
>
>  bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
>                                   struct mem_cgroup *memcg);
> @@ -128,13 +183,6 @@ extern void mem_cgroup_replace_page_cache(struct page *oldpage,
>  extern int do_swap_account;
>  #endif
>
> -static inline bool mem_cgroup_disabled(void)
> -{
> -       if (mem_cgroup_subsys.disabled)
> -               return true;
> -       return false;
> -}
> -
>  void __mem_cgroup_begin_update_page_stat(struct page *page, bool *locked,
>                                          unsigned long *flags);
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index bfbf1c2..45c1886 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -575,6 +575,9 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
>         return (memcg == root_mem_cgroup);
>  }
>
> +static bool memcg_charges_allowed = false;
> +struct static_key memcg_in_use_key;
> +
>  /* Writing them here to avoid exposing memcg's inner layout */
>  #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
>
> @@ -710,6 +713,7 @@ static void disarm_static_keys(struct mem_cgroup *memcg)
>  {
>         disarm_sock_keys(memcg);
>         disarm_kmem_keys(memcg);
> +       static_key_slow_dec(&memcg_in_use_key);
>  }
>
>  static void drain_all_stock_async(struct mem_cgroup *memcg);
> @@ -1109,6 +1113,9 @@ struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
>         if (unlikely(!p))
>                 return NULL;
>
> +       if (mem_cgroup_disabled())
> +               return root_mem_cgroup;
> +
>         return mem_cgroup_from_css(task_subsys_state(p, mem_cgroup_subsys_id));
>  }
>
> @@ -1157,9 +1164,12 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>         struct mem_cgroup *memcg = NULL;
>         int id = 0;
>
> -       if (mem_cgroup_disabled())
> +       if (mem_cgroup_subsys_disabled())
>                 return NULL;
>
> +       if (mem_cgroup_disabled())
> +               return root_mem_cgroup;
> +
>         if (!root)
>                 root = root_mem_cgroup;
>
> @@ -1335,6 +1345,20 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
>         memcg = pc->mem_cgroup;
>
>         /*
> +        * Because we lazily enable memcg only after first child group is
> +        * created, we can have memcg == 0. Because page cgroup is created with
> +        * GFP_ZERO, and after charging, all page cgroups will have a non-zero
> +        * cgroup attached (even if root), we can be sure that this is a
> +        * used-but-not-accounted page. (due to lazyness). We could get around
> +        * that by scanning all pages on cgroup init is too expensive. We can
> +        * ultimately pay, but prefer to just to defer the update until we get
> +        * here. We could take the opportunity to set PageCgroupUsed, but it
> +        * won't be that important for the root cgroup.
> +        */
> +       if (!memcg && PageLRU(page))
> +               pc->mem_cgroup = memcg = root_mem_cgroup;
> +
> +       /*
>          * Surreptitiously switch any uncharged offlist page to root:
>          * an uncharged page off lru does nothing to secure
>          * its former mem_cgroup from sudden removal.
> @@ -3845,11 +3869,18 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
>         return 0;
>  }
>
> -int mem_cgroup_newpage_charge(struct page *page,
> +int __mem_cgroup_newpage_charge(struct page *page,
>                               struct mm_struct *mm, gfp_t gfp_mask)
>  {
> -       if (mem_cgroup_disabled())
> +       /*
> +        * The branch is actually very likely before the first memcg comes in.
> +        * But since the code is patched out, we'll never reach it. It is only
> +        * reachable when the code is patched in, and in that case it is
> +        * unlikely.  It will only happen during initial charges move.
> +        */
> +       if (unlikely(!memcg_charges_allowed))
>                 return 0;
> +
>         VM_BUG_ON(page_mapped(page));
>         VM_BUG_ON(page->mapping && !PageAnon(page));
>         VM_BUG_ON(!mm);
> @@ -3962,15 +3993,13 @@ void mem_cgroup_commit_charge_swapin(struct page *page,
>                                           MEM_CGROUP_CHARGE_TYPE_ANON);
>  }
>
> -int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> -                               gfp_t gfp_mask)
> +int __mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> +                             gfp_t gfp_mask)
>  {
>         struct mem_cgroup *memcg = NULL;
>         enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE;
>         int ret;
>
> -       if (mem_cgroup_disabled())
> -               return 0;
>         if (PageCompound(page))
>                 return 0;
>
> @@ -4050,9 +4079,6 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype,
>         struct page_cgroup *pc;
>         bool anon;
>
> -       if (mem_cgroup_disabled())
> -               return NULL;
> -
>         VM_BUG_ON(PageSwapCache(page));
>
>         if (PageTransHuge(page)) {
> @@ -4144,7 +4170,7 @@ unlock_out:
>         return NULL;
>  }
>
> -void mem_cgroup_uncharge_page(struct page *page)
> +void __mem_cgroup_uncharge_page(struct page *page)
>  {
>         /* early check. */
>         if (page_mapped(page))
> @@ -4155,7 +4181,7 @@ void mem_cgroup_uncharge_page(struct page *page)
>         __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_ANON, false);
>  }
>
> -void mem_cgroup_uncharge_cache_page(struct page *page)
> +void __mem_cgroup_uncharge_cache_page(struct page *page)
>  {
>         VM_BUG_ON(page_mapped(page));
>         VM_BUG_ON(page->mapping);
> @@ -4220,6 +4246,9 @@ mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
>         struct mem_cgroup *memcg;
>         int ctype = MEM_CGROUP_CHARGE_TYPE_SWAPOUT;
>
> +       if (mem_cgroup_disabled())
> +               return;
> +
>         if (!swapout) /* this was a swap cache but the swap is unused ! */
>                 ctype = MEM_CGROUP_CHARGE_TYPE_DROP;
>
> @@ -6364,6 +6393,59 @@ free_out:
>         return ERR_PTR(error);
>  }
>
> +static void memcg_update_root_statistics(void)
> +{
> +       int cpu;
> +       u64 pgin, pgout, faults, mjfaults;
> +
> +       pgin = pgout = faults = mjfaults = 0;
> +       for_each_online_cpu(cpu) {
> +               struct vm_event_state *ev = &per_cpu(vm_event_states, cpu);
> +               struct mem_cgroup_stat_cpu *memcg_stat;
> +
> +               memcg_stat = per_cpu_ptr(root_mem_cgroup->stat, cpu);
> +
> +               memcg_stat->events[MEM_CGROUP_EVENTS_PGPGIN] =
> +                                                       ev->event[PGPGIN];
> +               memcg_stat->events[MEM_CGROUP_EVENTS_PGPGOUT] =
> +                                                       ev->event[PGPGOUT];

ev->event[PGPGIN/PGPGOUT] is counted in block layer(submit_bio()) and
represents the exactly number of pagein/pageout, but memcg
PGPGIN/PGPGOUT events only count it as an event and ignore the page
size. So here we can't straightforward take the ev->events for use.

> +               memcg_stat->events[MEM_CGROUP_EVENTS_PGFAULT] =
> +                                                       ev->event[PGFAULT];
> +               memcg_stat->events[MEM_CGROUP_EVENTS_PGMAJFAULT] =
> +                                                       ev->event[PGMAJFAULT];
> +
> +               memcg_stat->nr_page_events = ev->event[PGPGIN] +
> +                                            ev->event[PGPGOUT];

There's no valid memcg->nr_page_events until now, so the threshold
notifier, but some people may use it even only root memcg exists.
Moreover, using PGPGIN + PGPGOUT(exactly number of pagein + pageout)
as nr_page_events is also inaccurate IMHO.

> +       }
> +
> +       root_mem_cgroup->nocpu_base.count[MEM_CGROUP_STAT_RSS] =
> +                               memcg_read_root_rss();
> +       root_mem_cgroup->nocpu_base.count[MEM_CGROUP_STAT_CACHE] =
> +                               atomic_long_read(&vm_stat[NR_FILE_PAGES]);
> +       root_mem_cgroup->nocpu_base.count[MEM_CGROUP_STAT_FILE_MAPPED] =
> +                               atomic_long_read(&vm_stat[NR_FILE_MAPPED]);
> +}
> +
> +static void memcg_update_root_lru(void)
> +{
> +       struct zone *zone;
> +       struct lruvec *lruvec;
> +       struct mem_cgroup_per_zone *mz;
> +       enum lru_list lru;
> +
> +       for_each_populated_zone(zone) {
> +               spin_lock_irq(&zone->lru_lock);
> +               lruvec = &zone->lruvec;
> +               mz = mem_cgroup_zoneinfo(root_mem_cgroup,
> +                               zone_to_nid(zone), zone_idx(zone));
> +
> +               for (lru = LRU_BASE; lru < NR_LRU_LISTS; lru++)
> +                       mz->lru_size[lru] =
> +                               zone_page_state(zone, NR_LRU_BASE + lru);
> +               spin_unlock_irq(&zone->lru_lock);
> +       }
> +}
> +
>  static int
>  mem_cgroup_css_online(struct cgroup *cont)
>  {
> @@ -6407,6 +6489,66 @@ mem_cgroup_css_online(struct cgroup *cont)
>         }
>
>         error = memcg_init_kmem(memcg, &mem_cgroup_subsys);
> +
> +       if (!error) {
> +               static_key_slow_inc(&memcg_in_use_key);
> +               /*
> +                * The strategy to avoid races here is to let the charges just
> +                * be globally made until we lock the res counter. Since we are
> +                * copying charges from global statistics, it doesn't really
> +                * matter when we do it, as long as we are consistent. So even
> +                * after the code is patched in, they will continue being
> +                * globally charged due to memcg_charges_allowed being set to
> +                * false.
> +                *
> +                * Once we hold the res counter lock, though, we can already
> +                * safely flip it: We will go through with the charging to the
> +                * root memcg, but won't be able to actually charge it: we have
> +                * the lock.
> +                *
> +                * This works because the mm stats are only updated after the
> +                * memcg charging suceeds. If we block the charge by holding
> +                * the res_counter lock, no other charges will happen in the
> +                * system until we release it.
> +                *
> +                * manipulation always safe because the write side is always
> +                * under the memcg_mutex.
> +                */
> +               if (!memcg_charges_allowed) {
> +                       struct zone *zone;
> +
> +                       get_online_cpus();
> +                       spin_lock(&root_mem_cgroup->res.lock);
> +
> +                       memcg_charges_allowed = true;
> +
> +                       root_mem_cgroup->res.usage =
> +                               mem_cgroup_read_root(RES_USAGE, _MEM);
> +                       root_mem_cgroup->memsw.usage =
> +                               mem_cgroup_read_root(RES_USAGE, _MEMSWAP);
> +                       /*
> +                        * The max usage figure is not entirely accurate. The
> +                        * memory may have been higher in the past. But since
> +                        * we don't track that globally, this is the best we
> +                        * can do.
> +                        */
> +                       root_mem_cgroup->res.max_usage =
> +                                       root_mem_cgroup->res.usage;
> +                       root_mem_cgroup->memsw.max_usage =
> +                                       root_mem_cgroup->memsw.usage;
> +
> +                       memcg_update_root_statistics();
> +                       memcg_update_root_lru();
> +                       /*
> +                        * We are now 100 % consistent and all charges are
> +                        * transfered.  New charges should reach the
> +                        * res_counter directly.
> +                        */
> +                       spin_unlock(&root_mem_cgroup->res.lock);
> +                       put_online_cpus();
> +               }
> +       }
> +
>         mutex_unlock(&memcg_create_mutex);
>         if (error) {
>                 /*
> diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
> index 6d757e3..a5bd322 100644
> --- a/mm/page_cgroup.c
> +++ b/mm/page_cgroup.c
> @@ -68,7 +68,7 @@ void __init page_cgroup_init_flatmem(void)
>
>         int nid, fail;
>
> -       if (mem_cgroup_disabled())
> +       if (mem_cgroup_subsys_disabled())
>                 return;
>
>         for_each_online_node(nid)  {
> @@ -271,7 +271,7 @@ void __init page_cgroup_init(void)
>         unsigned long pfn;
>         int nid;
>
> -       if (mem_cgroup_disabled())
> +       if (mem_cgroup_subsys_disabled())
>                 return;
>
>         for_each_node_state(nid, N_MEMORY) {
> --
> 1.8.1.2
>



-- 
Thanks,
Sha

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 3/5] memcg: make it suck faster
@ 2013-03-13  8:08     ` Sha Zhengju
  0 siblings, 0 replies; 72+ messages in thread
From: Sha Zhengju @ 2013-03-13  8:08 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Tejun Heo, Andrew Morton, Michal Hocko,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	anton.vorontsov-QSEj5FYQhm4dnm+yROfE0A, Johannes Weiner,
	Mel Gorman

On Tue, Mar 5, 2013 at 9:10 PM, Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> wrote:
> It is an accepted fact that memcg sucks. But can it suck faster?  Or in
> a more fair statement, can it at least stop draining everyone's
> performance when it is not in use?
>
> This experimental and slightly crude patch demonstrates that we can do
> that by using static branches to patch it out until the first memcg
> comes to life. There are edges to be trimmed, and I appreciate comments
> for direction. In particular, the events in the root are not fired, but
> I believe this can be done without further problems by calling a
> specialized event check from mem_cgroup_newpage_charge().
>
> My goal was to have enough numbers to demonstrate the performance gain
> that can come from it. I tested it in a 24-way 2-socket Intel box, 24 Gb
> mem. I used Mel Gorman's pft test, that he used to demonstrate this
> problem back in the Kernel Summit. There are three kernels:
>
> nomemcg  : memcg compile disabled.
> base     : memcg enabled, patch not applied.
> bypassed : memcg enabled, with patch applied.
>
>                 base    bypassed
> User          109.12      105.64
> System       1646.84     1597.98
> Elapsed       229.56      215.76
>
>              nomemcg    bypassed
> User          104.35      105.64
> System       1578.19     1597.98
> Elapsed       212.33      215.76
>
> So as one can see, the difference between base and nomemcg in terms
> of both system time and elapsed time is quite drastic, and consistent
> with the figures shown by Mel Gorman in the Kernel summit. This is a
> ~ 7 % drop in performance, just by having memcg enabled. memcg functions
> appear heavily in the profiles, even if all tasks lives in the root
> memcg.
>
> With bypassed kernel, we drop this down to 1.5 %, which starts to fall
> in the acceptable range. More investigation is needed to see if we can
> claim that last percent back, but I believe at last part of it should
> be.
>
> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> CC: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> CC: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
> CC: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
> ---
>  include/linux/memcontrol.h |  72 ++++++++++++++++----
>  mm/memcontrol.c            | 166 +++++++++++++++++++++++++++++++++++++++++----
>  mm/page_cgroup.c           |   4 +-
>  3 files changed, 216 insertions(+), 26 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index d6183f0..009f925 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -42,6 +42,26 @@ struct mem_cgroup_reclaim_cookie {
>  };
>
>  #ifdef CONFIG_MEMCG
> +extern struct static_key memcg_in_use_key;
> +
> +static inline bool mem_cgroup_subsys_disabled(void)
> +{
> +       return !!mem_cgroup_subsys.disabled;
> +}
> +
> +static inline bool mem_cgroup_disabled(void)
> +{
> +       /*
> +        * Will always be false if subsys is disabled, because we have no one
> +        * to bump it up. So the test suffices and we don't have to test the
> +        * subsystem as well
> +        */
> +       if (!static_key_false(&memcg_in_use_key))
> +               return true;
> +       return false;
> +}
> +
> +
>  /*
>   * All "charge" functions with gfp_mask should use GFP_KERNEL or
>   * (gfp_mask & GFP_RECLAIM_MASK). In current implementatin, memcg doesn't
> @@ -53,8 +73,18 @@ struct mem_cgroup_reclaim_cookie {
>   * (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
>   */
>
> -extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
> +extern int __mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
>                                 gfp_t gfp_mask);
> +
> +static inline int
> +mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
> +                         gfp_t gfp_mask)
> +{
> +       if (mem_cgroup_disabled())
> +               return 0;
> +       return __mem_cgroup_newpage_charge(page, mm, gfp_mask);
> +}
> +
>  /* for swap handling */
>  extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
>                 struct page *page, gfp_t mask, struct mem_cgroup **memcgp);
> @@ -62,8 +92,17 @@ extern void mem_cgroup_commit_charge_swapin(struct page *page,
>                                         struct mem_cgroup *memcg);
>  extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);
>
> -extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> -                                       gfp_t gfp_mask);
> +
> +extern int __mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> +                                    gfp_t gfp_mask);
> +static inline int
> +mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
> +{
> +       if (mem_cgroup_disabled())
> +               return 0;
> +
> +       return __mem_cgroup_cache_charge(page, mm, gfp_mask);
> +}
>
>  struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
>  struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
> @@ -72,8 +111,24 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
>  extern void mem_cgroup_uncharge_start(void);
>  extern void mem_cgroup_uncharge_end(void);
>
> -extern void mem_cgroup_uncharge_page(struct page *page);
> -extern void mem_cgroup_uncharge_cache_page(struct page *page);
> +extern void __mem_cgroup_uncharge_page(struct page *page);
> +extern void __mem_cgroup_uncharge_cache_page(struct page *page);
> +
> +static inline void mem_cgroup_uncharge_page(struct page *page)
> +{
> +       if (mem_cgroup_disabled())
> +               return;
> +
> +       __mem_cgroup_uncharge_page(page);
> +}
> +
> +static inline void mem_cgroup_uncharge_cache_page(struct page *page)
> +{
> +       if (mem_cgroup_disabled())
> +               return;
> +
> +       __mem_cgroup_uncharge_cache_page(page);
> +}
>
>  bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
>                                   struct mem_cgroup *memcg);
> @@ -128,13 +183,6 @@ extern void mem_cgroup_replace_page_cache(struct page *oldpage,
>  extern int do_swap_account;
>  #endif
>
> -static inline bool mem_cgroup_disabled(void)
> -{
> -       if (mem_cgroup_subsys.disabled)
> -               return true;
> -       return false;
> -}
> -
>  void __mem_cgroup_begin_update_page_stat(struct page *page, bool *locked,
>                                          unsigned long *flags);
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index bfbf1c2..45c1886 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -575,6 +575,9 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
>         return (memcg == root_mem_cgroup);
>  }
>
> +static bool memcg_charges_allowed = false;
> +struct static_key memcg_in_use_key;
> +
>  /* Writing them here to avoid exposing memcg's inner layout */
>  #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
>
> @@ -710,6 +713,7 @@ static void disarm_static_keys(struct mem_cgroup *memcg)
>  {
>         disarm_sock_keys(memcg);
>         disarm_kmem_keys(memcg);
> +       static_key_slow_dec(&memcg_in_use_key);
>  }
>
>  static void drain_all_stock_async(struct mem_cgroup *memcg);
> @@ -1109,6 +1113,9 @@ struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
>         if (unlikely(!p))
>                 return NULL;
>
> +       if (mem_cgroup_disabled())
> +               return root_mem_cgroup;
> +
>         return mem_cgroup_from_css(task_subsys_state(p, mem_cgroup_subsys_id));
>  }
>
> @@ -1157,9 +1164,12 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>         struct mem_cgroup *memcg = NULL;
>         int id = 0;
>
> -       if (mem_cgroup_disabled())
> +       if (mem_cgroup_subsys_disabled())
>                 return NULL;
>
> +       if (mem_cgroup_disabled())
> +               return root_mem_cgroup;
> +
>         if (!root)
>                 root = root_mem_cgroup;
>
> @@ -1335,6 +1345,20 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
>         memcg = pc->mem_cgroup;
>
>         /*
> +        * Because we lazily enable memcg only after first child group is
> +        * created, we can have memcg == 0. Because page cgroup is created with
> +        * GFP_ZERO, and after charging, all page cgroups will have a non-zero
> +        * cgroup attached (even if root), we can be sure that this is a
> +        * used-but-not-accounted page. (due to lazyness). We could get around
> +        * that by scanning all pages on cgroup init is too expensive. We can
> +        * ultimately pay, but prefer to just to defer the update until we get
> +        * here. We could take the opportunity to set PageCgroupUsed, but it
> +        * won't be that important for the root cgroup.
> +        */
> +       if (!memcg && PageLRU(page))
> +               pc->mem_cgroup = memcg = root_mem_cgroup;
> +
> +       /*
>          * Surreptitiously switch any uncharged offlist page to root:
>          * an uncharged page off lru does nothing to secure
>          * its former mem_cgroup from sudden removal.
> @@ -3845,11 +3869,18 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
>         return 0;
>  }
>
> -int mem_cgroup_newpage_charge(struct page *page,
> +int __mem_cgroup_newpage_charge(struct page *page,
>                               struct mm_struct *mm, gfp_t gfp_mask)
>  {
> -       if (mem_cgroup_disabled())
> +       /*
> +        * The branch is actually very likely before the first memcg comes in.
> +        * But since the code is patched out, we'll never reach it. It is only
> +        * reachable when the code is patched in, and in that case it is
> +        * unlikely.  It will only happen during initial charges move.
> +        */
> +       if (unlikely(!memcg_charges_allowed))
>                 return 0;
> +
>         VM_BUG_ON(page_mapped(page));
>         VM_BUG_ON(page->mapping && !PageAnon(page));
>         VM_BUG_ON(!mm);
> @@ -3962,15 +3993,13 @@ void mem_cgroup_commit_charge_swapin(struct page *page,
>                                           MEM_CGROUP_CHARGE_TYPE_ANON);
>  }
>
> -int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> -                               gfp_t gfp_mask)
> +int __mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> +                             gfp_t gfp_mask)
>  {
>         struct mem_cgroup *memcg = NULL;
>         enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE;
>         int ret;
>
> -       if (mem_cgroup_disabled())
> -               return 0;
>         if (PageCompound(page))
>                 return 0;
>
> @@ -4050,9 +4079,6 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype,
>         struct page_cgroup *pc;
>         bool anon;
>
> -       if (mem_cgroup_disabled())
> -               return NULL;
> -
>         VM_BUG_ON(PageSwapCache(page));
>
>         if (PageTransHuge(page)) {
> @@ -4144,7 +4170,7 @@ unlock_out:
>         return NULL;
>  }
>
> -void mem_cgroup_uncharge_page(struct page *page)
> +void __mem_cgroup_uncharge_page(struct page *page)
>  {
>         /* early check. */
>         if (page_mapped(page))
> @@ -4155,7 +4181,7 @@ void mem_cgroup_uncharge_page(struct page *page)
>         __mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_ANON, false);
>  }
>
> -void mem_cgroup_uncharge_cache_page(struct page *page)
> +void __mem_cgroup_uncharge_cache_page(struct page *page)
>  {
>         VM_BUG_ON(page_mapped(page));
>         VM_BUG_ON(page->mapping);
> @@ -4220,6 +4246,9 @@ mem_cgroup_uncharge_swapcache(struct page *page, swp_entry_t ent, bool swapout)
>         struct mem_cgroup *memcg;
>         int ctype = MEM_CGROUP_CHARGE_TYPE_SWAPOUT;
>
> +       if (mem_cgroup_disabled())
> +               return;
> +
>         if (!swapout) /* this was a swap cache but the swap is unused ! */
>                 ctype = MEM_CGROUP_CHARGE_TYPE_DROP;
>
> @@ -6364,6 +6393,59 @@ free_out:
>         return ERR_PTR(error);
>  }
>
> +static void memcg_update_root_statistics(void)
> +{
> +       int cpu;
> +       u64 pgin, pgout, faults, mjfaults;
> +
> +       pgin = pgout = faults = mjfaults = 0;
> +       for_each_online_cpu(cpu) {
> +               struct vm_event_state *ev = &per_cpu(vm_event_states, cpu);
> +               struct mem_cgroup_stat_cpu *memcg_stat;
> +
> +               memcg_stat = per_cpu_ptr(root_mem_cgroup->stat, cpu);
> +
> +               memcg_stat->events[MEM_CGROUP_EVENTS_PGPGIN] =
> +                                                       ev->event[PGPGIN];
> +               memcg_stat->events[MEM_CGROUP_EVENTS_PGPGOUT] =
> +                                                       ev->event[PGPGOUT];

ev->event[PGPGIN/PGPGOUT] is counted in block layer(submit_bio()) and
represents the exactly number of pagein/pageout, but memcg
PGPGIN/PGPGOUT events only count it as an event and ignore the page
size. So here we can't straightforward take the ev->events for use.

> +               memcg_stat->events[MEM_CGROUP_EVENTS_PGFAULT] =
> +                                                       ev->event[PGFAULT];
> +               memcg_stat->events[MEM_CGROUP_EVENTS_PGMAJFAULT] =
> +                                                       ev->event[PGMAJFAULT];
> +
> +               memcg_stat->nr_page_events = ev->event[PGPGIN] +
> +                                            ev->event[PGPGOUT];

There's no valid memcg->nr_page_events until now, so the threshold
notifier, but some people may use it even only root memcg exists.
Moreover, using PGPGIN + PGPGOUT(exactly number of pagein + pageout)
as nr_page_events is also inaccurate IMHO.

> +       }
> +
> +       root_mem_cgroup->nocpu_base.count[MEM_CGROUP_STAT_RSS] =
> +                               memcg_read_root_rss();
> +       root_mem_cgroup->nocpu_base.count[MEM_CGROUP_STAT_CACHE] =
> +                               atomic_long_read(&vm_stat[NR_FILE_PAGES]);
> +       root_mem_cgroup->nocpu_base.count[MEM_CGROUP_STAT_FILE_MAPPED] =
> +                               atomic_long_read(&vm_stat[NR_FILE_MAPPED]);
> +}
> +
> +static void memcg_update_root_lru(void)
> +{
> +       struct zone *zone;
> +       struct lruvec *lruvec;
> +       struct mem_cgroup_per_zone *mz;
> +       enum lru_list lru;
> +
> +       for_each_populated_zone(zone) {
> +               spin_lock_irq(&zone->lru_lock);
> +               lruvec = &zone->lruvec;
> +               mz = mem_cgroup_zoneinfo(root_mem_cgroup,
> +                               zone_to_nid(zone), zone_idx(zone));
> +
> +               for (lru = LRU_BASE; lru < NR_LRU_LISTS; lru++)
> +                       mz->lru_size[lru] =
> +                               zone_page_state(zone, NR_LRU_BASE + lru);
> +               spin_unlock_irq(&zone->lru_lock);
> +       }
> +}
> +
>  static int
>  mem_cgroup_css_online(struct cgroup *cont)
>  {
> @@ -6407,6 +6489,66 @@ mem_cgroup_css_online(struct cgroup *cont)
>         }
>
>         error = memcg_init_kmem(memcg, &mem_cgroup_subsys);
> +
> +       if (!error) {
> +               static_key_slow_inc(&memcg_in_use_key);
> +               /*
> +                * The strategy to avoid races here is to let the charges just
> +                * be globally made until we lock the res counter. Since we are
> +                * copying charges from global statistics, it doesn't really
> +                * matter when we do it, as long as we are consistent. So even
> +                * after the code is patched in, they will continue being
> +                * globally charged due to memcg_charges_allowed being set to
> +                * false.
> +                *
> +                * Once we hold the res counter lock, though, we can already
> +                * safely flip it: We will go through with the charging to the
> +                * root memcg, but won't be able to actually charge it: we have
> +                * the lock.
> +                *
> +                * This works because the mm stats are only updated after the
> +                * memcg charging suceeds. If we block the charge by holding
> +                * the res_counter lock, no other charges will happen in the
> +                * system until we release it.
> +                *
> +                * manipulation always safe because the write side is always
> +                * under the memcg_mutex.
> +                */
> +               if (!memcg_charges_allowed) {
> +                       struct zone *zone;
> +
> +                       get_online_cpus();
> +                       spin_lock(&root_mem_cgroup->res.lock);
> +
> +                       memcg_charges_allowed = true;
> +
> +                       root_mem_cgroup->res.usage =
> +                               mem_cgroup_read_root(RES_USAGE, _MEM);
> +                       root_mem_cgroup->memsw.usage =
> +                               mem_cgroup_read_root(RES_USAGE, _MEMSWAP);
> +                       /*
> +                        * The max usage figure is not entirely accurate. The
> +                        * memory may have been higher in the past. But since
> +                        * we don't track that globally, this is the best we
> +                        * can do.
> +                        */
> +                       root_mem_cgroup->res.max_usage =
> +                                       root_mem_cgroup->res.usage;
> +                       root_mem_cgroup->memsw.max_usage =
> +                                       root_mem_cgroup->memsw.usage;
> +
> +                       memcg_update_root_statistics();
> +                       memcg_update_root_lru();
> +                       /*
> +                        * We are now 100 % consistent and all charges are
> +                        * transfered.  New charges should reach the
> +                        * res_counter directly.
> +                        */
> +                       spin_unlock(&root_mem_cgroup->res.lock);
> +                       put_online_cpus();
> +               }
> +       }
> +
>         mutex_unlock(&memcg_create_mutex);
>         if (error) {
>                 /*
> diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
> index 6d757e3..a5bd322 100644
> --- a/mm/page_cgroup.c
> +++ b/mm/page_cgroup.c
> @@ -68,7 +68,7 @@ void __init page_cgroup_init_flatmem(void)
>
>         int nid, fail;
>
> -       if (mem_cgroup_disabled())
> +       if (mem_cgroup_subsys_disabled())
>                 return;
>
>         for_each_online_node(nid)  {
> @@ -271,7 +271,7 @@ void __init page_cgroup_init(void)
>         unsigned long pfn;
>         int nid;
>
> -       if (mem_cgroup_disabled())
> +       if (mem_cgroup_subsys_disabled())
>                 return;
>
>         for_each_node_state(nid, N_MEMORY) {
> --
> 1.8.1.2
>



-- 
Thanks,
Sha

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/5] memcg: provide root figures from system totals
@ 2013-03-13  9:15                 ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 72+ messages in thread
From: Kamezawa Hiroyuki @ 2013-03-13  9:15 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: Glauber Costa, linux-mm, cgroups, Tejun Heo, Andrew Morton,
	Michal Hocko, anton.vorontsov, Johannes Weiner, Mel Gorman

(2013/03/13 15:58), Sha Zhengju wrote:
> On Wed, Mar 6, 2013 at 6:59 PM, Kamezawa Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> (2013/03/06 19:52), Glauber Costa wrote:
>>> On 03/06/2013 02:45 PM, Kamezawa Hiroyuki wrote:
>>>> (2013/03/06 17:30), Glauber Costa wrote:
>>>>> On 03/06/2013 04:27 AM, Kamezawa Hiroyuki wrote:
>>>>>> (2013/03/05 22:10), Glauber Costa wrote:
>>>>>>> + case _MEMSWAP: {
>>>>>>> +         struct sysinfo i;
>>>>>>> +         si_swapinfo(&i);
>>>>>>> +
>>>>>>> +         return ((memcg_read_root_rss() +
>>>>>>> +         atomic_long_read(&vm_stat[NR_FILE_PAGES])) << PAGE_SHIFT) +
>>>>>>> +         i.totalswap - i.freeswap;
>>>>>>
>>>>>> How swapcache is handled ? ...and How kmem works with this calc ?
>>>>>>
>>>>> I am ignoring kmem, because we don't account kmem for the root cgroup
>>>>> anyway.
>>>>>
>>>>> Setting the limit is invalid, and we don't account until the limit is
>>>>> set. Then it will be 0, always.
>>>>>
>>>>> For swapcache, I am hoping that totalswap - freeswap will cover
>>>>> everything swap related. If you think I am wrong, please enlighten me.
>>>>>
>>>>
>>>> i.totalswap - i.freeswap = # of used swap entries.
>>>>
>>>> SwapCache can be rss and used swap entry at the same time.
>>>>
>>>
>>> Well, yes, but the rss entries would be accounted for in get_mm_rss(),
>>> won't they ?
>>>
>>> What am I missing ?
>>
>>
>> I think the correct caluculation is
>>
>>    Sum of all RSS + All file caches + (i.total_swap - i.freeswap - # of mapped SwapCache)
>>
>>
>> In the patch, mapped SwapCache is counted as both of rss and swap.
>>
>
> After a quick look, swapcache is counted as file pages and meanwhile
> use a swap entry at the same time(__add_to{delete_from}_swap_cache()).
> Even though, I think we still do not need to exclude swapcache out,
> because it indeed uses two copy of resource: one is swap entry, one is
> cache, so the usage should count both of them in.
>
> What I think it matters is that swapcache may be counted as both file
> pages and rss(if it's a process's anonymous page), which we need to
> subtract # of swapcache to avoid double-counting. But it isn't always
> so: a shmem/tmpfs page may use swapcache and be counted as file pages
> but not a rss, then we can not subtract swapcache... Is there anything
> I lost?
>


Please don't think difficult. All pages for user/caches are counted in
LRU. All swap-entry usage can be cauht by total_swap_pages - nr_swap_pages.
We just need to subtract number of swap-cache which is double counted
as swap-entry and a page in LRU.

NR_ACTIVE_ANON + NR_INACTIVE_ANON + NR_ACTIVE_FILE + NR_INACTIVE_FILE
+ NR_UNEVICTABLE + total_swap_pages - nr_swap_pages - NR_SWAP_CACHE

is the number we whant for memsw.usage_in_bytes.

Thanks,
-Kame




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/5] memcg: provide root figures from system totals
@ 2013-03-13  9:15                 ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 72+ messages in thread
From: Kamezawa Hiroyuki @ 2013-03-13  9:15 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: Glauber Costa, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, Andrew Morton,
	Michal Hocko, anton.vorontsov-QSEj5FYQhm4dnm+yROfE0A,
	Johannes Weiner, Mel Gorman

(2013/03/13 15:58), Sha Zhengju wrote:
> On Wed, Mar 6, 2013 at 6:59 PM, Kamezawa Hiroyuki
> <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> wrote:
>> (2013/03/06 19:52), Glauber Costa wrote:
>>> On 03/06/2013 02:45 PM, Kamezawa Hiroyuki wrote:
>>>> (2013/03/06 17:30), Glauber Costa wrote:
>>>>> On 03/06/2013 04:27 AM, Kamezawa Hiroyuki wrote:
>>>>>> (2013/03/05 22:10), Glauber Costa wrote:
>>>>>>> + case _MEMSWAP: {
>>>>>>> +         struct sysinfo i;
>>>>>>> +         si_swapinfo(&i);
>>>>>>> +
>>>>>>> +         return ((memcg_read_root_rss() +
>>>>>>> +         atomic_long_read(&vm_stat[NR_FILE_PAGES])) << PAGE_SHIFT) +
>>>>>>> +         i.totalswap - i.freeswap;
>>>>>>
>>>>>> How swapcache is handled ? ...and How kmem works with this calc ?
>>>>>>
>>>>> I am ignoring kmem, because we don't account kmem for the root cgroup
>>>>> anyway.
>>>>>
>>>>> Setting the limit is invalid, and we don't account until the limit is
>>>>> set. Then it will be 0, always.
>>>>>
>>>>> For swapcache, I am hoping that totalswap - freeswap will cover
>>>>> everything swap related. If you think I am wrong, please enlighten me.
>>>>>
>>>>
>>>> i.totalswap - i.freeswap = # of used swap entries.
>>>>
>>>> SwapCache can be rss and used swap entry at the same time.
>>>>
>>>
>>> Well, yes, but the rss entries would be accounted for in get_mm_rss(),
>>> won't they ?
>>>
>>> What am I missing ?
>>
>>
>> I think the correct caluculation is
>>
>>    Sum of all RSS + All file caches + (i.total_swap - i.freeswap - # of mapped SwapCache)
>>
>>
>> In the patch, mapped SwapCache is counted as both of rss and swap.
>>
>
> After a quick look, swapcache is counted as file pages and meanwhile
> use a swap entry at the same time(__add_to{delete_from}_swap_cache()).
> Even though, I think we still do not need to exclude swapcache out,
> because it indeed uses two copy of resource: one is swap entry, one is
> cache, so the usage should count both of them in.
>
> What I think it matters is that swapcache may be counted as both file
> pages and rss(if it's a process's anonymous page), which we need to
> subtract # of swapcache to avoid double-counting. But it isn't always
> so: a shmem/tmpfs page may use swapcache and be counted as file pages
> but not a rss, then we can not subtract swapcache... Is there anything
> I lost?
>


Please don't think difficult. All pages for user/caches are counted in
LRU. All swap-entry usage can be cauht by total_swap_pages - nr_swap_pages.
We just need to subtract number of swap-cache which is double counted
as swap-entry and a page in LRU.

NR_ACTIVE_ANON + NR_INACTIVE_ANON + NR_ACTIVE_FILE + NR_INACTIVE_FILE
+ NR_UNEVICTABLE + total_swap_pages - nr_swap_pages - NR_SWAP_CACHE

is the number we whant for memsw.usage_in_bytes.

Thanks,
-Kame




^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/5] memcg: provide root figures from system totals
  2013-03-13  9:15                 ` Kamezawa Hiroyuki
  (?)
@ 2013-03-13  9:59                 ` Sha Zhengju
  2013-03-14  0:03                     ` Kamezawa Hiroyuki
  -1 siblings, 1 reply; 72+ messages in thread
From: Sha Zhengju @ 2013-03-13  9:59 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Glauber Costa, linux-mm, cgroups, Tejun Heo, Andrew Morton,
	Michal Hocko, anton.vorontsov, Johannes Weiner, Mel Gorman

On Wed, Mar 13, 2013 at 5:15 PM, Kamezawa Hiroyuki
<kamezawa.hiroyu@jp.fujitsu.com> wrote:
> (2013/03/13 15:58), Sha Zhengju wrote:
>>
>> On Wed, Mar 6, 2013 at 6:59 PM, Kamezawa Hiroyuki
>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>>
>>> (2013/03/06 19:52), Glauber Costa wrote:
>>>>
>>>> On 03/06/2013 02:45 PM, Kamezawa Hiroyuki wrote:
>>>>>
>>>>> (2013/03/06 17:30), Glauber Costa wrote:
>>>>>>
>>>>>> On 03/06/2013 04:27 AM, Kamezawa Hiroyuki wrote:
>>>>>>>
>>>>>>> (2013/03/05 22:10), Glauber Costa wrote:
>>>>>>>>
>>>>>>>> + case _MEMSWAP: {
>>>>>>>> +         struct sysinfo i;
>>>>>>>> +         si_swapinfo(&i);
>>>>>>>> +
>>>>>>>> +         return ((memcg_read_root_rss() +
>>>>>>>> +         atomic_long_read(&vm_stat[NR_FILE_PAGES])) << PAGE_SHIFT)
>>>>>>>> +
>>>>>>>> +         i.totalswap - i.freeswap;
>>>>>>>
>>>>>>>
>>>>>>> How swapcache is handled ? ...and How kmem works with this calc ?
>>>>>>>
>>>>>> I am ignoring kmem, because we don't account kmem for the root cgroup
>>>>>> anyway.
>>>>>>
>>>>>> Setting the limit is invalid, and we don't account until the limit is
>>>>>> set. Then it will be 0, always.
>>>>>>
>>>>>> For swapcache, I am hoping that totalswap - freeswap will cover
>>>>>> everything swap related. If you think I am wrong, please enlighten me.
>>>>>>
>>>>>
>>>>> i.totalswap - i.freeswap = # of used swap entries.
>>>>>
>>>>> SwapCache can be rss and used swap entry at the same time.
>>>>>
>>>>
>>>> Well, yes, but the rss entries would be accounted for in get_mm_rss(),
>>>> won't they ?
>>>>
>>>> What am I missing ?
>>>
>>>
>>>
>>> I think the correct caluculation is
>>>
>>>    Sum of all RSS + All file caches + (i.total_swap - i.freeswap - # of
>>> mapped SwapCache)
>>>
>>>
>>> In the patch, mapped SwapCache is counted as both of rss and swap.
>>>
>>
>> After a quick look, swapcache is counted as file pages and meanwhile
>> use a swap entry at the same time(__add_to{delete_from}_swap_cache()).
>> Even though, I think we still do not need to exclude swapcache out,
>> because it indeed uses two copy of resource: one is swap entry, one is
>> cache, so the usage should count both of them in.
>>
>> What I think it matters is that swapcache may be counted as both file
>> pages and rss(if it's a process's anonymous page), which we need to
>> subtract # of swapcache to avoid double-counting. But it isn't always
>> so: a shmem/tmpfs page may use swapcache and be counted as file pages
>> but not a rss, then we can not subtract swapcache... Is there anything
>> I lost?
>>
>
>
> Please don't think difficult. All pages for user/caches are counted in
> LRU. All swap-entry usage can be cauht by total_swap_pages - nr_swap_pages.
> We just need to subtract number of swap-cache which is double counted
> as swap-entry and a page in LRU.
>
> NR_ACTIVE_ANON + NR_INACTIVE_ANON + NR_ACTIVE_FILE + NR_INACTIVE_FILE
> + NR_UNEVICTABLE + total_swap_pages - nr_swap_pages - NR_SWAP_CACHE
>

Using LRU numbers is more suitable. But forgive me, I still doubt
whether we should subtract NR_SWAP_CACHE out because it uses both a
swap entry and a page cache and it isn't a real double counting.


Thanks,
Sha

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/5] memcg: provide root figures from system totals
  2013-03-13  9:59                 ` Sha Zhengju
@ 2013-03-14  0:03                     ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 72+ messages in thread
From: Kamezawa Hiroyuki @ 2013-03-14  0:03 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: Glauber Costa, linux-mm, cgroups, Tejun Heo, Andrew Morton,
	Michal Hocko, anton.vorontsov, Johannes Weiner, Mel Gorman

(2013/03/13 18:59), Sha Zhengju wrote:
> On Wed, Mar 13, 2013 at 5:15 PM, Kamezawa Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> (2013/03/13 15:58), Sha Zhengju wrote:
>>>
>>> On Wed, Mar 6, 2013 at 6:59 PM, Kamezawa Hiroyuki
>>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>>>
>>>> (2013/03/06 19:52), Glauber Costa wrote:
>>>>>
>>>>> On 03/06/2013 02:45 PM, Kamezawa Hiroyuki wrote:
>>>>>>
>>>>>> (2013/03/06 17:30), Glauber Costa wrote:
>>>>>>>
>>>>>>> On 03/06/2013 04:27 AM, Kamezawa Hiroyuki wrote:
>>>>>>>>
>>>>>>>> (2013/03/05 22:10), Glauber Costa wrote:
>>>>>>>>>
>>>>>>>>> + case _MEMSWAP: {
>>>>>>>>> +         struct sysinfo i;
>>>>>>>>> +         si_swapinfo(&i);
>>>>>>>>> +
>>>>>>>>> +         return ((memcg_read_root_rss() +
>>>>>>>>> +         atomic_long_read(&vm_stat[NR_FILE_PAGES])) << PAGE_SHIFT)
>>>>>>>>> +
>>>>>>>>> +         i.totalswap - i.freeswap;
>>>>>>>>
>>>>>>>>
>>>>>>>> How swapcache is handled ? ...and How kmem works with this calc ?
>>>>>>>>
>>>>>>> I am ignoring kmem, because we don't account kmem for the root cgroup
>>>>>>> anyway.
>>>>>>>
>>>>>>> Setting the limit is invalid, and we don't account until the limit is
>>>>>>> set. Then it will be 0, always.
>>>>>>>
>>>>>>> For swapcache, I am hoping that totalswap - freeswap will cover
>>>>>>> everything swap related. If you think I am wrong, please enlighten me.
>>>>>>>
>>>>>>
>>>>>> i.totalswap - i.freeswap = # of used swap entries.
>>>>>>
>>>>>> SwapCache can be rss and used swap entry at the same time.
>>>>>>
>>>>>
>>>>> Well, yes, but the rss entries would be accounted for in get_mm_rss(),
>>>>> won't they ?
>>>>>
>>>>> What am I missing ?
>>>>
>>>>
>>>>
>>>> I think the correct caluculation is
>>>>
>>>>     Sum of all RSS + All file caches + (i.total_swap - i.freeswap - # of
>>>> mapped SwapCache)
>>>>
>>>>
>>>> In the patch, mapped SwapCache is counted as both of rss and swap.
>>>>
>>>
>>> After a quick look, swapcache is counted as file pages and meanwhile
>>> use a swap entry at the same time(__add_to{delete_from}_swap_cache()).
>>> Even though, I think we still do not need to exclude swapcache out,
>>> because it indeed uses two copy of resource: one is swap entry, one is
>>> cache, so the usage should count both of them in.
>>>
>>> What I think it matters is that swapcache may be counted as both file
>>> pages and rss(if it's a process's anonymous page), which we need to
>>> subtract # of swapcache to avoid double-counting. But it isn't always
>>> so: a shmem/tmpfs page may use swapcache and be counted as file pages
>>> but not a rss, then we can not subtract swapcache... Is there anything
>>> I lost?
>>>
>>
>>
>> Please don't think difficult. All pages for user/caches are counted in
>> LRU. All swap-entry usage can be cauht by total_swap_pages - nr_swap_pages.
>> We just need to subtract number of swap-cache which is double counted
>> as swap-entry and a page in LRU.
>>
>> NR_ACTIVE_ANON + NR_INACTIVE_ANON + NR_ACTIVE_FILE + NR_INACTIVE_FILE
>> + NR_UNEVICTABLE + total_swap_pages - nr_swap_pages - NR_SWAP_CACHE
>>
>
> Using LRU numbers is more suitable. But forgive me, I still doubt
> whether we should subtract NR_SWAP_CACHE out because it uses both a
> swap entry and a page cache and it isn't a real double counting.
>

Used swap entry can be reclaimed if there are SwapCache on memory.

Thanks,
a? 1/4 Kame


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/5] memcg: provide root figures from system totals
@ 2013-03-14  0:03                     ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 72+ messages in thread
From: Kamezawa Hiroyuki @ 2013-03-14  0:03 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: Glauber Costa, linux-mm, cgroups, Tejun Heo, Andrew Morton,
	Michal Hocko, anton.vorontsov, Johannes Weiner, Mel Gorman

(2013/03/13 18:59), Sha Zhengju wrote:
> On Wed, Mar 13, 2013 at 5:15 PM, Kamezawa Hiroyuki
> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> (2013/03/13 15:58), Sha Zhengju wrote:
>>>
>>> On Wed, Mar 6, 2013 at 6:59 PM, Kamezawa Hiroyuki
>>> <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>>>>
>>>> (2013/03/06 19:52), Glauber Costa wrote:
>>>>>
>>>>> On 03/06/2013 02:45 PM, Kamezawa Hiroyuki wrote:
>>>>>>
>>>>>> (2013/03/06 17:30), Glauber Costa wrote:
>>>>>>>
>>>>>>> On 03/06/2013 04:27 AM, Kamezawa Hiroyuki wrote:
>>>>>>>>
>>>>>>>> (2013/03/05 22:10), Glauber Costa wrote:
>>>>>>>>>
>>>>>>>>> + case _MEMSWAP: {
>>>>>>>>> +         struct sysinfo i;
>>>>>>>>> +         si_swapinfo(&i);
>>>>>>>>> +
>>>>>>>>> +         return ((memcg_read_root_rss() +
>>>>>>>>> +         atomic_long_read(&vm_stat[NR_FILE_PAGES])) << PAGE_SHIFT)
>>>>>>>>> +
>>>>>>>>> +         i.totalswap - i.freeswap;
>>>>>>>>
>>>>>>>>
>>>>>>>> How swapcache is handled ? ...and How kmem works with this calc ?
>>>>>>>>
>>>>>>> I am ignoring kmem, because we don't account kmem for the root cgroup
>>>>>>> anyway.
>>>>>>>
>>>>>>> Setting the limit is invalid, and we don't account until the limit is
>>>>>>> set. Then it will be 0, always.
>>>>>>>
>>>>>>> For swapcache, I am hoping that totalswap - freeswap will cover
>>>>>>> everything swap related. If you think I am wrong, please enlighten me.
>>>>>>>
>>>>>>
>>>>>> i.totalswap - i.freeswap = # of used swap entries.
>>>>>>
>>>>>> SwapCache can be rss and used swap entry at the same time.
>>>>>>
>>>>>
>>>>> Well, yes, but the rss entries would be accounted for in get_mm_rss(),
>>>>> won't they ?
>>>>>
>>>>> What am I missing ?
>>>>
>>>>
>>>>
>>>> I think the correct caluculation is
>>>>
>>>>     Sum of all RSS + All file caches + (i.total_swap - i.freeswap - # of
>>>> mapped SwapCache)
>>>>
>>>>
>>>> In the patch, mapped SwapCache is counted as both of rss and swap.
>>>>
>>>
>>> After a quick look, swapcache is counted as file pages and meanwhile
>>> use a swap entry at the same time(__add_to{delete_from}_swap_cache()).
>>> Even though, I think we still do not need to exclude swapcache out,
>>> because it indeed uses two copy of resource: one is swap entry, one is
>>> cache, so the usage should count both of them in.
>>>
>>> What I think it matters is that swapcache may be counted as both file
>>> pages and rss(if it's a process's anonymous page), which we need to
>>> subtract # of swapcache to avoid double-counting. But it isn't always
>>> so: a shmem/tmpfs page may use swapcache and be counted as file pages
>>> but not a rss, then we can not subtract swapcache... Is there anything
>>> I lost?
>>>
>>
>>
>> Please don't think difficult. All pages for user/caches are counted in
>> LRU. All swap-entry usage can be cauht by total_swap_pages - nr_swap_pages.
>> We just need to subtract number of swap-cache which is double counted
>> as swap-entry and a page in LRU.
>>
>> NR_ACTIVE_ANON + NR_INACTIVE_ANON + NR_ACTIVE_FILE + NR_INACTIVE_FILE
>> + NR_UNEVICTABLE + total_swap_pages - nr_swap_pages - NR_SWAP_CACHE
>>
>
> Using LRU numbers is more suitable. But forgive me, I still doubt
> whether we should subtract NR_SWAP_CACHE out because it uses both a
> swap entry and a page cache and it isn't a real double counting.
>

Used swap entry can be reclaimed if there are SwapCache on memory.

Thanks,
ーKame


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 1/5] memcg: make nocpu_base available for non hotplug
  2013-03-05 13:10   ` Glauber Costa
  (?)
  (?)
@ 2013-03-19 11:07   ` Michal Hocko
  -1 siblings, 0 replies; 72+ messages in thread
From: Michal Hocko @ 2013-03-19 11:07 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, Tejun Heo, Andrew Morton, kamezawa.hiroyu,
	handai.szj, anton.vorontsov, Johannes Weiner

On Tue 05-03-13 17:10:54, Glauber Costa wrote:
> We are using nocpu_base to accumulate charges on the main counters
> during cpu hotplug. I have a similar need, which is transferring charges
> to the root cgroup when lazily enabling memcg. Because system wide
> information is not kept per-cpu, it is hard to distribute it. This field
> works well for this. So we need to make it available for all usages, not
> only hotplug cases.

Could you also rename it to something else while you are at it?
nocpu_base sounds outdated. What about overflow_base or something like
that.

I am also wondering why do wee need pcp_counter_lock there. Doesn't
get_online_cpus prevent from hotplug so mem_cgroup_drain_pcp_counter
doesn't get called? I am sorry for this stupid question but I am lost in
the hotplug callbacks...

Other than that I don't mind pulling nocpu_base outside the hotplug code
and reusing it for something else. So you can add my
Acked-by: Michal Hocko <mhocko@suse.cz>

but I would be happier with a better name of course ;)

> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tejun Heo <tj@kernel.org>
> ---
>  mm/memcontrol.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 669d16a..b8b363f 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -921,11 +921,11 @@ static long mem_cgroup_read_stat(struct mem_cgroup *memcg,
>  	get_online_cpus();
>  	for_each_online_cpu(cpu)
>  		val += per_cpu(memcg->stat->count[idx], cpu);
> -#ifdef CONFIG_HOTPLUG_CPU
> +
>  	spin_lock(&memcg->pcp_counter_lock);
>  	val += memcg->nocpu_base.count[idx];
>  	spin_unlock(&memcg->pcp_counter_lock);
> -#endif
> +
>  	put_online_cpus();
>  	return val;
>  }
> @@ -945,11 +945,11 @@ static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
>  
>  	for_each_online_cpu(cpu)
>  		val += per_cpu(memcg->stat->events[idx], cpu);
> -#ifdef CONFIG_HOTPLUG_CPU
> +
>  	spin_lock(&memcg->pcp_counter_lock);
>  	val += memcg->nocpu_base.events[idx];
>  	spin_unlock(&memcg->pcp_counter_lock);
> -#endif
> +
>  	return val;
>  }
>  
> -- 
> 1.8.1.2
> 
> --
> To unsubscribe from this list: send the line "unsubscribe cgroups" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/5] memcg: provide root figures from system totals
@ 2013-03-19 12:46     ` Michal Hocko
  0 siblings, 0 replies; 72+ messages in thread
From: Michal Hocko @ 2013-03-19 12:46 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, Tejun Heo, Andrew Morton, kamezawa.hiroyu,
	handai.szj, anton.vorontsov, Johannes Weiner, Mel Gorman

On Tue 05-03-13 17:10:55, Glauber Costa wrote:
> For the root memcg, there is no need to rely on the res_counters if hierarchy
> is enabled The sum of all mem cgroups plus the tasks in root itself, is
> necessarily the amount of memory used for the whole system. Since those figures
> are already kept somewhere anyway, we can just return them here, without too
> much hassle.
> 
> Limit and soft limit can't be set for the root cgroup, so they are left at
> RESOURCE_MAX. Failcnt is left at 0, because its actual meaning is how many
> times we failed allocations due to the limit being hit. We will fail
> allocations in the root cgroup, but the limit will never the reason.

I do not like this very much to be honest. It just adds more hackery...
Why cannot we simply not account if nr_cgroups == 1 and move relevant
global counters to the root at the moment when a first group is
created?
The patch aims at reducing an overhead when there there are no other
groups, right?

> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: Michal Hocko <mhocko@suse.cz>
> CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Johannes Weiner <hannes@cmpxchg.org>
> CC: Mel Gorman <mgorman@suse.de>
> CC: Andrew Morton <akpm@linux-foundation.org>
> ---
>  mm/memcontrol.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 64 insertions(+)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b8b363f..bfbf1c2 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4996,6 +4996,56 @@ static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
>  	return val << PAGE_SHIFT;
>  }
>  
> +static u64 memcg_read_root_rss(void)
> +{
> +	struct task_struct *p;
> +
> +	u64 rss = 0;
> +	read_lock(&tasklist_lock);
> +	for_each_process(p) {
> +		if (!p->mm)
> +			continue;
> +		task_lock(p);
> +		rss += get_mm_rss(p->mm);
> +		task_unlock(p);
> +	}
> +	read_unlock(&tasklist_lock);
> +	return rss;
> +}
> +
> +static u64 mem_cgroup_read_root(enum res_type type, int name)
> +{
> +	if (name == RES_LIMIT)
> +		return RESOURCE_MAX;
> +	if (name == RES_SOFT_LIMIT)
> +		return RESOURCE_MAX;
> +	if (name == RES_FAILCNT)
> +		return 0;
> +	if (name == RES_MAX_USAGE)
> +		return 0;
> +
> +	if (WARN_ON_ONCE(name != RES_USAGE))
> +		return 0;
> +
> +	switch (type) {
> +	case _MEM:
> +		return (memcg_read_root_rss() +
> +		atomic_long_read(&vm_stat[NR_FILE_PAGES])) << PAGE_SHIFT;
> +	case _MEMSWAP: {
> +		struct sysinfo i;
> +		si_swapinfo(&i);
> +
> +		return ((memcg_read_root_rss() +
> +		atomic_long_read(&vm_stat[NR_FILE_PAGES])) << PAGE_SHIFT) +
> +		i.totalswap - i.freeswap;
> +	}
> +	case _KMEM:
> +		return 0;
> +	default:
> +		BUG();
> +	};
> +}
> +
>  static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
>  			       struct file *file, char __user *buf,
>  			       size_t nbytes, loff_t *ppos)
> @@ -5012,6 +5062,19 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
>  	if (!do_swap_account && type == _MEMSWAP)
>  		return -EOPNOTSUPP;
>  
> +	/*
> +	 * If we have root-level hierarchy, we can be certain that the charges
> +	 * in root are always global. We can then bypass the root cgroup
> +	 * entirely in this case, hopefuly leading to less contention in the
> +	 * root res_counters. The charges presented after reading it will
> +	 * always be the global charges.
> +	 */
> +	if (mem_cgroup_disabled() ||
> +		(mem_cgroup_is_root(memcg) && memcg->use_hierarchy)) {
> +		val = mem_cgroup_read_root(type, name);
> +		goto root_bypass;
> +	}
> +
>  	switch (type) {
>  	case _MEM:
>  		if (name == RES_USAGE)
> @@ -5032,6 +5095,7 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
>  		BUG();
>  	}
>  
> +root_bypass:
>  	len = scnprintf(str, sizeof(str), "%llu\n", (unsigned long long)val);
>  	return simple_read_from_buffer(buf, nbytes, ppos, str, len);
>  }
> -- 
> 1.8.1.2
> 
> --
> To unsubscribe from this list: send the line "unsubscribe cgroups" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/5] memcg: provide root figures from system totals
@ 2013-03-19 12:46     ` Michal Hocko
  0 siblings, 0 replies; 72+ messages in thread
From: Michal Hocko @ 2013-03-19 12:46 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Tejun Heo, Andrew Morton, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	handai.szj-Re5JQEeQqe8AvxtiuMwx3w,
	anton.vorontsov-QSEj5FYQhm4dnm+yROfE0A, Johannes Weiner,
	Mel Gorman

On Tue 05-03-13 17:10:55, Glauber Costa wrote:
> For the root memcg, there is no need to rely on the res_counters if hierarchy
> is enabled The sum of all mem cgroups plus the tasks in root itself, is
> necessarily the amount of memory used for the whole system. Since those figures
> are already kept somewhere anyway, we can just return them here, without too
> much hassle.
> 
> Limit and soft limit can't be set for the root cgroup, so they are left at
> RESOURCE_MAX. Failcnt is left at 0, because its actual meaning is how many
> times we failed allocations due to the limit being hit. We will fail
> allocations in the root cgroup, but the limit will never the reason.

I do not like this very much to be honest. It just adds more hackery...
Why cannot we simply not account if nr_cgroups == 1 and move relevant
global counters to the root at the moment when a first group is
created?
The patch aims at reducing an overhead when there there are no other
groups, right?

> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> CC: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> CC: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
> CC: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
> ---
>  mm/memcontrol.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 64 insertions(+)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b8b363f..bfbf1c2 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4996,6 +4996,56 @@ static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
>  	return val << PAGE_SHIFT;
>  }
>  
> +static u64 memcg_read_root_rss(void)
> +{
> +	struct task_struct *p;
> +
> +	u64 rss = 0;
> +	read_lock(&tasklist_lock);
> +	for_each_process(p) {
> +		if (!p->mm)
> +			continue;
> +		task_lock(p);
> +		rss += get_mm_rss(p->mm);
> +		task_unlock(p);
> +	}
> +	read_unlock(&tasklist_lock);
> +	return rss;
> +}
> +
> +static u64 mem_cgroup_read_root(enum res_type type, int name)
> +{
> +	if (name == RES_LIMIT)
> +		return RESOURCE_MAX;
> +	if (name == RES_SOFT_LIMIT)
> +		return RESOURCE_MAX;
> +	if (name == RES_FAILCNT)
> +		return 0;
> +	if (name == RES_MAX_USAGE)
> +		return 0;
> +
> +	if (WARN_ON_ONCE(name != RES_USAGE))
> +		return 0;
> +
> +	switch (type) {
> +	case _MEM:
> +		return (memcg_read_root_rss() +
> +		atomic_long_read(&vm_stat[NR_FILE_PAGES])) << PAGE_SHIFT;
> +	case _MEMSWAP: {
> +		struct sysinfo i;
> +		si_swapinfo(&i);
> +
> +		return ((memcg_read_root_rss() +
> +		atomic_long_read(&vm_stat[NR_FILE_PAGES])) << PAGE_SHIFT) +
> +		i.totalswap - i.freeswap;
> +	}
> +	case _KMEM:
> +		return 0;
> +	default:
> +		BUG();
> +	};
> +}
> +
>  static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
>  			       struct file *file, char __user *buf,
>  			       size_t nbytes, loff_t *ppos)
> @@ -5012,6 +5062,19 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
>  	if (!do_swap_account && type == _MEMSWAP)
>  		return -EOPNOTSUPP;
>  
> +	/*
> +	 * If we have root-level hierarchy, we can be certain that the charges
> +	 * in root are always global. We can then bypass the root cgroup
> +	 * entirely in this case, hopefuly leading to less contention in the
> +	 * root res_counters. The charges presented after reading it will
> +	 * always be the global charges.
> +	 */
> +	if (mem_cgroup_disabled() ||
> +		(mem_cgroup_is_root(memcg) && memcg->use_hierarchy)) {
> +		val = mem_cgroup_read_root(type, name);
> +		goto root_bypass;
> +	}
> +
>  	switch (type) {
>  	case _MEM:
>  		if (name == RES_USAGE)
> @@ -5032,6 +5095,7 @@ static ssize_t mem_cgroup_read(struct cgroup *cont, struct cftype *cft,
>  		BUG();
>  	}
>  
> +root_bypass:
>  	len = scnprintf(str, sizeof(str), "%llu\n", (unsigned long long)val);
>  	return simple_read_from_buffer(buf, nbytes, ppos, str, len);
>  }
> -- 
> 1.8.1.2
> 
> --
> To unsubscribe from this list: send the line "unsubscribe cgroups" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/5] memcg: provide root figures from system totals
@ 2013-03-19 12:55       ` Michal Hocko
  0 siblings, 0 replies; 72+ messages in thread
From: Michal Hocko @ 2013-03-19 12:55 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, Tejun Heo, Andrew Morton, kamezawa.hiroyu,
	handai.szj, anton.vorontsov, Johannes Weiner, Mel Gorman

On Tue 19-03-13 13:46:50, Michal Hocko wrote:
> On Tue 05-03-13 17:10:55, Glauber Costa wrote:
> > For the root memcg, there is no need to rely on the res_counters if hierarchy
> > is enabled The sum of all mem cgroups plus the tasks in root itself, is
> > necessarily the amount of memory used for the whole system. Since those figures
> > are already kept somewhere anyway, we can just return them here, without too
> > much hassle.
> > 
> > Limit and soft limit can't be set for the root cgroup, so they are left at
> > RESOURCE_MAX. Failcnt is left at 0, because its actual meaning is how many
> > times we failed allocations due to the limit being hit. We will fail
> > allocations in the root cgroup, but the limit will never the reason.
> 
> I do not like this very much to be honest. It just adds more hackery...
> Why cannot we simply not account if nr_cgroups == 1 and move relevant
> global counters to the root at the moment when a first group is
> created?

OK, it seems that the very next patch does what I was looking for. So
why all the churn in this patch?
Why do you want to make root even more special?
[...]
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/5] memcg: provide root figures from system totals
@ 2013-03-19 12:55       ` Michal Hocko
  0 siblings, 0 replies; 72+ messages in thread
From: Michal Hocko @ 2013-03-19 12:55 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Tejun Heo, Andrew Morton, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	handai.szj-Re5JQEeQqe8AvxtiuMwx3w,
	anton.vorontsov-QSEj5FYQhm4dnm+yROfE0A, Johannes Weiner,
	Mel Gorman

On Tue 19-03-13 13:46:50, Michal Hocko wrote:
> On Tue 05-03-13 17:10:55, Glauber Costa wrote:
> > For the root memcg, there is no need to rely on the res_counters if hierarchy
> > is enabled The sum of all mem cgroups plus the tasks in root itself, is
> > necessarily the amount of memory used for the whole system. Since those figures
> > are already kept somewhere anyway, we can just return them here, without too
> > much hassle.
> > 
> > Limit and soft limit can't be set for the root cgroup, so they are left at
> > RESOURCE_MAX. Failcnt is left at 0, because its actual meaning is how many
> > times we failed allocations due to the limit being hit. We will fail
> > allocations in the root cgroup, but the limit will never the reason.
> 
> I do not like this very much to be honest. It just adds more hackery...
> Why cannot we simply not account if nr_cgroups == 1 and move relevant
> global counters to the root at the moment when a first group is
> created?

OK, it seems that the very next patch does what I was looking for. So
why all the churn in this patch?
Why do you want to make root even more special?
[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 3/5] memcg: make it suck faster
@ 2013-03-19 13:58     ` Michal Hocko
  0 siblings, 0 replies; 72+ messages in thread
From: Michal Hocko @ 2013-03-19 13:58 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, Tejun Heo, Andrew Morton, kamezawa.hiroyu,
	handai.szj, anton.vorontsov, Johannes Weiner, Mel Gorman

On Tue 05-03-13 17:10:56, Glauber Costa wrote:
> It is an accepted fact that memcg sucks. But can it suck faster?  Or in
> a more fair statement, can it at least stop draining everyone's
> performance when it is not in use?
> 
> This experimental and slightly crude patch demonstrates that we can do
> that by using static branches to patch it out until the first memcg
> comes to life. There are edges to be trimmed, and I appreciate comments
> for direction. In particular, the events in the root are not fired, but
> I believe this can be done without further problems by calling a
> specialized event check from mem_cgroup_newpage_charge().
> 
> My goal was to have enough numbers to demonstrate the performance gain
> that can come from it. I tested it in a 24-way 2-socket Intel box, 24 Gb
> mem. I used Mel Gorman's pft test, that he used to demonstrate this
> problem back in the Kernel Summit. There are three kernels:
> 
> nomemcg  : memcg compile disabled.
> base     : memcg enabled, patch not applied.
> bypassed : memcg enabled, with patch applied.
> 
>                 base    bypassed
> User          109.12      105.64
> System       1646.84     1597.98
> Elapsed       229.56      215.76
> 
>              nomemcg    bypassed
> User          104.35      105.64
> System       1578.19     1597.98
> Elapsed       212.33      215.76

Do you have profiles for where we spend the time?

> So as one can see, the difference between base and nomemcg in terms
> of both system time and elapsed time is quite drastic, and consistent
> with the figures shown by Mel Gorman in the Kernel summit. This is a
> ~ 7 % drop in performance, just by having memcg enabled. memcg functions
> appear heavily in the profiles, even if all tasks lives in the root
> memcg.
> 
> With bypassed kernel, we drop this down to 1.5 %, which starts to fall
> in the acceptable range. More investigation is needed to see if we can
> claim that last percent back, but I believe at last part of it should
> be.
> 
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: Michal Hocko <mhocko@suse.cz>
> CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Johannes Weiner <hannes@cmpxchg.org>
> CC: Mel Gorman <mgorman@suse.de>
> CC: Andrew Morton <akpm@linux-foundation.org>
> ---
>  include/linux/memcontrol.h |  72 ++++++++++++++++----
>  mm/memcontrol.c            | 166 +++++++++++++++++++++++++++++++++++++++++----
>  mm/page_cgroup.c           |   4 +-
>  3 files changed, 216 insertions(+), 26 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index d6183f0..009f925 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -42,6 +42,26 @@ struct mem_cgroup_reclaim_cookie {
>  };
>  
>  #ifdef CONFIG_MEMCG
> +extern struct static_key memcg_in_use_key;
> +
> +static inline bool mem_cgroup_subsys_disabled(void)
> +{
> +	return !!mem_cgroup_subsys.disabled;
> +}
> +
> +static inline bool mem_cgroup_disabled(void)
> +{
> +	/*
> +	 * Will always be false if subsys is disabled, because we have no one
> +	 * to bump it up. So the test suffices and we don't have to test the
> +	 * subsystem as well
> +	 */

but static_key_false adds an atomic read here which is more costly so I
am not sure you are optimizing much.

> +	if (!static_key_false(&memcg_in_use_key))
> +		return true;
> +	return false;
> +}
> +
> +
>  /*
>   * All "charge" functions with gfp_mask should use GFP_KERNEL or
>   * (gfp_mask & GFP_RECLAIM_MASK). In current implementatin, memcg doesn't
> @@ -53,8 +73,18 @@ struct mem_cgroup_reclaim_cookie {
>   * (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
>   */
>  
> -extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
> +extern int __mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
>  				gfp_t gfp_mask);
> +
> +static inline int
> +mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
> +			  gfp_t gfp_mask)
> +{
> +	if (mem_cgroup_disabled())
> +		return 0;
> +	return __mem_cgroup_newpage_charge(page, mm, gfp_mask);
> +}
> +
>  /* for swap handling */
>  extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
>  		struct page *page, gfp_t mask, struct mem_cgroup **memcgp);
> @@ -62,8 +92,17 @@ extern void mem_cgroup_commit_charge_swapin(struct page *page,
>  					struct mem_cgroup *memcg);
>  extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);
>  
> -extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> -					gfp_t gfp_mask);
> +
> +extern int __mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> +				     gfp_t gfp_mask);
> +static inline int
> +mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
> +{
> +	if (mem_cgroup_disabled())
> +		return 0;
> +
> +	return __mem_cgroup_cache_charge(page, mm, gfp_mask);
> +}

Are there any reasons to not get down to __mem_cgroup_try_charge? We
will not be perfect, all right, because some wrappers already do some
work but we should at least cover most of them.

I am also thinking whether this stab at charging path is not just an
overkill. Wouldn't it suffice to do something like:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f608546..b70e8f6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2707,7 +2707,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 	 * thread group leader migrates. It's possible that mm is not
 	 * set, if so charge the root memcg (happens for pagecache usage).
 	 */
-	if (!*ptr && !mm)
+	if (!*ptr && (!mm || !static_key_false(&memcg_in_use_key)))
 		*ptr = root_mem_cgroup;
 again:
 	if (*ptr) { /* css should be a valid one */

We should get rid of the biggest overhead, no?

>  struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
>  struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
[...]
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index bfbf1c2..45c1886 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
[...]
> @@ -1335,6 +1345,20 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
>  	memcg = pc->mem_cgroup;

I would expect that you want to prevent lookup as well if there are no
other groups.

>  	/*
> +	 * Because we lazily enable memcg only after first child group is
> +	 * created, we can have memcg == 0. Because page cgroup is created with
> +	 * GFP_ZERO, and after charging, all page cgroups will have a non-zero
> +	 * cgroup attached (even if root), we can be sure that this is a
> +	 * used-but-not-accounted page. (due to lazyness). We could get around
> +	 * that by scanning all pages on cgroup init is too expensive. We can
> +	 * ultimately pay, but prefer to just to defer the update until we get
> +	 * here. We could take the opportunity to set PageCgroupUsed, but it
> +	 * won't be that important for the root cgroup.
> +	 */
> +	if (!memcg && PageLRU(page))
> +		pc->mem_cgroup = memcg = root_mem_cgroup;

Why not return page_cgroup_zoneinfo(root_mem_cgroup, page);
This would require messing up with __mem_cgroup_uncharge_common but that
doesn't sound incredibly crazy (to the local standard of course ;)).

> +
> +	/*
>  	 * Surreptitiously switch any uncharged offlist page to root:
>  	 * an uncharged page off lru does nothing to secure
>  	 * its former mem_cgroup from sudden removal.
[...]
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 3/5] memcg: make it suck faster
@ 2013-03-19 13:58     ` Michal Hocko
  0 siblings, 0 replies; 72+ messages in thread
From: Michal Hocko @ 2013-03-19 13:58 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Tejun Heo, Andrew Morton, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	handai.szj-Re5JQEeQqe8AvxtiuMwx3w,
	anton.vorontsov-QSEj5FYQhm4dnm+yROfE0A, Johannes Weiner,
	Mel Gorman

On Tue 05-03-13 17:10:56, Glauber Costa wrote:
> It is an accepted fact that memcg sucks. But can it suck faster?  Or in
> a more fair statement, can it at least stop draining everyone's
> performance when it is not in use?
> 
> This experimental and slightly crude patch demonstrates that we can do
> that by using static branches to patch it out until the first memcg
> comes to life. There are edges to be trimmed, and I appreciate comments
> for direction. In particular, the events in the root are not fired, but
> I believe this can be done without further problems by calling a
> specialized event check from mem_cgroup_newpage_charge().
> 
> My goal was to have enough numbers to demonstrate the performance gain
> that can come from it. I tested it in a 24-way 2-socket Intel box, 24 Gb
> mem. I used Mel Gorman's pft test, that he used to demonstrate this
> problem back in the Kernel Summit. There are three kernels:
> 
> nomemcg  : memcg compile disabled.
> base     : memcg enabled, patch not applied.
> bypassed : memcg enabled, with patch applied.
> 
>                 base    bypassed
> User          109.12      105.64
> System       1646.84     1597.98
> Elapsed       229.56      215.76
> 
>              nomemcg    bypassed
> User          104.35      105.64
> System       1578.19     1597.98
> Elapsed       212.33      215.76

Do you have profiles for where we spend the time?

> So as one can see, the difference between base and nomemcg in terms
> of both system time and elapsed time is quite drastic, and consistent
> with the figures shown by Mel Gorman in the Kernel summit. This is a
> ~ 7 % drop in performance, just by having memcg enabled. memcg functions
> appear heavily in the profiles, even if all tasks lives in the root
> memcg.
> 
> With bypassed kernel, we drop this down to 1.5 %, which starts to fall
> in the acceptable range. More investigation is needed to see if we can
> claim that last percent back, but I believe at last part of it should
> be.
> 
> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> CC: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> CC: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
> CC: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
> ---
>  include/linux/memcontrol.h |  72 ++++++++++++++++----
>  mm/memcontrol.c            | 166 +++++++++++++++++++++++++++++++++++++++++----
>  mm/page_cgroup.c           |   4 +-
>  3 files changed, 216 insertions(+), 26 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index d6183f0..009f925 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -42,6 +42,26 @@ struct mem_cgroup_reclaim_cookie {
>  };
>  
>  #ifdef CONFIG_MEMCG
> +extern struct static_key memcg_in_use_key;
> +
> +static inline bool mem_cgroup_subsys_disabled(void)
> +{
> +	return !!mem_cgroup_subsys.disabled;
> +}
> +
> +static inline bool mem_cgroup_disabled(void)
> +{
> +	/*
> +	 * Will always be false if subsys is disabled, because we have no one
> +	 * to bump it up. So the test suffices and we don't have to test the
> +	 * subsystem as well
> +	 */

but static_key_false adds an atomic read here which is more costly so I
am not sure you are optimizing much.

> +	if (!static_key_false(&memcg_in_use_key))
> +		return true;
> +	return false;
> +}
> +
> +
>  /*
>   * All "charge" functions with gfp_mask should use GFP_KERNEL or
>   * (gfp_mask & GFP_RECLAIM_MASK). In current implementatin, memcg doesn't
> @@ -53,8 +73,18 @@ struct mem_cgroup_reclaim_cookie {
>   * (Of course, if memcg does memory allocation in future, GFP_KERNEL is sane.)
>   */
>  
> -extern int mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
> +extern int __mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
>  				gfp_t gfp_mask);
> +
> +static inline int
> +mem_cgroup_newpage_charge(struct page *page, struct mm_struct *mm,
> +			  gfp_t gfp_mask)
> +{
> +	if (mem_cgroup_disabled())
> +		return 0;
> +	return __mem_cgroup_newpage_charge(page, mm, gfp_mask);
> +}
> +
>  /* for swap handling */
>  extern int mem_cgroup_try_charge_swapin(struct mm_struct *mm,
>  		struct page *page, gfp_t mask, struct mem_cgroup **memcgp);
> @@ -62,8 +92,17 @@ extern void mem_cgroup_commit_charge_swapin(struct page *page,
>  					struct mem_cgroup *memcg);
>  extern void mem_cgroup_cancel_charge_swapin(struct mem_cgroup *memcg);
>  
> -extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> -					gfp_t gfp_mask);
> +
> +extern int __mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> +				     gfp_t gfp_mask);
> +static inline int
> +mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
> +{
> +	if (mem_cgroup_disabled())
> +		return 0;
> +
> +	return __mem_cgroup_cache_charge(page, mm, gfp_mask);
> +}

Are there any reasons to not get down to __mem_cgroup_try_charge? We
will not be perfect, all right, because some wrappers already do some
work but we should at least cover most of them.

I am also thinking whether this stab at charging path is not just an
overkill. Wouldn't it suffice to do something like:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f608546..b70e8f6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2707,7 +2707,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 	 * thread group leader migrates. It's possible that mm is not
 	 * set, if so charge the root memcg (happens for pagecache usage).
 	 */
-	if (!*ptr && !mm)
+	if (!*ptr && (!mm || !static_key_false(&memcg_in_use_key)))
 		*ptr = root_mem_cgroup;
 again:
 	if (*ptr) { /* css should be a valid one */

We should get rid of the biggest overhead, no?

>  struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
>  struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
[...]
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index bfbf1c2..45c1886 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
[...]
> @@ -1335,6 +1345,20 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
>  	memcg = pc->mem_cgroup;

I would expect that you want to prevent lookup as well if there are no
other groups.

>  	/*
> +	 * Because we lazily enable memcg only after first child group is
> +	 * created, we can have memcg == 0. Because page cgroup is created with
> +	 * GFP_ZERO, and after charging, all page cgroups will have a non-zero
> +	 * cgroup attached (even if root), we can be sure that this is a
> +	 * used-but-not-accounted page. (due to lazyness). We could get around
> +	 * that by scanning all pages on cgroup init is too expensive. We can
> +	 * ultimately pay, but prefer to just to defer the update until we get
> +	 * here. We could take the opportunity to set PageCgroupUsed, but it
> +	 * won't be that important for the root cgroup.
> +	 */
> +	if (!memcg && PageLRU(page))
> +		pc->mem_cgroup = memcg = root_mem_cgroup;

Why not return page_cgroup_zoneinfo(root_mem_cgroup, page);
This would require messing up with __mem_cgroup_uncharge_common but that
doesn't sound incredibly crazy (to the local standard of course ;)).

> +
> +	/*
>  	 * Surreptitiously switch any uncharged offlist page to root:
>  	 * an uncharged page off lru does nothing to secure
>  	 * its former mem_cgroup from sudden removal.
[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 4/5] memcg: do not call page_cgroup_init at system_boot
  2013-03-05 13:10   ` Glauber Costa
  (?)
  (?)
@ 2013-03-19 14:06   ` Michal Hocko
  -1 siblings, 0 replies; 72+ messages in thread
From: Michal Hocko @ 2013-03-19 14:06 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, Tejun Heo, Andrew Morton, kamezawa.hiroyu,
	handai.szj, anton.vorontsov, Johannes Weiner, Mel Gorman

On Tue 05-03-13 17:10:57, Glauber Costa wrote:
> If we are not using memcg, there is no reason why we should allocate
> this structure, that will be a memory waste at best. We can do better
> at least in the sparsemem case, and allocate it when the first cgroup
> is requested. It should now not panic on failure, and we have to handle
> this right.

lookup_page_cgroup needs a special handling as well. Callers are not
prepared to get NULL and the current code would even explode with
!CONFIG_DEBUG_VM.

Anyway, agreed with what Kame said. This is really hard to read. Would
it be possible to split it up somehow - sorry for not being more helpful
here...

> flatmem case is a bit more complicated, so that one is left out for
> the moment.
> 
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: Michal Hocko <mhocko@suse.cz>
> CC: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> CC: Johannes Weiner <hannes@cmpxchg.org>
> CC: Mel Gorman <mgorman@suse.de>
> CC: Andrew Morton <akpm@linux-foundation.org>
> ---
>  include/linux/page_cgroup.h |  28 +++++----
>  init/main.c                 |   2 -
>  mm/memcontrol.c             |   3 +-
>  mm/page_cgroup.c            | 150 ++++++++++++++++++++++++--------------------
>  4 files changed, 99 insertions(+), 84 deletions(-)
> 
[...]
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 3/5] memcg: make it suck faster
  2013-03-19 13:58     ` Michal Hocko
  (?)
@ 2013-03-20  7:00     ` Glauber Costa
  2013-03-20  8:13         ` Michal Hocko
  -1 siblings, 1 reply; 72+ messages in thread
From: Glauber Costa @ 2013-03-20  7:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, cgroups, Tejun Heo, Andrew Morton, kamezawa.hiroyu,
	handai.szj, anton.vorontsov, Johannes Weiner, Mel Gorman

Sorry all for taking a lot of time to reply to this. I've been really busy.

On 03/19/2013 05:58 PM, Michal Hocko wrote:
> On Tue 05-03-13 17:10:56, Glauber Costa wrote:
>> It is an accepted fact that memcg sucks. But can it suck faster?  Or in
>> a more fair statement, can it at least stop draining everyone's
>> performance when it is not in use?
>>
>> This experimental and slightly crude patch demonstrates that we can do
>> that by using static branches to patch it out until the first memcg
>> comes to life. There are edges to be trimmed, and I appreciate comments
>> for direction. In particular, the events in the root are not fired, but
>> I believe this can be done without further problems by calling a
>> specialized event check from mem_cgroup_newpage_charge().
>>
>> My goal was to have enough numbers to demonstrate the performance gain
>> that can come from it. I tested it in a 24-way 2-socket Intel box, 24 Gb
>> mem. I used Mel Gorman's pft test, that he used to demonstrate this
>> problem back in the Kernel Summit. There are three kernels:
>>
>> nomemcg  : memcg compile disabled.
>> base     : memcg enabled, patch not applied.
>> bypassed : memcg enabled, with patch applied.
>>
>>                 base    bypassed
>> User          109.12      105.64
>> System       1646.84     1597.98
>> Elapsed       229.56      215.76
>>
>>              nomemcg    bypassed
>> User          104.35      105.64
>> System       1578.19     1597.98
>> Elapsed       212.33      215.76
> 
> Do you have profiles for where we spend the time?
> 

I don't *have* in the sense that I never saved them, but it is easy to
grab. I've just run Mel's pft test with perf top -a in parallel, and
that was mostly the charge and uncharge functions being run.

>>  #ifdef CONFIG_MEMCG
>> +extern struct static_key memcg_in_use_key;
>> +
>> +static inline bool mem_cgroup_subsys_disabled(void)
>> +{
>> +	return !!mem_cgroup_subsys.disabled;
>> +}
>> +
>> +static inline bool mem_cgroup_disabled(void)
>> +{
>> +	/*
>> +	 * Will always be false if subsys is disabled, because we have no one
>> +	 * to bump it up. So the test suffices and we don't have to test the
>> +	 * subsystem as well
>> +	 */
> 
> but static_key_false adds an atomic read here which is more costly so I
> am not sure you are optimizing much.
> 

No it doesn't. You're missing the point of static branches: The code is
*patched out* until it is not used. So it adds a predictable
deterministic jump instruction to the false statement, and that's it
(hence their previous name 'jump label').

>> +
>> +extern int __mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
>> +				     gfp_t gfp_mask);
>> +static inline int
>> +mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
>> +{
>> +	if (mem_cgroup_disabled())
>> +		return 0;
>> +
>> +	return __mem_cgroup_cache_charge(page, mm, gfp_mask);
>> +}
> 
> Are there any reasons to not get down to __mem_cgroup_try_charge? We
> will not be perfect, all right, because some wrappers already do some
> work but we should at least cover most of them.
> 
> I am also thinking whether this stab at charging path is not just an
> overkill. Wouldn't it suffice to do something like:

I don't know. I could test. I just see no reason for that. Being able to
patch out code in the caller level means we'll not incur even a function
call. That's a generally accepted good thing to do in hot paths.
Specially given the fact that the memcg overhead seems not to be
concentrated in one single place, but as Christoph Lameter defined,
"death by a thousand cuts", I'd much rather not even pay the function
calls if I can avoid. If I introducing great complexity for that, fine,
I could trade off. But honestly, the patch gets bigger but that's it.

>>  struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
>>  struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
> [...]
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index bfbf1c2..45c1886 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
> [...]
>> @@ -1335,6 +1345,20 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
>>  	memcg = pc->mem_cgroup;
> 
> I would expect that you want to prevent lookup as well if there are no
> other groups.
>
well, that function in specific seems to be mostly called during
reclaim, where I wasn't terribly concerned about optimizations, unlike
the steady state functions.

>>  	/*
>> +	 * Because we lazily enable memcg only after first child group is
>> +	 * created, we can have memcg == 0. Because page cgroup is created with
>> +	 * GFP_ZERO, and after charging, all page cgroups will have a non-zero
>> +	 * cgroup attached (even if root), we can be sure that this is a
>> +	 * used-but-not-accounted page. (due to lazyness). We could get around
>> +	 * that by scanning all pages on cgroup init is too expensive. We can
>> +	 * ultimately pay, but prefer to just to defer the update until we get
>> +	 * here. We could take the opportunity to set PageCgroupUsed, but it
>> +	 * won't be that important for the root cgroup.
>> +	 */
>> +	if (!memcg && PageLRU(page))
>> +		pc->mem_cgroup = memcg = root_mem_cgroup;
> 
> Why not return page_cgroup_zoneinfo(root_mem_cgroup, page);
> This would require messing up with __mem_cgroup_uncharge_common but that
> doesn't sound incredibly crazy (to the local standard of course ;)).
> 

Could you clarify?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/5] memcg: provide root figures from system totals
  2013-03-19 12:55       ` Michal Hocko
  (?)
@ 2013-03-20  7:03       ` Glauber Costa
  2013-03-20  8:03           ` Michal Hocko
  -1 siblings, 1 reply; 72+ messages in thread
From: Glauber Costa @ 2013-03-20  7:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, cgroups, Tejun Heo, Andrew Morton, kamezawa.hiroyu,
	handai.szj, anton.vorontsov, Johannes Weiner, Mel Gorman

On 03/19/2013 04:55 PM, Michal Hocko wrote:
> On Tue 19-03-13 13:46:50, Michal Hocko wrote:
>> On Tue 05-03-13 17:10:55, Glauber Costa wrote:
>>> For the root memcg, there is no need to rely on the res_counters if hierarchy
>>> is enabled The sum of all mem cgroups plus the tasks in root itself, is
>>> necessarily the amount of memory used for the whole system. Since those figures
>>> are already kept somewhere anyway, we can just return them here, without too
>>> much hassle.
>>>
>>> Limit and soft limit can't be set for the root cgroup, so they are left at
>>> RESOURCE_MAX. Failcnt is left at 0, because its actual meaning is how many
>>> times we failed allocations due to the limit being hit. We will fail
>>> allocations in the root cgroup, but the limit will never the reason.
>>
>> I do not like this very much to be honest. It just adds more hackery...
>> Why cannot we simply not account if nr_cgroups == 1 and move relevant
>> global counters to the root at the moment when a first group is
>> created?
> 
> OK, it seems that the very next patch does what I was looking for. So
> why all the churn in this patch?
> Why do you want to make root even more special?

Because I am operating under the assumption that we want to handle that
transparently and keep things working. If you tell me: "Hey, reading
memory.usage_in_bytes from root should return 0!", then I can get rid of
that. The fact that I keep bypassing when hierarchy is present, it is
more of a reuse of the infrastructure since it's there anyway.

Also, I would like the root memcg to be usable, albeit cheap, for
projects like memory pressure notifications.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/5] memcg: provide root figures from system totals
  2013-03-19 12:46     ` Michal Hocko
  (?)
  (?)
@ 2013-03-20  7:04     ` Glauber Costa
  -1 siblings, 0 replies; 72+ messages in thread
From: Glauber Costa @ 2013-03-20  7:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, cgroups, Tejun Heo, Andrew Morton, kamezawa.hiroyu,
	handai.szj, anton.vorontsov, Johannes Weiner, Mel Gorman

On 03/19/2013 04:46 PM, Michal Hocko wrote:
> On Tue 05-03-13 17:10:55, Glauber Costa wrote:
>> For the root memcg, there is no need to rely on the res_counters if hierarchy
>> is enabled The sum of all mem cgroups plus the tasks in root itself, is
>> necessarily the amount of memory used for the whole system. Since those figures
>> are already kept somewhere anyway, we can just return them here, without too
>> much hassle.
>>
>> Limit and soft limit can't be set for the root cgroup, so they are left at
>> RESOURCE_MAX. Failcnt is left at 0, because its actual meaning is how many
>> times we failed allocations due to the limit being hit. We will fail
>> allocations in the root cgroup, but the limit will never the reason.
> 
> I do not like this very much to be honest. It just adds more hackery...
> Why cannot we simply not account if nr_cgroups == 1 and move relevant
> global counters to the root at the moment when a first group is
> created?
> The patch aims at reducing an overhead when there there are no other
> groups, right?
> 
You've already noted yourself that this is done in a later patch.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 3/5] memcg: make it suck faster
@ 2013-03-20  7:13       ` Glauber Costa
  0 siblings, 0 replies; 72+ messages in thread
From: Glauber Costa @ 2013-03-20  7:13 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, Tejun Heo, Andrew Morton, Michal Hocko,
	kamezawa.hiroyu, anton.vorontsov, Johannes Weiner, Mel Gorman

On 03/13/2013 12:08 PM, Sha Zhengju wrote:
>> +static void memcg_update_root_statistics(void)
>> > +{
>> > +       int cpu;
>> > +       u64 pgin, pgout, faults, mjfaults;
>> > +
>> > +       pgin = pgout = faults = mjfaults = 0;
>> > +       for_each_online_cpu(cpu) {
>> > +               struct vm_event_state *ev = &per_cpu(vm_event_states, cpu);
>> > +               struct mem_cgroup_stat_cpu *memcg_stat;
>> > +
>> > +               memcg_stat = per_cpu_ptr(root_mem_cgroup->stat, cpu);
>> > +
>> > +               memcg_stat->events[MEM_CGROUP_EVENTS_PGPGIN] =
>> > +                                                       ev->event[PGPGIN];
>> > +               memcg_stat->events[MEM_CGROUP_EVENTS_PGPGOUT] =
>> > +                                                       ev->event[PGPGOUT];
> ev->event[PGPGIN/PGPGOUT] is counted in block layer(submit_bio()) and
> represents the exactly number of pagein/pageout, but memcg
> PGPGIN/PGPGOUT events only count it as an event and ignore the page
> size. So here we can't straightforward take the ev->events for use.
> 
You are right about that. Although I can't think of a straightforward
way to handle this. Well, except for the obvious of adding another
global statistic.

>> > +               memcg_stat->events[MEM_CGROUP_EVENTS_PGFAULT] =
>> > +                                                       ev->event[PGFAULT];
>> > +               memcg_stat->events[MEM_CGROUP_EVENTS_PGMAJFAULT] =
>> > +                                                       ev->event[PGMAJFAULT];
>> > +
>> > +               memcg_stat->nr_page_events = ev->event[PGPGIN] +
>> > +                                            ev->event[PGPGOUT];
> There's no valid memcg->nr_page_events until now, so the threshold
> notifier, but some people may use it even only root memcg exists.
> Moreover, using PGPGIN + PGPGOUT(exactly number of pagein + pageout)
> as nr_page_events is also inaccurate IMHO.
> 
Humm, I believe I can zero out this. Looking at the code again, this is
not imported to userspace. It is just used to activate the thresholds
and the delta of nr_page_events is a lot more important than nr_page_events.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 3/5] memcg: make it suck faster
@ 2013-03-20  7:13       ` Glauber Costa
  0 siblings, 0 replies; 72+ messages in thread
From: Glauber Costa @ 2013-03-20  7:13 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Tejun Heo, Andrew Morton, Michal Hocko,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	anton.vorontsov-QSEj5FYQhm4dnm+yROfE0A, Johannes Weiner,
	Mel Gorman

On 03/13/2013 12:08 PM, Sha Zhengju wrote:
>> +static void memcg_update_root_statistics(void)
>> > +{
>> > +       int cpu;
>> > +       u64 pgin, pgout, faults, mjfaults;
>> > +
>> > +       pgin = pgout = faults = mjfaults = 0;
>> > +       for_each_online_cpu(cpu) {
>> > +               struct vm_event_state *ev = &per_cpu(vm_event_states, cpu);
>> > +               struct mem_cgroup_stat_cpu *memcg_stat;
>> > +
>> > +               memcg_stat = per_cpu_ptr(root_mem_cgroup->stat, cpu);
>> > +
>> > +               memcg_stat->events[MEM_CGROUP_EVENTS_PGPGIN] =
>> > +                                                       ev->event[PGPGIN];
>> > +               memcg_stat->events[MEM_CGROUP_EVENTS_PGPGOUT] =
>> > +                                                       ev->event[PGPGOUT];
> ev->event[PGPGIN/PGPGOUT] is counted in block layer(submit_bio()) and
> represents the exactly number of pagein/pageout, but memcg
> PGPGIN/PGPGOUT events only count it as an event and ignore the page
> size. So here we can't straightforward take the ev->events for use.
> 
You are right about that. Although I can't think of a straightforward
way to handle this. Well, except for the obvious of adding another
global statistic.

>> > +               memcg_stat->events[MEM_CGROUP_EVENTS_PGFAULT] =
>> > +                                                       ev->event[PGFAULT];
>> > +               memcg_stat->events[MEM_CGROUP_EVENTS_PGMAJFAULT] =
>> > +                                                       ev->event[PGMAJFAULT];
>> > +
>> > +               memcg_stat->nr_page_events = ev->event[PGPGIN] +
>> > +                                            ev->event[PGPGOUT];
> There's no valid memcg->nr_page_events until now, so the threshold
> notifier, but some people may use it even only root memcg exists.
> Moreover, using PGPGIN + PGPGOUT(exactly number of pagein + pageout)
> as nr_page_events is also inaccurate IMHO.
> 
Humm, I believe I can zero out this. Looking at the code again, this is
not imported to userspace. It is just used to activate the thresholds
and the delta of nr_page_events is a lot more important than nr_page_events.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/5] memcg: provide root figures from system totals
@ 2013-03-20  8:03           ` Michal Hocko
  0 siblings, 0 replies; 72+ messages in thread
From: Michal Hocko @ 2013-03-20  8:03 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, Tejun Heo, Andrew Morton, kamezawa.hiroyu,
	handai.szj, anton.vorontsov, Johannes Weiner, Mel Gorman

On Wed 20-03-13 11:03:17, Glauber Costa wrote:
> On 03/19/2013 04:55 PM, Michal Hocko wrote:
> > On Tue 19-03-13 13:46:50, Michal Hocko wrote:
> >> On Tue 05-03-13 17:10:55, Glauber Costa wrote:
> >>> For the root memcg, there is no need to rely on the res_counters if hierarchy
> >>> is enabled The sum of all mem cgroups plus the tasks in root itself, is
> >>> necessarily the amount of memory used for the whole system. Since those figures
> >>> are already kept somewhere anyway, we can just return them here, without too
> >>> much hassle.
> >>>
> >>> Limit and soft limit can't be set for the root cgroup, so they are left at
> >>> RESOURCE_MAX. Failcnt is left at 0, because its actual meaning is how many
> >>> times we failed allocations due to the limit being hit. We will fail
> >>> allocations in the root cgroup, but the limit will never the reason.
> >>
> >> I do not like this very much to be honest. It just adds more hackery...
> >> Why cannot we simply not account if nr_cgroups == 1 and move relevant
> >> global counters to the root at the moment when a first group is
> >> created?
> > 
> > OK, it seems that the very next patch does what I was looking for. So
> > why all the churn in this patch?
> > Why do you want to make root even more special?
> 
> Because I am operating under the assumption that we want to handle that
> transparently and keep things working. If you tell me: "Hey, reading
> memory.usage_in_bytes from root should return 0!", then I can get rid of
> that.

If you simply switch to accounting for root then you do not have to care
about this, don't you?

> The fact that I keep bypassing when hierarchy is present, it is
> more of a reuse of the infrastructure since it's there anyway.
> 
> Also, I would like the root memcg to be usable, albeit cheap, for
> projects like memory pressure notifications.
 
root memcg without any childre, right?

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/5] memcg: provide root figures from system totals
@ 2013-03-20  8:03           ` Michal Hocko
  0 siblings, 0 replies; 72+ messages in thread
From: Michal Hocko @ 2013-03-20  8:03 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Tejun Heo, Andrew Morton, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	handai.szj-Re5JQEeQqe8AvxtiuMwx3w,
	anton.vorontsov-QSEj5FYQhm4dnm+yROfE0A, Johannes Weiner,
	Mel Gorman

On Wed 20-03-13 11:03:17, Glauber Costa wrote:
> On 03/19/2013 04:55 PM, Michal Hocko wrote:
> > On Tue 19-03-13 13:46:50, Michal Hocko wrote:
> >> On Tue 05-03-13 17:10:55, Glauber Costa wrote:
> >>> For the root memcg, there is no need to rely on the res_counters if hierarchy
> >>> is enabled The sum of all mem cgroups plus the tasks in root itself, is
> >>> necessarily the amount of memory used for the whole system. Since those figures
> >>> are already kept somewhere anyway, we can just return them here, without too
> >>> much hassle.
> >>>
> >>> Limit and soft limit can't be set for the root cgroup, so they are left at
> >>> RESOURCE_MAX. Failcnt is left at 0, because its actual meaning is how many
> >>> times we failed allocations due to the limit being hit. We will fail
> >>> allocations in the root cgroup, but the limit will never the reason.
> >>
> >> I do not like this very much to be honest. It just adds more hackery...
> >> Why cannot we simply not account if nr_cgroups == 1 and move relevant
> >> global counters to the root at the moment when a first group is
> >> created?
> > 
> > OK, it seems that the very next patch does what I was looking for. So
> > why all the churn in this patch?
> > Why do you want to make root even more special?
> 
> Because I am operating under the assumption that we want to handle that
> transparently and keep things working. If you tell me: "Hey, reading
> memory.usage_in_bytes from root should return 0!", then I can get rid of
> that.

If you simply switch to accounting for root then you do not have to care
about this, don't you?

> The fact that I keep bypassing when hierarchy is present, it is
> more of a reuse of the infrastructure since it's there anyway.
> 
> Also, I would like the root memcg to be usable, albeit cheap, for
> projects like memory pressure notifications.
 
root memcg without any childre, right?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/5] memcg: provide root figures from system totals
  2013-03-20  8:03           ` Michal Hocko
  (?)
@ 2013-03-20  8:08           ` Glauber Costa
  2013-03-20  8:18               ` Michal Hocko
  2013-03-20 16:40               ` Anton Vorontsov
  -1 siblings, 2 replies; 72+ messages in thread
From: Glauber Costa @ 2013-03-20  8:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, cgroups, Tejun Heo, Andrew Morton, kamezawa.hiroyu,
	handai.szj, anton.vorontsov, Johannes Weiner, Mel Gorman

On 03/20/2013 12:03 PM, Michal Hocko wrote:
> On Wed 20-03-13 11:03:17, Glauber Costa wrote:
>> On 03/19/2013 04:55 PM, Michal Hocko wrote:
>>> On Tue 19-03-13 13:46:50, Michal Hocko wrote:
>>>> On Tue 05-03-13 17:10:55, Glauber Costa wrote:
>>>>> For the root memcg, there is no need to rely on the res_counters if hierarchy
>>>>> is enabled The sum of all mem cgroups plus the tasks in root itself, is
>>>>> necessarily the amount of memory used for the whole system. Since those figures
>>>>> are already kept somewhere anyway, we can just return them here, without too
>>>>> much hassle.
>>>>>
>>>>> Limit and soft limit can't be set for the root cgroup, so they are left at
>>>>> RESOURCE_MAX. Failcnt is left at 0, because its actual meaning is how many
>>>>> times we failed allocations due to the limit being hit. We will fail
>>>>> allocations in the root cgroup, but the limit will never the reason.
>>>>
>>>> I do not like this very much to be honest. It just adds more hackery...
>>>> Why cannot we simply not account if nr_cgroups == 1 and move relevant
>>>> global counters to the root at the moment when a first group is
>>>> created?
>>>
>>> OK, it seems that the very next patch does what I was looking for. So
>>> why all the churn in this patch?
>>> Why do you want to make root even more special?
>>
>> Because I am operating under the assumption that we want to handle that
>> transparently and keep things working. If you tell me: "Hey, reading
>> memory.usage_in_bytes from root should return 0!", then I can get rid of
>> that.
> 
> If you simply switch to accounting for root then you do not have to care
> about this, don't you?
> 
Of course not, but the whole point here is *not* accounting root. So if
we are entirely skipping root account, it, I personally believe we need
to replace it with something else so we can keep things working as much
as we can.

It doesn't need to be perfect, though: There is no way we can have
max_usage without something like a res_counter that locks memory
charges. I believe we can live without that. But as for the basic
statistics and numbers, I believe they should keep working.

>> The fact that I keep bypassing when hierarchy is present, it is
>> more of a reuse of the infrastructure since it's there anyway.
>>
>> Also, I would like the root memcg to be usable, albeit cheap, for
>> projects like memory pressure notifications.
>  
> root memcg without any childre, right?
> 
yes, of course.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 3/5] memcg: make it suck faster
@ 2013-03-20  8:13         ` Michal Hocko
  0 siblings, 0 replies; 72+ messages in thread
From: Michal Hocko @ 2013-03-20  8:13 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, Tejun Heo, Andrew Morton, kamezawa.hiroyu,
	handai.szj, anton.vorontsov, Johannes Weiner, Mel Gorman

On Wed 20-03-13 11:00:03, Glauber Costa wrote:
> Sorry all for taking a lot of time to reply to this. I've been really busy.
> 
> On 03/19/2013 05:58 PM, Michal Hocko wrote:
> > On Tue 05-03-13 17:10:56, Glauber Costa wrote:
> >> It is an accepted fact that memcg sucks. But can it suck faster?  Or in
> >> a more fair statement, can it at least stop draining everyone's
> >> performance when it is not in use?
> >>
> >> This experimental and slightly crude patch demonstrates that we can do
> >> that by using static branches to patch it out until the first memcg
> >> comes to life. There are edges to be trimmed, and I appreciate comments
> >> for direction. In particular, the events in the root are not fired, but
> >> I believe this can be done without further problems by calling a
> >> specialized event check from mem_cgroup_newpage_charge().
> >>
> >> My goal was to have enough numbers to demonstrate the performance gain
> >> that can come from it. I tested it in a 24-way 2-socket Intel box, 24 Gb
> >> mem. I used Mel Gorman's pft test, that he used to demonstrate this
> >> problem back in the Kernel Summit. There are three kernels:
> >>
> >> nomemcg  : memcg compile disabled.
> >> base     : memcg enabled, patch not applied.
> >> bypassed : memcg enabled, with patch applied.
> >>
> >>                 base    bypassed
> >> User          109.12      105.64
> >> System       1646.84     1597.98
> >> Elapsed       229.56      215.76
> >>
> >>              nomemcg    bypassed
> >> User          104.35      105.64
> >> System       1578.19     1597.98
> >> Elapsed       212.33      215.76
> > 
> > Do you have profiles for where we spend the time?
> > 
> 
> I don't *have* in the sense that I never saved them, but it is easy to
> grab. I've just run Mel's pft test with perf top -a in parallel, and
> that was mostly the charge and uncharge functions being run.

Would be nice to have this information to know which parts need
optimization with a higher priority. I do not think we will make it ~0%
cost in a single run.

> >>  #ifdef CONFIG_MEMCG
> >> +extern struct static_key memcg_in_use_key;
> >> +
> >> +static inline bool mem_cgroup_subsys_disabled(void)
> >> +{
> >> +	return !!mem_cgroup_subsys.disabled;
> >> +}
> >> +
> >> +static inline bool mem_cgroup_disabled(void)
> >> +{
> >> +	/*
> >> +	 * Will always be false if subsys is disabled, because we have no one
> >> +	 * to bump it up. So the test suffices and we don't have to test the
> >> +	 * subsystem as well
> >> +	 */
> > 
> > but static_key_false adds an atomic read here which is more costly so I
> > am not sure you are optimizing much.
> > 
> 
> No it doesn't. You're missing the point of static branches: The code is
> *patched out* until it is not used.

OK, I should have been more specific. It adds an atomic if static
branches are disabled.

> So it adds a predictable deterministic jump instruction to the false
> statement, and that's it (hence their previous name 'jump label').
> 
> >> +
> >> +extern int __mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> >> +				     gfp_t gfp_mask);
> >> +static inline int
> >> +mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
> >> +{
> >> +	if (mem_cgroup_disabled())
> >> +		return 0;
> >> +
> >> +	return __mem_cgroup_cache_charge(page, mm, gfp_mask);
> >> +}
> > 
> > Are there any reasons to not get down to __mem_cgroup_try_charge? We
> > will not be perfect, all right, because some wrappers already do some
> > work but we should at least cover most of them.
> > 
> > I am also thinking whether this stab at charging path is not just an
> > overkill. Wouldn't it suffice to do something like:
> 
> I don't know. I could test. I just see no reason for that. Being able to
> patch out code in the caller level means we'll not incur even a function
> call. That's a generally accepted good thing to do in hot paths.

Agreed. I just think that the charging path is rather complicated and
changing it incrementally is a lower risk. Maybe we just find out that
the biggest overhead can be reduced by a simpler approach.

> Specially given the fact that the memcg overhead seems not to be
> concentrated in one single place, but as Christoph Lameter defined,
> "death by a thousand cuts", I'd much rather not even pay the function
> calls if I can avoid. If I introducing great complexity for that, fine,
> I could trade off. But honestly, the patch gets bigger but that's it.

And more code and all the charging paths are already quite complicated.
I do not want to add more, if possible.
 
> >>  struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
> >>  struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
> > [...]
> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> index bfbf1c2..45c1886 100644
> >> --- a/mm/memcontrol.c
> >> +++ b/mm/memcontrol.c
> > [...]
> >> @@ -1335,6 +1345,20 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
> >>  	memcg = pc->mem_cgroup;
> > 
> > I would expect that you want to prevent lookup as well if there are no
> > other groups.
> >
> well, that function in specific seems to be mostly called during
> reclaim, where I wasn't terribly concerned about optimizations, unlike
> the steady state functions.

Not only from reclaim. Also when a page is uncharged.
Anyway we can get the optimization almost for free here ;)
 
> >>  	/*
> >> +	 * Because we lazily enable memcg only after first child group is
> >> +	 * created, we can have memcg == 0. Because page cgroup is created with
> >> +	 * GFP_ZERO, and after charging, all page cgroups will have a non-zero
> >> +	 * cgroup attached (even if root), we can be sure that this is a
> >> +	 * used-but-not-accounted page. (due to lazyness). We could get around
> >> +	 * that by scanning all pages on cgroup init is too expensive. We can
> >> +	 * ultimately pay, but prefer to just to defer the update until we get
> >> +	 * here. We could take the opportunity to set PageCgroupUsed, but it
> >> +	 * won't be that important for the root cgroup.
> >> +	 */
> >> +	if (!memcg && PageLRU(page))
> >> +		pc->mem_cgroup = memcg = root_mem_cgroup;
> > 
> > Why not return page_cgroup_zoneinfo(root_mem_cgroup, page);
> > This would require messing up with __mem_cgroup_uncharge_common but that
> > doesn't sound incredibly crazy (to the local standard of course ;)).
> > 
> 
> Could you clarify?

You can save some cycles by returning lruvec from here directly and do
not get through:
	if (!PageLRU(page) && !PageCgroupUsed(pc) && memcg != root_mem_cgroup)
		pc->mem_cgroup = memcg = root_mem_cgroup;

	mz = page_cgroup_zoneinfo(memcg, page);
	lruvec = &mz->lruvec; 

again. Just a nano optimization in that path
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 3/5] memcg: make it suck faster
@ 2013-03-20  8:13         ` Michal Hocko
  0 siblings, 0 replies; 72+ messages in thread
From: Michal Hocko @ 2013-03-20  8:13 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Tejun Heo, Andrew Morton, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	handai.szj-Re5JQEeQqe8AvxtiuMwx3w,
	anton.vorontsov-QSEj5FYQhm4dnm+yROfE0A, Johannes Weiner,
	Mel Gorman

On Wed 20-03-13 11:00:03, Glauber Costa wrote:
> Sorry all for taking a lot of time to reply to this. I've been really busy.
> 
> On 03/19/2013 05:58 PM, Michal Hocko wrote:
> > On Tue 05-03-13 17:10:56, Glauber Costa wrote:
> >> It is an accepted fact that memcg sucks. But can it suck faster?  Or in
> >> a more fair statement, can it at least stop draining everyone's
> >> performance when it is not in use?
> >>
> >> This experimental and slightly crude patch demonstrates that we can do
> >> that by using static branches to patch it out until the first memcg
> >> comes to life. There are edges to be trimmed, and I appreciate comments
> >> for direction. In particular, the events in the root are not fired, but
> >> I believe this can be done without further problems by calling a
> >> specialized event check from mem_cgroup_newpage_charge().
> >>
> >> My goal was to have enough numbers to demonstrate the performance gain
> >> that can come from it. I tested it in a 24-way 2-socket Intel box, 24 Gb
> >> mem. I used Mel Gorman's pft test, that he used to demonstrate this
> >> problem back in the Kernel Summit. There are three kernels:
> >>
> >> nomemcg  : memcg compile disabled.
> >> base     : memcg enabled, patch not applied.
> >> bypassed : memcg enabled, with patch applied.
> >>
> >>                 base    bypassed
> >> User          109.12      105.64
> >> System       1646.84     1597.98
> >> Elapsed       229.56      215.76
> >>
> >>              nomemcg    bypassed
> >> User          104.35      105.64
> >> System       1578.19     1597.98
> >> Elapsed       212.33      215.76
> > 
> > Do you have profiles for where we spend the time?
> > 
> 
> I don't *have* in the sense that I never saved them, but it is easy to
> grab. I've just run Mel's pft test with perf top -a in parallel, and
> that was mostly the charge and uncharge functions being run.

Would be nice to have this information to know which parts need
optimization with a higher priority. I do not think we will make it ~0%
cost in a single run.

> >>  #ifdef CONFIG_MEMCG
> >> +extern struct static_key memcg_in_use_key;
> >> +
> >> +static inline bool mem_cgroup_subsys_disabled(void)
> >> +{
> >> +	return !!mem_cgroup_subsys.disabled;
> >> +}
> >> +
> >> +static inline bool mem_cgroup_disabled(void)
> >> +{
> >> +	/*
> >> +	 * Will always be false if subsys is disabled, because we have no one
> >> +	 * to bump it up. So the test suffices and we don't have to test the
> >> +	 * subsystem as well
> >> +	 */
> > 
> > but static_key_false adds an atomic read here which is more costly so I
> > am not sure you are optimizing much.
> > 
> 
> No it doesn't. You're missing the point of static branches: The code is
> *patched out* until it is not used.

OK, I should have been more specific. It adds an atomic if static
branches are disabled.

> So it adds a predictable deterministic jump instruction to the false
> statement, and that's it (hence their previous name 'jump label').
> 
> >> +
> >> +extern int __mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
> >> +				     gfp_t gfp_mask);
> >> +static inline int
> >> +mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
> >> +{
> >> +	if (mem_cgroup_disabled())
> >> +		return 0;
> >> +
> >> +	return __mem_cgroup_cache_charge(page, mm, gfp_mask);
> >> +}
> > 
> > Are there any reasons to not get down to __mem_cgroup_try_charge? We
> > will not be perfect, all right, because some wrappers already do some
> > work but we should at least cover most of them.
> > 
> > I am also thinking whether this stab at charging path is not just an
> > overkill. Wouldn't it suffice to do something like:
> 
> I don't know. I could test. I just see no reason for that. Being able to
> patch out code in the caller level means we'll not incur even a function
> call. That's a generally accepted good thing to do in hot paths.

Agreed. I just think that the charging path is rather complicated and
changing it incrementally is a lower risk. Maybe we just find out that
the biggest overhead can be reduced by a simpler approach.

> Specially given the fact that the memcg overhead seems not to be
> concentrated in one single place, but as Christoph Lameter defined,
> "death by a thousand cuts", I'd much rather not even pay the function
> calls if I can avoid. If I introducing great complexity for that, fine,
> I could trade off. But honestly, the patch gets bigger but that's it.

And more code and all the charging paths are already quite complicated.
I do not want to add more, if possible.
 
> >>  struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
> >>  struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
> > [...]
> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> index bfbf1c2..45c1886 100644
> >> --- a/mm/memcontrol.c
> >> +++ b/mm/memcontrol.c
> > [...]
> >> @@ -1335,6 +1345,20 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct zone *zone)
> >>  	memcg = pc->mem_cgroup;
> > 
> > I would expect that you want to prevent lookup as well if there are no
> > other groups.
> >
> well, that function in specific seems to be mostly called during
> reclaim, where I wasn't terribly concerned about optimizations, unlike
> the steady state functions.

Not only from reclaim. Also when a page is uncharged.
Anyway we can get the optimization almost for free here ;)
 
> >>  	/*
> >> +	 * Because we lazily enable memcg only after first child group is
> >> +	 * created, we can have memcg == 0. Because page cgroup is created with
> >> +	 * GFP_ZERO, and after charging, all page cgroups will have a non-zero
> >> +	 * cgroup attached (even if root), we can be sure that this is a
> >> +	 * used-but-not-accounted page. (due to lazyness). We could get around
> >> +	 * that by scanning all pages on cgroup init is too expensive. We can
> >> +	 * ultimately pay, but prefer to just to defer the update until we get
> >> +	 * here. We could take the opportunity to set PageCgroupUsed, but it
> >> +	 * won't be that important for the root cgroup.
> >> +	 */
> >> +	if (!memcg && PageLRU(page))
> >> +		pc->mem_cgroup = memcg = root_mem_cgroup;
> > 
> > Why not return page_cgroup_zoneinfo(root_mem_cgroup, page);
> > This would require messing up with __mem_cgroup_uncharge_common but that
> > doesn't sound incredibly crazy (to the local standard of course ;)).
> > 
> 
> Could you clarify?

You can save some cycles by returning lruvec from here directly and do
not get through:
	if (!PageLRU(page) && !PageCgroupUsed(pc) && memcg != root_mem_cgroup)
		pc->mem_cgroup = memcg = root_mem_cgroup;

	mz = page_cgroup_zoneinfo(memcg, page);
	lruvec = &mz->lruvec; 

again. Just a nano optimization in that path
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/5] memcg: provide root figures from system totals
@ 2013-03-20  8:18               ` Michal Hocko
  0 siblings, 0 replies; 72+ messages in thread
From: Michal Hocko @ 2013-03-20  8:18 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, Tejun Heo, Andrew Morton, kamezawa.hiroyu,
	handai.szj, anton.vorontsov, Johannes Weiner, Mel Gorman

On Wed 20-03-13 12:08:17, Glauber Costa wrote:
> On 03/20/2013 12:03 PM, Michal Hocko wrote:
> > On Wed 20-03-13 11:03:17, Glauber Costa wrote:
> >> On 03/19/2013 04:55 PM, Michal Hocko wrote:
> >>> On Tue 19-03-13 13:46:50, Michal Hocko wrote:
> >>>> On Tue 05-03-13 17:10:55, Glauber Costa wrote:
> >>>>> For the root memcg, there is no need to rely on the res_counters if hierarchy
> >>>>> is enabled The sum of all mem cgroups plus the tasks in root itself, is
> >>>>> necessarily the amount of memory used for the whole system. Since those figures
> >>>>> are already kept somewhere anyway, we can just return them here, without too
> >>>>> much hassle.
> >>>>>
> >>>>> Limit and soft limit can't be set for the root cgroup, so they are left at
> >>>>> RESOURCE_MAX. Failcnt is left at 0, because its actual meaning is how many
> >>>>> times we failed allocations due to the limit being hit. We will fail
> >>>>> allocations in the root cgroup, but the limit will never the reason.
> >>>>
> >>>> I do not like this very much to be honest. It just adds more hackery...
> >>>> Why cannot we simply not account if nr_cgroups == 1 and move relevant
> >>>> global counters to the root at the moment when a first group is
> >>>> created?
> >>>
> >>> OK, it seems that the very next patch does what I was looking for. So
> >>> why all the churn in this patch?
> >>> Why do you want to make root even more special?
> >>
> >> Because I am operating under the assumption that we want to handle that
> >> transparently and keep things working. If you tell me: "Hey, reading
> >> memory.usage_in_bytes from root should return 0!", then I can get rid of
> >> that.
> > 
> > If you simply switch to accounting for root then you do not have to care
> > about this, don't you?
> > 
> Of course not, but the whole point here is *not* accounting root.

I thought the objective was to not account root if there are no
children. I would see the "not account root at all" as another step.
And we are skipping charging it already (do not call
mem_cgroup_do_charge) for root.

> So if we are entirely skipping root account, it, I personally believe
> we need to replace it with something else so we can keep things
> working as much as we can.
> 
> It doesn't need to be perfect, though: There is no way we can have
> max_usage without something like a res_counter that locks memory
> charges. I believe we can live without that. But as for the basic
> statistics and numbers, I believe they should keep working.
 
[...]
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/5] memcg: provide root figures from system totals
@ 2013-03-20  8:18               ` Michal Hocko
  0 siblings, 0 replies; 72+ messages in thread
From: Michal Hocko @ 2013-03-20  8:18 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Tejun Heo, Andrew Morton, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	handai.szj-Re5JQEeQqe8AvxtiuMwx3w,
	anton.vorontsov-QSEj5FYQhm4dnm+yROfE0A, Johannes Weiner,
	Mel Gorman

On Wed 20-03-13 12:08:17, Glauber Costa wrote:
> On 03/20/2013 12:03 PM, Michal Hocko wrote:
> > On Wed 20-03-13 11:03:17, Glauber Costa wrote:
> >> On 03/19/2013 04:55 PM, Michal Hocko wrote:
> >>> On Tue 19-03-13 13:46:50, Michal Hocko wrote:
> >>>> On Tue 05-03-13 17:10:55, Glauber Costa wrote:
> >>>>> For the root memcg, there is no need to rely on the res_counters if hierarchy
> >>>>> is enabled The sum of all mem cgroups plus the tasks in root itself, is
> >>>>> necessarily the amount of memory used for the whole system. Since those figures
> >>>>> are already kept somewhere anyway, we can just return them here, without too
> >>>>> much hassle.
> >>>>>
> >>>>> Limit and soft limit can't be set for the root cgroup, so they are left at
> >>>>> RESOURCE_MAX. Failcnt is left at 0, because its actual meaning is how many
> >>>>> times we failed allocations due to the limit being hit. We will fail
> >>>>> allocations in the root cgroup, but the limit will never the reason.
> >>>>
> >>>> I do not like this very much to be honest. It just adds more hackery...
> >>>> Why cannot we simply not account if nr_cgroups == 1 and move relevant
> >>>> global counters to the root at the moment when a first group is
> >>>> created?
> >>>
> >>> OK, it seems that the very next patch does what I was looking for. So
> >>> why all the churn in this patch?
> >>> Why do you want to make root even more special?
> >>
> >> Because I am operating under the assumption that we want to handle that
> >> transparently and keep things working. If you tell me: "Hey, reading
> >> memory.usage_in_bytes from root should return 0!", then I can get rid of
> >> that.
> > 
> > If you simply switch to accounting for root then you do not have to care
> > about this, don't you?
> > 
> Of course not, but the whole point here is *not* accounting root.

I thought the objective was to not account root if there are no
children. I would see the "not account root at all" as another step.
And we are skipping charging it already (do not call
mem_cgroup_do_charge) for root.

> So if we are entirely skipping root account, it, I personally believe
> we need to replace it with something else so we can keep things
> working as much as we can.
> 
> It doesn't need to be perfect, though: There is no way we can have
> max_usage without something like a res_counter that locks memory
> charges. I believe we can live without that. But as for the basic
> statistics and numbers, I believe they should keep working.
 
[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/5] memcg: provide root figures from system totals
  2013-03-20  8:18               ` Michal Hocko
  (?)
@ 2013-03-20  8:34               ` Glauber Costa
  2013-03-20  8:58                 ` Michal Hocko
  -1 siblings, 1 reply; 72+ messages in thread
From: Glauber Costa @ 2013-03-20  8:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, cgroups, Tejun Heo, Andrew Morton, kamezawa.hiroyu,
	handai.szj, anton.vorontsov, Johannes Weiner, Mel Gorman

On 03/20/2013 12:18 PM, Michal Hocko wrote:
> On Wed 20-03-13 12:08:17, Glauber Costa wrote:
>> On 03/20/2013 12:03 PM, Michal Hocko wrote:
>>> On Wed 20-03-13 11:03:17, Glauber Costa wrote:
>>>> On 03/19/2013 04:55 PM, Michal Hocko wrote:
>>>>> On Tue 19-03-13 13:46:50, Michal Hocko wrote:
>>>>>> On Tue 05-03-13 17:10:55, Glauber Costa wrote:
>>>>>>> For the root memcg, there is no need to rely on the res_counters if hierarchy
>>>>>>> is enabled The sum of all mem cgroups plus the tasks in root itself, is
>>>>>>> necessarily the amount of memory used for the whole system. Since those figures
>>>>>>> are already kept somewhere anyway, we can just return them here, without too
>>>>>>> much hassle.
>>>>>>>
>>>>>>> Limit and soft limit can't be set for the root cgroup, so they are left at
>>>>>>> RESOURCE_MAX. Failcnt is left at 0, because its actual meaning is how many
>>>>>>> times we failed allocations due to the limit being hit. We will fail
>>>>>>> allocations in the root cgroup, but the limit will never the reason.
>>>>>>
>>>>>> I do not like this very much to be honest. It just adds more hackery...
>>>>>> Why cannot we simply not account if nr_cgroups == 1 and move relevant
>>>>>> global counters to the root at the moment when a first group is
>>>>>> created?
>>>>>
>>>>> OK, it seems that the very next patch does what I was looking for. So
>>>>> why all the churn in this patch?
>>>>> Why do you want to make root even more special?
>>>>
>>>> Because I am operating under the assumption that we want to handle that
>>>> transparently and keep things working. If you tell me: "Hey, reading
>>>> memory.usage_in_bytes from root should return 0!", then I can get rid of
>>>> that.
>>>
>>> If you simply switch to accounting for root then you do not have to care
>>> about this, don't you?
>>>
>> Of course not, but the whole point here is *not* accounting root.
> 
> I thought the objective was to not account root if there are no
> children. 

It is the goal, yes. As I said: I want the root-only case to keep
providing userspace with meaningful statistics, therefore the bypass.
But since the machinery is in place, it is trivial to keep bypassing for
use_hierarchy = 1 at the root level. If you believe it would be simpler,
I could refrain from doing it.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/5] memcg: provide root figures from system totals
  2013-03-20  8:34               ` Glauber Costa
@ 2013-03-20  8:58                 ` Michal Hocko
  2013-03-20  9:30                   ` Glauber Costa
  0 siblings, 1 reply; 72+ messages in thread
From: Michal Hocko @ 2013-03-20  8:58 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, Tejun Heo, Andrew Morton, kamezawa.hiroyu,
	handai.szj, anton.vorontsov, Johannes Weiner, Mel Gorman

On Wed 20-03-13 12:34:01, Glauber Costa wrote:
> On 03/20/2013 12:18 PM, Michal Hocko wrote:
> > On Wed 20-03-13 12:08:17, Glauber Costa wrote:
> >> On 03/20/2013 12:03 PM, Michal Hocko wrote:
> >>> On Wed 20-03-13 11:03:17, Glauber Costa wrote:
> >>>> On 03/19/2013 04:55 PM, Michal Hocko wrote:
> >>>>> On Tue 19-03-13 13:46:50, Michal Hocko wrote:
> >>>>>> On Tue 05-03-13 17:10:55, Glauber Costa wrote:
> >>>>>>> For the root memcg, there is no need to rely on the res_counters if hierarchy
> >>>>>>> is enabled The sum of all mem cgroups plus the tasks in root itself, is
> >>>>>>> necessarily the amount of memory used for the whole system. Since those figures
> >>>>>>> are already kept somewhere anyway, we can just return them here, without too
> >>>>>>> much hassle.
> >>>>>>>
> >>>>>>> Limit and soft limit can't be set for the root cgroup, so they are left at
> >>>>>>> RESOURCE_MAX. Failcnt is left at 0, because its actual meaning is how many
> >>>>>>> times we failed allocations due to the limit being hit. We will fail
> >>>>>>> allocations in the root cgroup, but the limit will never the reason.
> >>>>>>
> >>>>>> I do not like this very much to be honest. It just adds more hackery...
> >>>>>> Why cannot we simply not account if nr_cgroups == 1 and move relevant
> >>>>>> global counters to the root at the moment when a first group is
> >>>>>> created?
> >>>>>
> >>>>> OK, it seems that the very next patch does what I was looking for. So
> >>>>> why all the churn in this patch?
> >>>>> Why do you want to make root even more special?
> >>>>
> >>>> Because I am operating under the assumption that we want to handle that
> >>>> transparently and keep things working. If you tell me: "Hey, reading
> >>>> memory.usage_in_bytes from root should return 0!", then I can get rid of
> >>>> that.
> >>>
> >>> If you simply switch to accounting for root then you do not have to care
> >>> about this, don't you?
> >>>
> >> Of course not, but the whole point here is *not* accounting root.
> > 
> > I thought the objective was to not account root if there are no
> > children. 
> 
> It is the goal, yes. As I said: I want the root-only case to keep
> providing userspace with meaningful statistics,

Sure, statistics need to stay at the place. I am not objecting on that.

> therefore the bypass.

I am just arguing about bypassing root even when there are children and
use_hierarchy == 1 because it adds more code to maintain.

> But since the machinery is in place, it is trivial to keep bypassing for
> use_hierarchy = 1 at the root level. If you believe it would be simpler,
> I could refrain from doing it.

I am all for "the simple the better" and add more optimizations on top.
We have a real issue now and we should eliminate it. My original plan
was to look at the bottlenecks and eliminate them one after another in
smaller steps. But all the work I have on the plate is preempting me
from looking into that...
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/5] memcg: provide root figures from system totals
  2013-03-20  8:58                 ` Michal Hocko
@ 2013-03-20  9:30                   ` Glauber Costa
  2013-03-21  6:08                       ` Kamezawa Hiroyuki
  0 siblings, 1 reply; 72+ messages in thread
From: Glauber Costa @ 2013-03-20  9:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, cgroups, Tejun Heo, Andrew Morton, kamezawa.hiroyu,
	handai.szj, anton.vorontsov, Johannes Weiner, Mel Gorman

On 03/20/2013 12:58 PM, Michal Hocko wrote:
> On Wed 20-03-13 12:34:01, Glauber Costa wrote:
>> On 03/20/2013 12:18 PM, Michal Hocko wrote:
>>> On Wed 20-03-13 12:08:17, Glauber Costa wrote:
>>>> On 03/20/2013 12:03 PM, Michal Hocko wrote:
>>>>> On Wed 20-03-13 11:03:17, Glauber Costa wrote:
>>>>>> On 03/19/2013 04:55 PM, Michal Hocko wrote:
>>>>>>> On Tue 19-03-13 13:46:50, Michal Hocko wrote:
>>>>>>>> On Tue 05-03-13 17:10:55, Glauber Costa wrote:
>>>>>>>>> For the root memcg, there is no need to rely on the res_counters if hierarchy
>>>>>>>>> is enabled The sum of all mem cgroups plus the tasks in root itself, is
>>>>>>>>> necessarily the amount of memory used for the whole system. Since those figures
>>>>>>>>> are already kept somewhere anyway, we can just return them here, without too
>>>>>>>>> much hassle.
>>>>>>>>>
>>>>>>>>> Limit and soft limit can't be set for the root cgroup, so they are left at
>>>>>>>>> RESOURCE_MAX. Failcnt is left at 0, because its actual meaning is how many
>>>>>>>>> times we failed allocations due to the limit being hit. We will fail
>>>>>>>>> allocations in the root cgroup, but the limit will never the reason.
>>>>>>>>
>>>>>>>> I do not like this very much to be honest. It just adds more hackery...
>>>>>>>> Why cannot we simply not account if nr_cgroups == 1 and move relevant
>>>>>>>> global counters to the root at the moment when a first group is
>>>>>>>> created?
>>>>>>>
>>>>>>> OK, it seems that the very next patch does what I was looking for. So
>>>>>>> why all the churn in this patch?
>>>>>>> Why do you want to make root even more special?
>>>>>>
>>>>>> Because I am operating under the assumption that we want to handle that
>>>>>> transparently and keep things working. If you tell me: "Hey, reading
>>>>>> memory.usage_in_bytes from root should return 0!", then I can get rid of
>>>>>> that.
>>>>>
>>>>> If you simply switch to accounting for root then you do not have to care
>>>>> about this, don't you?
>>>>>
>>>> Of course not, but the whole point here is *not* accounting root.
>>>
>>> I thought the objective was to not account root if there are no
>>> children. 
>>
>> It is the goal, yes. As I said: I want the root-only case to keep
>> providing userspace with meaningful statistics,
> 
> Sure, statistics need to stay at the place. I am not objecting on that.
> 
>> therefore the bypass.
> 
> I am just arguing about bypassing root even when there are children and
> use_hierarchy == 1 because it adds more code to maintain.
> 
>> But since the machinery is in place, it is trivial to keep bypassing for
>> use_hierarchy = 1 at the root level. If you believe it would be simpler,
>> I could refrain from doing it.
> 
> I am all for "the simple the better" and add more optimizations on top.
> We have a real issue now and we should eliminate it. My original plan
> was to look at the bottlenecks and eliminate them one after another in
> smaller steps. But all the work I have on the plate is preempting me
> from looking into that...
> 
Been there, done that =)

I have no objections removing the special case for use_hierarchy == 1.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/5] memcg: provide root figures from system totals
@ 2013-03-20 16:40               ` Anton Vorontsov
  0 siblings, 0 replies; 72+ messages in thread
From: Anton Vorontsov @ 2013-03-20 16:40 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Michal Hocko, linux-mm, cgroups, Tejun Heo, Andrew Morton,
	kamezawa.hiroyu, handai.szj, Johannes Weiner, Mel Gorman

On Wed, Mar 20, 2013 at 12:08:17PM +0400, Glauber Costa wrote:
[...]
> >> The fact that I keep bypassing when hierarchy is present, it is
> >> more of a reuse of the infrastructure since it's there anyway.
> >>
> >> Also, I would like the root memcg to be usable, albeit cheap, for
> >> projects like memory pressure notifications.
> >  
> > root memcg without any childre, right?
> > 
> yes, of course.

Just want to raise a voice of support for this one. Thanks to Glauber's
efforts, we might not need another memory pressure interface for
CONFIG_MEMCG=n case, since we might have CONFIG_MEMCG=y that will be super
cheap when used with just a root memcg w/o children, and still usable for
mem pressure. So this particular scenario is actually in demand[1].

Thanks!

Anton

[1] http://lkml.indiana.edu/hypermail/linux/kernel/1302.2/03173.html

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/5] memcg: provide root figures from system totals
@ 2013-03-20 16:40               ` Anton Vorontsov
  0 siblings, 0 replies; 72+ messages in thread
From: Anton Vorontsov @ 2013-03-20 16:40 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Michal Hocko, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, Andrew Morton,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	handai.szj-Re5JQEeQqe8AvxtiuMwx3w, Johannes Weiner, Mel Gorman

On Wed, Mar 20, 2013 at 12:08:17PM +0400, Glauber Costa wrote:
[...]
> >> The fact that I keep bypassing when hierarchy is present, it is
> >> more of a reuse of the infrastructure since it's there anyway.
> >>
> >> Also, I would like the root memcg to be usable, albeit cheap, for
> >> projects like memory pressure notifications.
> >  
> > root memcg without any childre, right?
> > 
> yes, of course.

Just want to raise a voice of support for this one. Thanks to Glauber's
efforts, we might not need another memory pressure interface for
CONFIG_MEMCG=n case, since we might have CONFIG_MEMCG=y that will be super
cheap when used with just a root memcg w/o children, and still usable for
mem pressure. So this particular scenario is actually in demand[1].

Thanks!

Anton

[1] http://lkml.indiana.edu/hypermail/linux/kernel/1302.2/03173.html

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/5] memcg: provide root figures from system totals
@ 2013-03-21  6:08                       ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 72+ messages in thread
From: Kamezawa Hiroyuki @ 2013-03-21  6:08 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Michal Hocko, linux-mm, cgroups, Tejun Heo, Andrew Morton,
	handai.szj, anton.vorontsov, Johannes Weiner, Mel Gorman

(2013/03/20 18:30), Glauber Costa wrote:
> On 03/20/2013 12:58 PM, Michal Hocko wrote:
>> On Wed 20-03-13 12:34:01, Glauber Costa wrote:
>>> On 03/20/2013 12:18 PM, Michal Hocko wrote:
>>>> On Wed 20-03-13 12:08:17, Glauber Costa wrote:
>>>>> On 03/20/2013 12:03 PM, Michal Hocko wrote:
>>>>>> On Wed 20-03-13 11:03:17, Glauber Costa wrote:
>>>>>>> On 03/19/2013 04:55 PM, Michal Hocko wrote:
>>>>>>>> On Tue 19-03-13 13:46:50, Michal Hocko wrote:
>>>>>>>>> On Tue 05-03-13 17:10:55, Glauber Costa wrote:
>>>>>>>>>> For the root memcg, there is no need to rely on the res_counters if hierarchy
>>>>>>>>>> is enabled The sum of all mem cgroups plus the tasks in root itself, is
>>>>>>>>>> necessarily the amount of memory used for the whole system. Since those figures
>>>>>>>>>> are already kept somewhere anyway, we can just return them here, without too
>>>>>>>>>> much hassle.
>>>>>>>>>>
>>>>>>>>>> Limit and soft limit can't be set for the root cgroup, so they are left at
>>>>>>>>>> RESOURCE_MAX. Failcnt is left at 0, because its actual meaning is how many
>>>>>>>>>> times we failed allocations due to the limit being hit. We will fail
>>>>>>>>>> allocations in the root cgroup, but the limit will never the reason.
>>>>>>>>>
>>>>>>>>> I do not like this very much to be honest. It just adds more hackery...
>>>>>>>>> Why cannot we simply not account if nr_cgroups == 1 and move relevant
>>>>>>>>> global counters to the root at the moment when a first group is
>>>>>>>>> created?
>>>>>>>>
>>>>>>>> OK, it seems that the very next patch does what I was looking for. So
>>>>>>>> why all the churn in this patch?
>>>>>>>> Why do you want to make root even more special?
>>>>>>>
>>>>>>> Because I am operating under the assumption that we want to handle that
>>>>>>> transparently and keep things working. If you tell me: "Hey, reading
>>>>>>> memory.usage_in_bytes from root should return 0!", then I can get rid of
>>>>>>> that.
>>>>>>
>>>>>> If you simply switch to accounting for root then you do not have to care
>>>>>> about this, don't you?
>>>>>>
>>>>> Of course not, but the whole point here is *not* accounting root.
>>>>
>>>> I thought the objective was to not account root if there are no
>>>> children.
>>>
>>> It is the goal, yes. As I said: I want the root-only case to keep
>>> providing userspace with meaningful statistics,
>>
>> Sure, statistics need to stay at the place. I am not objecting on that.
>>
>>> therefore the bypass.
>>
>> I am just arguing about bypassing root even when there are children and
>> use_hierarchy == 1 because it adds more code to maintain.
>>
>>> But since the machinery is in place, it is trivial to keep bypassing for
>>> use_hierarchy = 1 at the root level. If you believe it would be simpler,
>>> I could refrain from doing it.
>>
>> I am all for "the simple the better" and add more optimizations on top.
>> We have a real issue now and we should eliminate it. My original plan
>> was to look at the bottlenecks and eliminate them one after another in
>> smaller steps. But all the work I have on the plate is preempting me
>> from looking into that...
>>
> Been there, done that =)
>
> I have no objections removing the special case for use_hierarchy == 1.
>
I agree.

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/5] memcg: provide root figures from system totals
@ 2013-03-21  6:08                       ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 72+ messages in thread
From: Kamezawa Hiroyuki @ 2013-03-21  6:08 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Michal Hocko, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	cgroups-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, Andrew Morton,
	handai.szj-Re5JQEeQqe8AvxtiuMwx3w,
	anton.vorontsov-QSEj5FYQhm4dnm+yROfE0A, Johannes Weiner,
	Mel Gorman

(2013/03/20 18:30), Glauber Costa wrote:
> On 03/20/2013 12:58 PM, Michal Hocko wrote:
>> On Wed 20-03-13 12:34:01, Glauber Costa wrote:
>>> On 03/20/2013 12:18 PM, Michal Hocko wrote:
>>>> On Wed 20-03-13 12:08:17, Glauber Costa wrote:
>>>>> On 03/20/2013 12:03 PM, Michal Hocko wrote:
>>>>>> On Wed 20-03-13 11:03:17, Glauber Costa wrote:
>>>>>>> On 03/19/2013 04:55 PM, Michal Hocko wrote:
>>>>>>>> On Tue 19-03-13 13:46:50, Michal Hocko wrote:
>>>>>>>>> On Tue 05-03-13 17:10:55, Glauber Costa wrote:
>>>>>>>>>> For the root memcg, there is no need to rely on the res_counters if hierarchy
>>>>>>>>>> is enabled The sum of all mem cgroups plus the tasks in root itself, is
>>>>>>>>>> necessarily the amount of memory used for the whole system. Since those figures
>>>>>>>>>> are already kept somewhere anyway, we can just return them here, without too
>>>>>>>>>> much hassle.
>>>>>>>>>>
>>>>>>>>>> Limit and soft limit can't be set for the root cgroup, so they are left at
>>>>>>>>>> RESOURCE_MAX. Failcnt is left at 0, because its actual meaning is how many
>>>>>>>>>> times we failed allocations due to the limit being hit. We will fail
>>>>>>>>>> allocations in the root cgroup, but the limit will never the reason.
>>>>>>>>>
>>>>>>>>> I do not like this very much to be honest. It just adds more hackery...
>>>>>>>>> Why cannot we simply not account if nr_cgroups == 1 and move relevant
>>>>>>>>> global counters to the root at the moment when a first group is
>>>>>>>>> created?
>>>>>>>>
>>>>>>>> OK, it seems that the very next patch does what I was looking for. So
>>>>>>>> why all the churn in this patch?
>>>>>>>> Why do you want to make root even more special?
>>>>>>>
>>>>>>> Because I am operating under the assumption that we want to handle that
>>>>>>> transparently and keep things working. If you tell me: "Hey, reading
>>>>>>> memory.usage_in_bytes from root should return 0!", then I can get rid of
>>>>>>> that.
>>>>>>
>>>>>> If you simply switch to accounting for root then you do not have to care
>>>>>> about this, don't you?
>>>>>>
>>>>> Of course not, but the whole point here is *not* accounting root.
>>>>
>>>> I thought the objective was to not account root if there are no
>>>> children.
>>>
>>> It is the goal, yes. As I said: I want the root-only case to keep
>>> providing userspace with meaningful statistics,
>>
>> Sure, statistics need to stay at the place. I am not objecting on that.
>>
>>> therefore the bypass.
>>
>> I am just arguing about bypassing root even when there are children and
>> use_hierarchy == 1 because it adds more code to maintain.
>>
>>> But since the machinery is in place, it is trivial to keep bypassing for
>>> use_hierarchy = 1 at the root level. If you believe it would be simpler,
>>> I could refrain from doing it.
>>
>> I am all for "the simple the better" and add more optimizations on top.
>> We have a real issue now and we should eliminate it. My original plan
>> was to look at the bottlenecks and eliminate them one after another in
>> smaller steps. But all the work I have on the plate is preempting me
>> from looking into that...
>>
> Been there, done that =)
>
> I have no objections removing the special case for use_hierarchy == 1.
>
I agree.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 72+ messages in thread

end of thread, other threads:[~2013-03-21  6:10 UTC | newest]

Thread overview: 72+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-03-05 13:10 [PATCH v2 0/5] bypass root memcg charges if no memcgs are possible Glauber Costa
2013-03-05 13:10 ` Glauber Costa
2013-03-05 13:10 ` [PATCH v2 1/5] memcg: make nocpu_base available for non hotplug Glauber Costa
2013-03-05 13:10   ` Glauber Costa
2013-03-06  0:04   ` Kamezawa Hiroyuki
2013-03-06  0:04     ` Kamezawa Hiroyuki
2013-03-19 11:07   ` Michal Hocko
2013-03-05 13:10 ` [PATCH v2 2/5] memcg: provide root figures from system totals Glauber Costa
2013-03-05 13:10   ` Glauber Costa
2013-03-06  0:27   ` Kamezawa Hiroyuki
2013-03-06  8:30     ` Glauber Costa
2013-03-06  8:30       ` Glauber Costa
2013-03-06 10:45       ` Kamezawa Hiroyuki
2013-03-06 10:45         ` Kamezawa Hiroyuki
2013-03-06 10:52         ` Glauber Costa
2013-03-06 10:52           ` Glauber Costa
2013-03-06 10:59           ` Kamezawa Hiroyuki
2013-03-06 10:59             ` Kamezawa Hiroyuki
2013-03-13  6:58             ` Sha Zhengju
2013-03-13  6:58               ` Sha Zhengju
2013-03-13  9:15               ` Kamezawa Hiroyuki
2013-03-13  9:15                 ` Kamezawa Hiroyuki
2013-03-13  9:59                 ` Sha Zhengju
2013-03-14  0:03                   ` Kamezawa Hiroyuki
2013-03-14  0:03                     ` Kamezawa Hiroyuki
2013-03-06 10:50       ` Kamezawa Hiroyuki
2013-03-06 10:50         ` Kamezawa Hiroyuki
2013-03-19 12:46   ` Michal Hocko
2013-03-19 12:46     ` Michal Hocko
2013-03-19 12:55     ` Michal Hocko
2013-03-19 12:55       ` Michal Hocko
2013-03-20  7:03       ` Glauber Costa
2013-03-20  8:03         ` Michal Hocko
2013-03-20  8:03           ` Michal Hocko
2013-03-20  8:08           ` Glauber Costa
2013-03-20  8:18             ` Michal Hocko
2013-03-20  8:18               ` Michal Hocko
2013-03-20  8:34               ` Glauber Costa
2013-03-20  8:58                 ` Michal Hocko
2013-03-20  9:30                   ` Glauber Costa
2013-03-21  6:08                     ` Kamezawa Hiroyuki
2013-03-21  6:08                       ` Kamezawa Hiroyuki
2013-03-20 16:40             ` Anton Vorontsov
2013-03-20 16:40               ` Anton Vorontsov
2013-03-20  7:04     ` Glauber Costa
2013-03-05 13:10 ` [PATCH v2 3/5] memcg: make it suck faster Glauber Costa
2013-03-05 13:10   ` Glauber Costa
2013-03-06  0:46   ` Kamezawa Hiroyuki
2013-03-06  0:46     ` Kamezawa Hiroyuki
2013-03-06  8:38     ` Glauber Costa
2013-03-06 10:54       ` Kamezawa Hiroyuki
2013-03-06 10:54         ` Kamezawa Hiroyuki
2013-03-13  8:08   ` Sha Zhengju
2013-03-13  8:08     ` Sha Zhengju
2013-03-20  7:13     ` Glauber Costa
2013-03-20  7:13       ` Glauber Costa
2013-03-19 13:58   ` Michal Hocko
2013-03-19 13:58     ` Michal Hocko
2013-03-20  7:00     ` Glauber Costa
2013-03-20  8:13       ` Michal Hocko
2013-03-20  8:13         ` Michal Hocko
2013-03-05 13:10 ` [PATCH v2 4/5] memcg: do not call page_cgroup_init at system_boot Glauber Costa
2013-03-05 13:10   ` Glauber Costa
2013-03-06  1:07   ` Kamezawa Hiroyuki
2013-03-06  1:07     ` Kamezawa Hiroyuki
2013-03-06  8:22     ` Glauber Costa
2013-03-06  8:22       ` Glauber Costa
2013-03-19 14:06   ` Michal Hocko
2013-03-05 13:10 ` [PATCH v2 5/5] memcg: do not walk all the way to the root for memcg Glauber Costa
2013-03-05 13:10   ` Glauber Costa
2013-03-06  1:08   ` Kamezawa Hiroyuki
2013-03-06  1:08     ` Kamezawa Hiroyuki

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.