linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 00/13] Memcg Kernel Memory Tracking.
@ 2012-03-09 20:39 Suleiman Souhlal
  2012-03-09 20:39 ` [PATCH v2 01/13] memcg: Consolidate various flags into a single flags field Suleiman Souhlal
                   ` (13 more replies)
  0 siblings, 14 replies; 45+ messages in thread
From: Suleiman Souhlal @ 2012-03-09 20:39 UTC (permalink / raw)
  To: cgroups
  Cc: suleiman, glommer, kamezawa.hiroyu, penberg, cl, yinghan, hughd,
	gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel, Suleiman Souhlal

This is v2 of my kernel memory tracking patchset for memcg.

Lots of changes based on feedback from Glauber and Kamezawa.
In particular, I changed it to be opt-in instead of opt-out:
In order for a slab type to be tracked, it has to be marked with
SLAB_MEMCG_ACCT at kmem_cache_create() time.
Currently, only dentries and kmalloc are tracked.
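
To make the opt-in concrete (this example is illustrative and not part
of the patches; "foo", struct foo and foo_cache_init() are made-up
names):

	#include <linux/init.h>
	#include <linux/slab.h>

	struct foo { unsigned long data; };
	static struct kmem_cache *foo_cachep;

	static int __init foo_cache_init(void)
	{
		/* SLAB_MEMCG_ACCT opts this cache in to memcg accounting. */
		foo_cachep = kmem_cache_create("foo", sizeof(struct foo), 0,
					       SLAB_MEMCG_ACCT, NULL);
		return foo_cachep ? 0 : -ENOMEM;
	}

Caches created without the flag are simply not accounted, as before.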

Planned for v3:
 - Slub support.
 - Using a static_branch to remove overhead when no cgroups have been
   created.
 - Getting rid of kmem_cache_get_ref/drop_ref pair in kmem_cache_free.

Detailed change list from v1 (http://marc.info/?l=linux-mm&m=133038361014525):
 - Fixed misspelling in documentation.
 - Added flags field to struct mem_cgroup.
 - Moved independent_kmem_limit into flags.
 - Renamed kmem_bytes to kmem.
 - Divided consume_stock changes into two changes.
 - Fixed crash at boot when not every commit is applied.
 - Moved the new fields in kmem_cache into their own struct.
 - Got rid of SLAB_MEMCG slab flag.
 - Dropped accounting to root.
 - Added css_id into memcg slab name.
 - Changed memcg cache creation to always be deferred to workqueue.
 - Replaced bypass_bytes with overcharging the cgroup.
 - Got rid of #ifdef CONFIG_SLAB from memcontrol.c.
 - Got rid of __GFP_NOACCOUNT, changing to an opt-in model.
 - Removed kmem limit when turning off independent limit.
 - Moved the accounting of kmalloc to its own patch.
 - Removed useless parameters from memcg_create_kmem_cache().
 - Got a ref to the css when enqueuing cache for creation.
 - Increased MAX_KMEM_CACHE_TYPES to 400.

Suleiman Souhlal (13):
  memcg: Consolidate various flags into a single flags field.
  memcg: Kernel memory accounting infrastructure.
  memcg: Uncharge all kmem when deleting a cgroup.
  memcg: Make it possible to use the stock for more than one page.
  memcg: Reclaim when more than one page needed.
  slab: Add kmem_cache_gfp_flags() helper function.
  memcg: Slab accounting.
  memcg: Make dentry slab memory accounted in kernel memory accounting.
  memcg: Account for kmalloc in kernel memory accounting.
  memcg: Track all the memcg children of a kmem_cache.
  memcg: Handle bypassed kernel memory charges.
  memcg: Per-memcg memory.kmem.slabinfo file.
  memcg: Document kernel memory accounting.

 Documentation/cgroups/memory.txt |   44 +++-
 fs/dcache.c                      |    4 +-
 include/linux/memcontrol.h       |   30 ++-
 include/linux/slab.h             |   56 ++++
 include/linux/slab_def.h         |   79 +++++-
 include/linux/slob_def.h         |    6 +
 include/linux/slub_def.h         |    9 +
 init/Kconfig                     |    2 +-
 mm/memcontrol.c                  |  633 ++++++++++++++++++++++++++++++++++---
 mm/slab.c                        |  431 +++++++++++++++++++++++---
 10 files changed, 1183 insertions(+), 111 deletions(-)

-- Suleiman


* [PATCH v2 01/13] memcg: Consolidate various flags into a single flags field.
  2012-03-09 20:39 [PATCH v2 00/13] Memcg Kernel Memory Tracking Suleiman Souhlal
@ 2012-03-09 20:39 ` Suleiman Souhlal
  2012-03-11  7:50   ` Glauber Costa
  2012-03-09 20:39 ` [PATCH v2 02/13] memcg: Kernel memory accounting infrastructure Suleiman Souhlal
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 45+ messages in thread
From: Suleiman Souhlal @ 2012-03-09 20:39 UTC (permalink / raw)
  To: cgroups
  Cc: suleiman, glommer, kamezawa.hiroyu, penberg, cl, yinghan, hughd,
	gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel, Suleiman Souhlal

Since there is an ever-increasing number of flags in the memcg
struct, consolidate them into a single flags field.
The flags that we consolidate are:
    - use_hierarchy
    - memsw_is_minimum
    - oom_kill_disable

Signed-off-by: Suleiman Souhlal <suleiman@google.com>
---
 mm/memcontrol.c |  112 ++++++++++++++++++++++++++++++++++++++-----------------
 1 files changed, 78 insertions(+), 34 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5585dc3..37ad2cb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -213,6 +213,15 @@ struct mem_cgroup_eventfd_list {
 static void mem_cgroup_threshold(struct mem_cgroup *memcg);
 static void mem_cgroup_oom_notify(struct mem_cgroup *memcg);
 
+enum memcg_flags {
+	MEMCG_USE_HIERARCHY,	/*
+				 * Should the accounting and control be
+				 * hierarchical, per subtree?
+				 */
+	MEMCG_MEMSW_IS_MINIMUM,	/* Set when res.limit == memsw.limit */
+	MEMCG_OOM_KILL_DISABLE,	/* OOM-Killer disable */
+};
+
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
@@ -245,10 +254,7 @@ struct mem_cgroup {
 	atomic_t	numainfo_events;
 	atomic_t	numainfo_updating;
 #endif
-	/*
-	 * Should the accounting and control be hierarchical, per subtree?
-	 */
-	bool use_hierarchy;
+	unsigned long flags;
 
 	bool		oom_lock;
 	atomic_t	under_oom;
@@ -256,11 +262,6 @@ struct mem_cgroup {
 	atomic_t	refcnt;
 
 	int	swappiness;
-	/* OOM-Killer disable */
-	int		oom_kill_disable;
-
-	/* set when res.limit == memsw.limit */
-	bool		memsw_is_minimum;
 
 	/* protect arrays of thresholds */
 	struct mutex thresholds_lock;
@@ -371,6 +372,24 @@ enum charge_type {
 static void mem_cgroup_get(struct mem_cgroup *memcg);
 static void mem_cgroup_put(struct mem_cgroup *memcg);
 
+static inline bool
+mem_cgroup_test_flag(const struct mem_cgroup *memcg, enum memcg_flags flag)
+{
+	return test_bit(flag, &memcg->flags);
+}
+
+static inline void
+mem_cgroup_set_flag(struct mem_cgroup *memcg, enum memcg_flags flag)
+{
+	set_bit(flag, &memcg->flags);
+}
+
+static inline void
+mem_cgroup_clear_flag(struct mem_cgroup *memcg, enum memcg_flags flag)
+{
+	clear_bit(flag, &memcg->flags);
+}
+
 /* Writing them here to avoid exposing memcg's inner layout */
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 #include <net/sock.h>
@@ -876,7 +895,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	if (prev && prev != root)
 		css_put(&prev->css);
 
-	if (!root->use_hierarchy && root != root_mem_cgroup) {
+	if (!mem_cgroup_test_flag(root, MEMCG_USE_HIERARCHY) && root !=
+	    root_mem_cgroup) {
 		if (prev)
 			return NULL;
 		return root;
@@ -1126,8 +1146,8 @@ static bool mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
 		struct mem_cgroup *memcg)
 {
 	if (root_memcg != memcg) {
-		return (root_memcg->use_hierarchy &&
-			css_is_ancestor(&memcg->css, &root_memcg->css));
+		return mem_cgroup_test_flag(root_memcg, MEMCG_USE_HIERARCHY) &&
+			css_is_ancestor(&memcg->css, &root_memcg->css);
 	}
 
 	return true;
@@ -1460,7 +1480,8 @@ static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,
 
 	if (flags & MEM_CGROUP_RECLAIM_NOSWAP)
 		noswap = true;
-	if (!(flags & MEM_CGROUP_RECLAIM_SHRINK) && memcg->memsw_is_minimum)
+	if (!(flags & MEM_CGROUP_RECLAIM_SHRINK) && mem_cgroup_test_flag(memcg,
+	    MEMCG_MEMSW_IS_MINIMUM))
 		noswap = true;
 
 	for (loop = 0; loop < MEM_CGROUP_MAX_RECLAIM_LOOPS; loop++) {
@@ -1813,7 +1834,7 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
 	 * under OOM is always welcomed, use TASK_KILLABLE here.
 	 */
 	prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
-	if (!locked || memcg->oom_kill_disable)
+	if (!locked || mem_cgroup_test_flag(memcg, MEMCG_OOM_KILL_DISABLE))
 		need_to_kill = false;
 	if (locked)
 		mem_cgroup_oom_notify(memcg);
@@ -3416,9 +3437,11 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
 		ret = res_counter_set_limit(&memcg->res, val);
 		if (!ret) {
 			if (memswlimit == val)
-				memcg->memsw_is_minimum = true;
+				mem_cgroup_set_flag(memcg,
+				    MEMCG_MEMSW_IS_MINIMUM);
 			else
-				memcg->memsw_is_minimum = false;
+				mem_cgroup_clear_flag(memcg,
+				    MEMCG_MEMSW_IS_MINIMUM);
 		}
 		mutex_unlock(&set_limit_mutex);
 
@@ -3475,9 +3498,11 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
 		ret = res_counter_set_limit(&memcg->memsw, val);
 		if (!ret) {
 			if (memlimit == val)
-				memcg->memsw_is_minimum = true;
+				mem_cgroup_set_flag(memcg,
+				    MEMCG_MEMSW_IS_MINIMUM);
 			else
-				memcg->memsw_is_minimum = false;
+				mem_cgroup_clear_flag(memcg,
+				    MEMCG_MEMSW_IS_MINIMUM);
 		}
 		mutex_unlock(&set_limit_mutex);
 
@@ -3745,7 +3770,8 @@ int mem_cgroup_force_empty_write(struct cgroup *cont, unsigned int event)
 
 static u64 mem_cgroup_hierarchy_read(struct cgroup *cont, struct cftype *cft)
 {
-	return mem_cgroup_from_cont(cont)->use_hierarchy;
+	return mem_cgroup_test_flag(mem_cgroup_from_cont(cont),
+	    MEMCG_USE_HIERARCHY);
 }
 
 static int mem_cgroup_hierarchy_write(struct cgroup *cont, struct cftype *cft,
@@ -3768,10 +3794,14 @@ static int mem_cgroup_hierarchy_write(struct cgroup *cont, struct cftype *cft,
 	 * For the root cgroup, parent_mem is NULL, we allow value to be
 	 * set if there are no children.
 	 */
-	if ((!parent_memcg || !parent_memcg->use_hierarchy) &&
-				(val == 1 || val == 0)) {
+	if ((!parent_memcg || !mem_cgroup_test_flag(parent_memcg,
+	    MEMCG_USE_HIERARCHY)) && (val == 1 || val == 0)) {
 		if (list_empty(&cont->children))
-			memcg->use_hierarchy = val;
+			if (val)
+				mem_cgroup_set_flag(memcg, MEMCG_USE_HIERARCHY);
+			else
+				mem_cgroup_clear_flag(memcg,
+				    MEMCG_USE_HIERARCHY);
 		else
 			retval = -EBUSY;
 	} else
@@ -3903,13 +3933,13 @@ static void memcg_get_hierarchical_limit(struct mem_cgroup *memcg,
 	min_limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
 	min_memsw_limit = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
 	cgroup = memcg->css.cgroup;
-	if (!memcg->use_hierarchy)
+	if (!mem_cgroup_test_flag(memcg, MEMCG_USE_HIERARCHY))
 		goto out;
 
 	while (cgroup->parent) {
 		cgroup = cgroup->parent;
 		memcg = mem_cgroup_from_cont(cgroup);
-		if (!memcg->use_hierarchy)
+		if (!mem_cgroup_test_flag(memcg, MEMCG_USE_HIERARCHY))
 			break;
 		tmp = res_counter_read_u64(&memcg->res, RES_LIMIT);
 		min_limit = min(min_limit, tmp);
@@ -4206,8 +4236,9 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
 	cgroup_lock();
 
 	/* If under hierarchy, only empty-root can set this value */
-	if ((parent->use_hierarchy) ||
-	    (memcg->use_hierarchy && !list_empty(&cgrp->children))) {
+	if (mem_cgroup_test_flag(parent, MEMCG_USE_HIERARCHY) ||
+	    (mem_cgroup_test_flag(memcg, MEMCG_USE_HIERARCHY) &&
+	    !list_empty(&cgrp->children))) {
 		cgroup_unlock();
 		return -EINVAL;
 	}
@@ -4518,7 +4549,8 @@ static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
 
-	cb->fill(cb, "oom_kill_disable", memcg->oom_kill_disable);
+	cb->fill(cb, "oom_kill_disable", mem_cgroup_test_flag(memcg,
+	    MEMCG_OOM_KILL_DISABLE));
 
 	if (atomic_read(&memcg->under_oom))
 		cb->fill(cb, "under_oom", 1);
@@ -4541,14 +4573,18 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
 
 	cgroup_lock();
 	/* oom-kill-disable is a flag for subhierarchy. */
-	if ((parent->use_hierarchy) ||
-	    (memcg->use_hierarchy && !list_empty(&cgrp->children))) {
+	if (mem_cgroup_test_flag(parent, MEMCG_USE_HIERARCHY) ||
+	    (mem_cgroup_test_flag(memcg, MEMCG_USE_HIERARCHY) &&
+	    !list_empty(&cgrp->children))) {
 		cgroup_unlock();
 		return -EINVAL;
 	}
-	memcg->oom_kill_disable = val;
-	if (!val)
+	if (val)
+		mem_cgroup_set_flag(memcg, MEMCG_OOM_KILL_DISABLE);
+	else {
+		mem_cgroup_clear_flag(memcg, MEMCG_OOM_KILL_DISABLE);
 		memcg_oom_recover(memcg);
+	}
 	cgroup_unlock();
 	return 0;
 }
@@ -4916,11 +4952,19 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 		hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
 	} else {
 		parent = mem_cgroup_from_cont(cont->parent);
-		memcg->use_hierarchy = parent->use_hierarchy;
-		memcg->oom_kill_disable = parent->oom_kill_disable;
+
+		if (mem_cgroup_test_flag(parent, MEMCG_USE_HIERARCHY))
+			mem_cgroup_set_flag(memcg, MEMCG_USE_HIERARCHY);
+		else
+			mem_cgroup_clear_flag(memcg, MEMCG_USE_HIERARCHY);
+
+		if (mem_cgroup_test_flag(parent, MEMCG_OOM_KILL_DISABLE))
+			mem_cgroup_set_flag(memcg, MEMCG_OOM_KILL_DISABLE);
+		else
+			mem_cgroup_clear_flag(memcg, MEMCG_OOM_KILL_DISABLE);
 	}
 
-	if (parent && parent->use_hierarchy) {
+	if (parent && mem_cgroup_test_flag(parent, MEMCG_USE_HIERARCHY)) {
 		res_counter_init(&memcg->res, &parent->res);
 		res_counter_init(&memcg->memsw, &parent->memsw);
 		/*
-- 
1.7.7.3


* [PATCH v2 02/13] memcg: Kernel memory accounting infrastructure.
  2012-03-09 20:39 [PATCH v2 00/13] Memcg Kernel Memory Tracking Suleiman Souhlal
  2012-03-09 20:39 ` [PATCH v2 01/13] memcg: Consolidate various flags into a single flags field Suleiman Souhlal
@ 2012-03-09 20:39 ` Suleiman Souhlal
  2012-03-11  8:12   ` Glauber Costa
  2012-03-12 12:38   ` Glauber Costa
  2012-03-09 20:39 ` [PATCH v2 03/13] memcg: Uncharge all kmem when deleting a cgroup Suleiman Souhlal
                   ` (11 subsequent siblings)
  13 siblings, 2 replies; 45+ messages in thread
From: Suleiman Souhlal @ 2012-03-09 20:39 UTC (permalink / raw)
  To: cgroups
  Cc: suleiman, glommer, kamezawa.hiroyu, penberg, cl, yinghan, hughd,
	gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel, Suleiman Souhlal

Enabled with CONFIG_CGROUP_MEM_RES_CTLR_KMEM.

Adds the following files:
    - memory.kmem.independent_kmem_limit
    - memory.kmem.usage_in_bytes
    - memory.kmem.limit_in_bytes
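
The relationship between the two counters, as implemented by
memcg_charge_kmem() and mem_cgroup_write() below, is roughly the
following (a simplified sketch; error handling and edge cases omitted):

	/* Charging "delta" bytes of kernel memory to a memcg. */
	if (!mem_cgroup_test_flag(memcg, MEMCG_INDEPENDENT_KMEM_LIMIT))
		/* kmem also counts toward memory.usage_in_bytes. */
		__mem_cgroup_try_charge(NULL, gfp, delta / PAGE_SIZE,
					&memcg, may_oom);
	/* Always tracked in memory.kmem.usage_in_bytes. */
	res_counter_charge(&memcg->kmem, delta, &fail_res);

memory.kmem.limit_in_bytes can only be written while
memory.kmem.independent_kmem_limit is enabled; clearing the flag resets
the kmem limit to RESOURCE_MAX.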

Signed-off-by: Suleiman Souhlal <suleiman@google.com>
---
 mm/memcontrol.c |  136 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 135 insertions(+), 1 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 37ad2cb..e6fd558 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -220,6 +220,10 @@ enum memcg_flags {
 				 */
 	MEMCG_MEMSW_IS_MINIMUM,	/* Set when res.limit == memsw.limit */
 	MEMCG_OOM_KILL_DISABLE,	/* OOM-Killer disable */
+	MEMCG_INDEPENDENT_KMEM_LIMIT,	/*
+					 * kernel memory is not counted in
+					 * memory.usage_in_bytes
+					 */
 };
 
 /*
@@ -244,6 +248,10 @@ struct mem_cgroup {
 	 */
 	struct res_counter memsw;
 	/*
+	 * the counter to account for kernel memory usage.
+	 */
+	struct res_counter kmem;
+	/*
 	 * Per cgroup active and inactive list, similar to the
 	 * per zone LRU lists.
 	 */
@@ -355,6 +363,7 @@ enum charge_type {
 #define _MEM			(0)
 #define _MEMSWAP		(1)
 #define _OOM_TYPE		(2)
+#define _KMEM			(3)
 #define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
 #define MEMFILE_TYPE(val)	(((val) >> 16) & 0xffff)
 #define MEMFILE_ATTR(val)	((val) & 0xffff)
@@ -371,6 +380,8 @@ enum charge_type {
 
 static void mem_cgroup_get(struct mem_cgroup *memcg);
 static void mem_cgroup_put(struct mem_cgroup *memcg);
+static void memcg_kmem_init(struct mem_cgroup *memcg,
+    struct mem_cgroup *parent);
 
 static inline bool
 mem_cgroup_test_flag(const struct mem_cgroup *memcg, enum memcg_flags flag)
@@ -1435,6 +1446,10 @@ done:
 		res_counter_read_u64(&memcg->memsw, RES_USAGE) >> 10,
 		res_counter_read_u64(&memcg->memsw, RES_LIMIT) >> 10,
 		res_counter_read_u64(&memcg->memsw, RES_FAILCNT));
+	printk(KERN_INFO "kmem: usage %llukB, limit %llukB, failcnt %llu\n",
+		res_counter_read_u64(&memcg->kmem, RES_USAGE) >> 10,
+		res_counter_read_u64(&memcg->kmem, RES_LIMIT) >> 10,
+		res_counter_read_u64(&memcg->kmem, RES_FAILCNT));
 }
 
 /*
@@ -3868,6 +3883,9 @@ static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
 		else
 			val = res_counter_read_u64(&memcg->memsw, name);
 		break;
+	case _KMEM:
+		val = res_counter_read_u64(&memcg->kmem, name);
+		break;
 	default:
 		BUG();
 		break;
@@ -3900,8 +3918,15 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
 			break;
 		if (type == _MEM)
 			ret = mem_cgroup_resize_limit(memcg, val);
-		else
+		else if (type == _MEMSWAP)
 			ret = mem_cgroup_resize_memsw_limit(memcg, val);
+		else if (type == _KMEM) {
+			if (!mem_cgroup_test_flag(memcg,
+			    MEMCG_INDEPENDENT_KMEM_LIMIT))
+				return -EINVAL;
+			ret = res_counter_set_limit(&memcg->kmem, val);
+		} else
+			return -EINVAL;
 		break;
 	case RES_SOFT_LIMIT:
 		ret = res_counter_memparse_write_strategy(buffer, &val);
@@ -4606,8 +4631,56 @@ static int mem_control_numa_stat_open(struct inode *unused, struct file *file)
 #endif /* CONFIG_NUMA */
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+static u64
+mem_cgroup_independent_kmem_limit_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	return mem_cgroup_test_flag(mem_cgroup_from_cont(cgrp),
+	    MEMCG_INDEPENDENT_KMEM_LIMIT);
+}
+
+static int mem_cgroup_independent_kmem_limit_write(struct cgroup *cgrp,
+    struct cftype *cft, u64 val)
+{
+	struct mem_cgroup *memcg;
+
+	memcg = mem_cgroup_from_cont(cgrp);
+	if (val)
+		mem_cgroup_set_flag(memcg, MEMCG_INDEPENDENT_KMEM_LIMIT);
+	else {
+		mem_cgroup_clear_flag(memcg, MEMCG_INDEPENDENT_KMEM_LIMIT);
+		res_counter_set_limit(&memcg->kmem, RESOURCE_MAX);
+	}
+
+	return 0;
+}
+
+static struct cftype kmem_cgroup_files[] = {
+	{
+		.name = "kmem.independent_kmem_limit",
+		.write_u64 = mem_cgroup_independent_kmem_limit_write,
+		.read_u64 = mem_cgroup_independent_kmem_limit_read,
+	},
+	{
+		.name = "kmem.limit_in_bytes",
+		.private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
+		.write_string = mem_cgroup_write,
+		.read_u64 = mem_cgroup_read,
+	},
+	{
+		.name = "kmem.usage_in_bytes",
+		.private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
+		.read_u64 = mem_cgroup_read,
+	},
+};
+
 static int register_kmem_files(struct cgroup *cont, struct cgroup_subsys *ss)
 {
+	int ret;
+
+	ret = cgroup_add_files(cont, ss, kmem_cgroup_files,
+	    ARRAY_SIZE(kmem_cgroup_files));
+	if (ret)
+		return ret;
 	/*
 	 * Part of this would be better living in a separate allocation
 	 * function, leaving us with just the cgroup tree population work.
@@ -4621,6 +4694,10 @@ static int register_kmem_files(struct cgroup *cont, struct cgroup_subsys *ss)
 static void kmem_cgroup_destroy(struct cgroup_subsys *ss,
 				struct cgroup *cont)
 {
+	struct mem_cgroup *memcg;
+
+	memcg = mem_cgroup_from_cont(cont);
+	BUG_ON(res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0);
 	mem_cgroup_sockets_destroy(cont, ss);
 }
 #else
@@ -4980,6 +5057,8 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	}
 	memcg->last_scanned_node = MAX_NUMNODES;
 	INIT_LIST_HEAD(&memcg->oom_notify);
+	memcg_kmem_init(memcg, parent && mem_cgroup_test_flag(parent,
+	    MEMCG_USE_HIERARCHY) ? parent : NULL);
 
 	if (parent)
 		memcg->swappiness = mem_cgroup_swappiness(parent);
@@ -5561,3 +5640,58 @@ static int __init enable_swap_account(char *s)
 __setup("swapaccount=", enable_swap_account);
 
 #endif
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+int
+memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, long long delta)
+{
+	struct res_counter *fail_res;
+	struct mem_cgroup *_memcg;
+	int may_oom, ret;
+
+	may_oom = (gfp & __GFP_WAIT) && (gfp & __GFP_FS) &&
+	    !(gfp & __GFP_NORETRY);
+
+	ret = 0;
+
+	_memcg = memcg;
+	if (memcg && !mem_cgroup_test_flag(memcg,
+	    MEMCG_INDEPENDENT_KMEM_LIMIT)) {
+		ret = __mem_cgroup_try_charge(NULL, gfp, delta / PAGE_SIZE,
+		    &_memcg, may_oom);
+		if (ret == -ENOMEM)
+			return ret;
+	}
+
+	if (memcg && _memcg == memcg)
+		ret = res_counter_charge(&memcg->kmem, delta, &fail_res);
+
+	return ret;
+}
+
+void
+memcg_uncharge_kmem(struct mem_cgroup *memcg, long long delta)
+{
+	if (memcg)
+		res_counter_uncharge(&memcg->kmem, delta);
+
+	if (memcg && !mem_cgroup_test_flag(memcg, MEMCG_INDEPENDENT_KMEM_LIMIT))
+		res_counter_uncharge(&memcg->res, delta);
+}
+
+static void
+memcg_kmem_init(struct mem_cgroup *memcg, struct mem_cgroup *parent)
+{
+	struct res_counter *parent_res;
+
+	parent_res = NULL;
+	if (parent && parent != root_mem_cgroup)
+		parent_res = &parent->kmem;
+	res_counter_init(&memcg->kmem, parent_res);
+}
+#else /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
+static void
+memcg_kmem_init(struct mem_cgroup *memcg, struct mem_cgroup *parent)
+{
+}
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
-- 
1.7.7.3


* [PATCH v2 03/13] memcg: Uncharge all kmem when deleting a cgroup.
  2012-03-09 20:39 [PATCH v2 00/13] Memcg Kernel Memory Tracking Suleiman Souhlal
  2012-03-09 20:39 ` [PATCH v2 01/13] memcg: Consolidate various flags into a single flags field Suleiman Souhlal
  2012-03-09 20:39 ` [PATCH v2 02/13] memcg: Kernel memory accounting infrastructure Suleiman Souhlal
@ 2012-03-09 20:39 ` Suleiman Souhlal
  2012-03-11  8:19   ` Glauber Costa
  2012-03-13  6:27   ` KAMEZAWA Hiroyuki
  2012-03-09 20:39 ` [PATCH v2 04/13] memcg: Make it possible to use the stock for more than one page Suleiman Souhlal
                   ` (10 subsequent siblings)
  13 siblings, 2 replies; 45+ messages in thread
From: Suleiman Souhlal @ 2012-03-09 20:39 UTC (permalink / raw)
  To: cgroups
  Cc: suleiman, glommer, kamezawa.hiroyu, penberg, cl, yinghan, hughd,
	gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel, Suleiman Souhlal

Signed-off-by: Suleiman Souhlal <suleiman@google.com>
---
 mm/memcontrol.c |   31 ++++++++++++++++++++++++++++++-
 1 files changed, 30 insertions(+), 1 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e6fd558..6fbb438 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -382,6 +382,7 @@ static void mem_cgroup_get(struct mem_cgroup *memcg);
 static void mem_cgroup_put(struct mem_cgroup *memcg);
 static void memcg_kmem_init(struct mem_cgroup *memcg,
     struct mem_cgroup *parent);
+static void memcg_kmem_move(struct mem_cgroup *memcg);
 
 static inline bool
 mem_cgroup_test_flag(const struct mem_cgroup *memcg, enum memcg_flags flag)
@@ -3700,6 +3701,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg, bool free_all)
 	int ret;
 	int node, zid, shrink;
 	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
+	unsigned long usage;
 	struct cgroup *cgrp = memcg->css.cgroup;
 
 	css_get(&memcg->css);
@@ -3719,6 +3721,8 @@ move_account:
 		/* This is for making all *used* pages to be on LRU. */
 		lru_add_drain_all();
 		drain_all_stock_sync(memcg);
+		if (!free_all)
+			memcg_kmem_move(memcg);
 		ret = 0;
 		mem_cgroup_start_move(memcg);
 		for_each_node_state(node, N_HIGH_MEMORY) {
@@ -3740,8 +3744,14 @@ move_account:
 		if (ret == -ENOMEM)
 			goto try_to_free;
 		cond_resched();
+		usage = memcg->res.usage;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+		if (free_all && !mem_cgroup_test_flag(memcg,
+		    MEMCG_INDEPENDENT_KMEM_LIMIT))
+			usage -= memcg->kmem.usage;
+#endif
 	/* "ret" should also be checked to ensure all lists are empty. */
-	} while (memcg->res.usage > 0 || ret);
+	} while (usage > 0 || ret);
 out:
 	css_put(&memcg->css);
 	return ret;
@@ -5689,9 +5699,28 @@ memcg_kmem_init(struct mem_cgroup *memcg, struct mem_cgroup *parent)
 		parent_res = &parent->kmem;
 	res_counter_init(&memcg->kmem, parent_res);
 }
+
+static void
+memcg_kmem_move(struct mem_cgroup *memcg)
+{
+	unsigned long flags;
+	long kmem;
+
+	spin_lock_irqsave(&memcg->kmem.lock, flags);
+	kmem = memcg->kmem.usage;
+	res_counter_uncharge_locked(&memcg->kmem, kmem);
+	spin_unlock_irqrestore(&memcg->kmem.lock, flags);
+	if (!mem_cgroup_test_flag(memcg, MEMCG_INDEPENDENT_KMEM_LIMIT))
+		res_counter_uncharge(&memcg->res, kmem);
+}
 #else /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
 static void
 memcg_kmem_init(struct mem_cgroup *memcg, struct mem_cgroup *parent)
 {
 }
+
+static void
+memcg_kmem_move(struct mem_cgroup *memcg)
+{
+}
 #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
-- 
1.7.7.3


* [PATCH v2 04/13] memcg: Make it possible to use the stock for more than one page.
  2012-03-09 20:39 [PATCH v2 00/13] Memcg Kernel Memory Tracking Suleiman Souhlal
                   ` (2 preceding siblings ...)
  2012-03-09 20:39 ` [PATCH v2 03/13] memcg: Uncharge all kmem when deleting a cgroup Suleiman Souhlal
@ 2012-03-09 20:39 ` Suleiman Souhlal
  2012-03-11 10:49   ` Glauber Costa
  2012-03-09 20:39 ` [PATCH v2 05/13] memcg: Reclaim when more than one page needed Suleiman Souhlal
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 45+ messages in thread
From: Suleiman Souhlal @ 2012-03-09 20:39 UTC (permalink / raw)
  To: cgroups
  Cc: suleiman, glommer, kamezawa.hiroyu, penberg, cl, yinghan, hughd,
	gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel, Suleiman Souhlal

Signed-off-by: Suleiman Souhlal <suleiman@google.com>
---
 mm/memcontrol.c |   18 +++++++++---------
 1 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6fbb438..f605100 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1965,19 +1965,19 @@ static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
 static DEFINE_MUTEX(percpu_charge_mutex);
 
 /*
- * Try to consume stocked charge on this cpu. If success, one page is consumed
- * from local stock and true is returned. If the stock is 0 or charges from a
- * cgroup which is not current target, returns false. This stock will be
- * refilled.
+ * Try to consume stocked charge on this cpu. If success, nr_pages pages are
+ * consumed from local stock and true is returned. If the stock is 0 or
+ * charges from a cgroup which is not current target, returns false.
+ * This stock will be refilled.
  */
-static bool consume_stock(struct mem_cgroup *memcg)
+static bool consume_stock(struct mem_cgroup *memcg, int nr_pages)
 {
 	struct memcg_stock_pcp *stock;
 	bool ret = true;
 
 	stock = &get_cpu_var(memcg_stock);
-	if (memcg == stock->cached && stock->nr_pages)
-		stock->nr_pages--;
+	if (memcg == stock->cached && stock->nr_pages >= nr_pages)
+		stock->nr_pages -= nr_pages;
 	else /* need to call res_counter_charge */
 		ret = false;
 	put_cpu_var(memcg_stock);
@@ -2290,7 +2290,7 @@ again:
 		VM_BUG_ON(css_is_removed(&memcg->css));
 		if (mem_cgroup_is_root(memcg))
 			goto done;
-		if (nr_pages == 1 && consume_stock(memcg))
+		if (consume_stock(memcg, nr_pages))
 			goto done;
 		css_get(&memcg->css);
 	} else {
@@ -2315,7 +2315,7 @@ again:
 			rcu_read_unlock();
 			goto done;
 		}
-		if (nr_pages == 1 && consume_stock(memcg)) {
+		if (consume_stock(memcg, nr_pages)) {
 			/*
 			 * It seems dagerous to access memcg without css_get().
 			 * But considering how consume_stok works, it's not
-- 
1.7.7.3


* [PATCH v2 05/13] memcg: Reclaim when more than one page needed.
  2012-03-09 20:39 [PATCH v2 00/13] Memcg Kernel Memory Tracking Suleiman Souhlal
                   ` (3 preceding siblings ...)
  2012-03-09 20:39 ` [PATCH v2 04/13] memcg: Make it possible to use the stock for more than one page Suleiman Souhlal
@ 2012-03-09 20:39 ` Suleiman Souhlal
  2012-03-09 20:39 ` [PATCH v2 06/13] slab: Add kmem_cache_gfp_flags() helper function Suleiman Souhlal
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 45+ messages in thread
From: Suleiman Souhlal @ 2012-03-09 20:39 UTC (permalink / raw)
  To: cgroups
  Cc: suleiman, glommer, kamezawa.hiroyu, penberg, cl, yinghan, hughd,
	gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel, Suleiman Souhlal

mem_cgroup_do_charge() was written before slab accounting, and expects
three cases: being called for 1 page, being called for a stock of 32 pages,
or being called for a hugepage.  If we call for 2 pages (and several slabs
used in process creation are such, at least with the debug options I had),
it assumes it is being called for stock and just retries without reclaiming.

Fix that by passing down a minsize argument in addition to the csize.

And what to do about that (csize == PAGE_SIZE && ret) retry?  If it's
needed at all (and presumably is since it's there, perhaps to handle
races), then it should be extended to more than PAGE_SIZE, yet how far?
And should there be a retry count limit, and if so, what?  For now retry up to
COSTLY_ORDER (as page_alloc.c does), stay safe with a cond_resched(),
and make sure not to do it if __GFP_NORETRY.

Signed-off-by: Suleiman Souhlal <suleiman@google.com>
---
 mm/memcontrol.c |   17 ++++++++++-------
 1 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f605100..2576a2b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2168,7 +2168,7 @@ enum {
 };
 
 static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
-				unsigned int nr_pages, bool oom_check)
+    unsigned int nr_pages, unsigned int min_pages, bool oom_check)
 {
 	unsigned long csize = nr_pages * PAGE_SIZE;
 	struct mem_cgroup *mem_over_limit;
@@ -2191,18 +2191,18 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	} else
 		mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
 	/*
-	 * nr_pages can be either a huge page (HPAGE_PMD_NR), a batch
-	 * of regular pages (CHARGE_BATCH), or a single regular page (1).
-	 *
 	 * Never reclaim on behalf of optional batching, retry with a
 	 * single page instead.
 	 */
-	if (nr_pages == CHARGE_BATCH)
+	if (nr_pages > min_pages)
 		return CHARGE_RETRY;
 
 	if (!(gfp_mask & __GFP_WAIT))
 		return CHARGE_WOULDBLOCK;
 
+	if (gfp_mask & __GFP_NORETRY)
+		return CHARGE_NOMEM;
+
 	ret = mem_cgroup_reclaim(mem_over_limit, gfp_mask, flags);
 	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
 		return CHARGE_RETRY;
@@ -2215,8 +2215,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * unlikely to succeed so close to the limit, and we fall back
 	 * to regular pages anyway in case of failure.
 	 */
-	if (nr_pages == 1 && ret)
+	if (nr_pages <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER) && ret) {
+		cond_resched();
 		return CHARGE_RETRY;
+	}
 
 	/*
 	 * At task move, charge accounts can be doubly counted. So, it's
@@ -2350,7 +2352,8 @@ again:
 			nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
 		}
 
-		ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, oom_check);
+		ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, nr_pages,
+		    oom_check);
 		switch (ret) {
 		case CHARGE_OK:
 			break;
-- 
1.7.7.3


* [PATCH v2 06/13] slab: Add kmem_cache_gfp_flags() helper function.
  2012-03-09 20:39 [PATCH v2 00/13] Memcg Kernel Memory Tracking Suleiman Souhlal
                   ` (4 preceding siblings ...)
  2012-03-09 20:39 ` [PATCH v2 05/13] memcg: Reclaim when more than one page needed Suleiman Souhlal
@ 2012-03-09 20:39 ` Suleiman Souhlal
  2012-03-11 10:53   ` Glauber Costa
  2012-03-09 20:39 ` [PATCH v2 07/13] memcg: Slab accounting Suleiman Souhlal
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 45+ messages in thread
From: Suleiman Souhlal @ 2012-03-09 20:39 UTC (permalink / raw)
  To: cgroups
  Cc: suleiman, glommer, kamezawa.hiroyu, penberg, cl, yinghan, hughd,
	gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel, Suleiman Souhlal

This function returns the gfp flags that are always applied to
allocations of a kmem_cache.
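
For context, the intended user is the slab accounting patch later in
this series, which folds these flags into the allocation's gfp mask
before deciding whether the allocation may be redirected to a per-memcg
cache (paraphrased from mem_cgroup_get_kmem_cache() in patch 07):

	gfp |= kmem_cache_gfp_flags(cachep);
	if (gfp & __GFP_NOFAIL)
		return cachep;	/* never redirect must-succeed allocations */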

Signed-off-by: Suleiman Souhlal <suleiman@google.com>
---
 include/linux/slab_def.h |    6 ++++++
 include/linux/slob_def.h |    6 ++++++
 include/linux/slub_def.h |    6 ++++++
 3 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
index fbd1117..25f9a6a 100644
--- a/include/linux/slab_def.h
+++ b/include/linux/slab_def.h
@@ -159,6 +159,12 @@ found:
 	return __kmalloc(size, flags);
 }
 
+static inline gfp_t
+kmem_cache_gfp_flags(struct kmem_cache *cachep)
+{
+	return cachep->gfpflags;
+}
+
 #ifdef CONFIG_NUMA
 extern void *__kmalloc_node(size_t size, gfp_t flags, int node);
 extern void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node);
diff --git a/include/linux/slob_def.h b/include/linux/slob_def.h
index 0ec00b3..3fa527d 100644
--- a/include/linux/slob_def.h
+++ b/include/linux/slob_def.h
@@ -34,4 +34,10 @@ static __always_inline void *__kmalloc(size_t size, gfp_t flags)
 	return kmalloc(size, flags);
 }
 
+static inline gfp_t
+kmem_cache_gfp_flags(struct kmem_cache *cachep)
+{
+	return 0;
+}
+
 #endif /* __LINUX_SLOB_DEF_H */
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index a32bcfd..5911d81 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -313,4 +313,10 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
 }
 #endif
 
+static inline gfp_t
+kmem_cache_gfp_flags(struct kmem_cache *cachep)
+{
+	return cachep->allocflags;
+}
+
 #endif /* _LINUX_SLUB_DEF_H */
-- 
1.7.7.3


* [PATCH v2 07/13] memcg: Slab accounting.
  2012-03-09 20:39 [PATCH v2 00/13] Memcg Kernel Memory Tracking Suleiman Souhlal
                   ` (5 preceding siblings ...)
  2012-03-09 20:39 ` [PATCH v2 06/13] slab: Add kmem_cache_gfp_flags() helper function Suleiman Souhlal
@ 2012-03-09 20:39 ` Suleiman Souhlal
  2012-03-11 10:25   ` Glauber Costa
  2012-03-09 20:39 ` [PATCH v2 08/13] memcg: Make dentry slab memory accounted in kernel memory accounting Suleiman Souhlal
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 45+ messages in thread
From: Suleiman Souhlal @ 2012-03-09 20:39 UTC (permalink / raw)
  To: cgroups
  Cc: suleiman, glommer, kamezawa.hiroyu, penberg, cl, yinghan, hughd,
	gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel, Suleiman Souhlal

Introduce per-cgroup kmem_caches for memcg slab accounting, which
get created asynchronously the first time we do an allocation of
that type in the cgroup.
The cgroup cache gets used in subsequent allocations, and permits
accounting of slab memory on a per-page basis.

For a slab type to get accounted, the SLAB_MEMCG_ACCT flag has to be
passed to kmem_cache_create().

The per-cgroup kmem_caches get looked up at slab allocation time,
in a MAX_KMEM_CACHE_TYPES-sized array in the memcg structure, based
on the original kmem_cache's id, which gets allocated when the original
cache gets created.

Only allocations that can be attributed to a cgroup get charged.

Each cgroup kmem_cache has a refcount that dictates the lifetime
of the cache: We destroy a cgroup cache when its cgroup has been
destroyed and there are no more active objects in the cache.

Signed-off-by: Suleiman Souhlal <suleiman@google.com>
---
 include/linux/memcontrol.h |   30 +++++-
 include/linux/slab.h       |   41 ++++++
 include/linux/slab_def.h   |   72 +++++++++++-
 include/linux/slub_def.h   |    3 +
 init/Kconfig               |    2 +-
 mm/memcontrol.c            |  290 ++++++++++++++++++++++++++++++++++++++++++++
 mm/slab.c                  |  248 ++++++++++++++++++++++++++++++++++---
 7 files changed, 663 insertions(+), 23 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b80de52..b6b8388 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -416,13 +416,41 @@ struct sock;
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 void sock_update_memcg(struct sock *sk);
 void sock_release_memcg(struct sock *sk);
-#else
+struct kmem_cache *mem_cgroup_get_kmem_cache(struct kmem_cache *cachep,
+    gfp_t gfp);
+bool mem_cgroup_charge_slab(struct kmem_cache *cachep, gfp_t gfp, size_t size);
+void mem_cgroup_uncharge_slab(struct kmem_cache *cachep, size_t size);
+void mem_cgroup_flush_cache_create_queue(void);
+void mem_cgroup_remove_child_kmem_cache(struct kmem_cache *cachep, int id);
+#else /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
 static inline void sock_update_memcg(struct sock *sk)
 {
 }
 static inline void sock_release_memcg(struct sock *sk)
 {
 }
+
+static inline bool
+mem_cgroup_charge_slab(struct kmem_cache *cachep, gfp_t gfp, size_t size)
+{
+	return true;
+}
+
+static inline void
+mem_cgroup_uncharge_slab(struct kmem_cache *cachep, size_t size)
+{
+}
+
+static inline struct kmem_cache *
+mem_cgroup_get_kmem_cache(struct kmem_cache *cachep, gfp_t gfp)
+{
+	return cachep;
+}
+
+static inline void
+mem_cgroup_flush_cache_create_queue(void)
+{
+}
 #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
 #endif /* _LINUX_MEMCONTROL_H */
 
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 573c809..bc9f87f 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -21,6 +21,7 @@
 #define SLAB_POISON		0x00000800UL	/* DEBUG: Poison objects */
 #define SLAB_HWCACHE_ALIGN	0x00002000UL	/* Align objs on cache lines */
 #define SLAB_CACHE_DMA		0x00004000UL	/* Use GFP_DMA memory */
+#define SLAB_MEMCG_ACCT		0x00008000UL	/* Accounted by memcg */
 #define SLAB_STORE_USER		0x00010000UL	/* DEBUG: Store the last owner for bug hunting */
 #define SLAB_PANIC		0x00040000UL	/* Panic if kmem_cache_create() fails */
 /*
@@ -153,6 +154,21 @@ unsigned int kmem_cache_size(struct kmem_cache *);
 #define ARCH_SLAB_MINALIGN __alignof__(unsigned long long)
 #endif
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+struct mem_cgroup_cache_params {
+	struct mem_cgroup *memcg;
+	int id;
+	int obj_size;
+
+	/* Original cache parameters, used when creating a memcg cache */
+	size_t orig_align;
+	unsigned long orig_flags;
+	struct kmem_cache *orig_cache;
+
+	struct list_head destroyed_list; /* Used when deleting cpuset cache */
+};
+#endif
+
 /*
  * Common kmalloc functions provided by all allocators
  */
@@ -353,4 +369,29 @@ static inline void *kzalloc_node(size_t size, gfp_t flags, int node)
 
 void __init kmem_cache_init_late(void);
 
+#if defined(CONFIG_CGROUP_MEM_RES_CTLR_KMEM) && defined(CONFIG_SLAB)
+
+#define MAX_KMEM_CACHE_TYPES 400
+
+struct kmem_cache *kmem_cache_create_memcg(struct kmem_cache *cachep,
+    char *name);
+void kmem_cache_drop_ref(struct kmem_cache *cachep);
+
+#else /* !CONFIG_CGROUP_MEM_RES_CTLR_KMEM || !CONFIG_SLAB */
+
+#define MAX_KMEM_CACHE_TYPES 0
+
+static inline struct kmem_cache *
+kmem_cache_create_memcg(struct kmem_cache *cachep,
+    char *name)
+{
+	return NULL;
+}
+
+static inline void
+kmem_cache_drop_ref(struct kmem_cache *cachep)
+{
+}
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM && CONFIG_SLAB */
+
 #endif	/* _LINUX_SLAB_H */
diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
index 25f9a6a..248b8a9 100644
--- a/include/linux/slab_def.h
+++ b/include/linux/slab_def.h
@@ -51,7 +51,7 @@ struct kmem_cache {
 	void (*ctor)(void *obj);
 
 /* 4) cache creation/removal */
-	const char *name;
+	char *name;
 	struct list_head next;
 
 /* 5) statistics */
@@ -81,6 +81,12 @@ struct kmem_cache {
 	int obj_size;
 #endif /* CONFIG_DEBUG_SLAB */
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+	struct mem_cgroup_cache_params memcg_params;
+
+	atomic_t refcnt;
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
+
 /* 6) per-cpu/per-node data, touched during every alloc/free */
 	/*
 	 * We put array[] at the end of kmem_cache, because we want to size
@@ -218,4 +224,68 @@ found:
 
 #endif	/* CONFIG_NUMA */
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+
+void kmem_cache_drop_ref(struct kmem_cache *cachep);
+
+static inline void
+kmem_cache_get_ref(struct kmem_cache *cachep)
+{
+	if (cachep->memcg_params.id == -1 &&
+	    unlikely(!atomic_add_unless(&cachep->refcnt, 1, 0)))
+		BUG();
+}
+
+static inline void
+mem_cgroup_put_kmem_cache(struct kmem_cache *cachep)
+{
+	rcu_read_unlock();
+}
+
+static inline void
+mem_cgroup_kmem_cache_prepare_sleep(struct kmem_cache *cachep)
+{
+	/*
+	 * Make sure the cache doesn't get freed while we have interrupts
+	 * enabled.
+	 */
+	kmem_cache_get_ref(cachep);
+	rcu_read_unlock();
+}
+
+static inline void
+mem_cgroup_kmem_cache_finish_sleep(struct kmem_cache *cachep)
+{
+	rcu_read_lock();
+	kmem_cache_drop_ref(cachep);
+}
+
+#else /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
+
+static inline void
+kmem_cache_get_ref(struct kmem_cache *cachep)
+{
+}
+
+static inline void
+kmem_cache_drop_ref(struct kmem_cache *cachep)
+{
+}
+
+static inline void
+mem_cgroup_put_kmem_cache(struct kmem_cache *cachep)
+{
+}
+
+static inline void
+mem_cgroup_kmem_cache_prepare_sleep(struct kmem_cache *cachep)
+{
+}
+
+static inline void
+mem_cgroup_kmem_cache_finish_sleep(struct kmem_cache *cachep)
+{
+}
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
+
 #endif	/* _LINUX_SLAB_DEF_H */
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 5911d81..09b602d 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -99,6 +99,9 @@ struct kmem_cache {
 #ifdef CONFIG_SYSFS
 	struct kobject kobj;	/* For sysfs */
 #endif
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+	struct mem_cgroup_cache_params memcg_params;
+#endif
 
 #ifdef CONFIG_NUMA
 	/*
diff --git a/init/Kconfig b/init/Kconfig
index 3f42cd6..e7eb652 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -705,7 +705,7 @@ config CGROUP_MEM_RES_CTLR_SWAP_ENABLED
 	  then swapaccount=0 does the trick).
 config CGROUP_MEM_RES_CTLR_KMEM
 	bool "Memory Resource Controller Kernel Memory accounting (EXPERIMENTAL)"
-	depends on CGROUP_MEM_RES_CTLR && EXPERIMENTAL
+	depends on CGROUP_MEM_RES_CTLR && EXPERIMENTAL && !SLOB
 	default n
 	help
 	  The Kernel Memory extension for Memory Resource Controller can limit
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2576a2b..a5593cf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -302,6 +302,11 @@ struct mem_cgroup {
 #ifdef CONFIG_INET
 	struct tcp_memcontrol tcp_mem;
 #endif
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+	/* Slab accounting */
+	struct kmem_cache *slabs[MAX_KMEM_CACHE_TYPES];
+#endif
 };
 
 /* Stuffs for move charges at task migration. */
@@ -5692,6 +5697,287 @@ memcg_uncharge_kmem(struct mem_cgroup *memcg, long long delta)
 		res_counter_uncharge(&memcg->res, delta);
 }
 
+static struct kmem_cache *
+memcg_create_kmem_cache(struct mem_cgroup *memcg, struct kmem_cache *cachep)
+{
+	struct kmem_cache *new_cachep;
+	struct dentry *dentry;
+	char *name;
+	int idx;
+
+	idx = cachep->memcg_params.id;
+
+	dentry = memcg->css.cgroup->dentry;
+	BUG_ON(dentry == NULL);
+
+	/* Preallocate the space for "dead" at the end */
+	name = kasprintf(GFP_KERNEL, "%s(%d:%s)dead",
+	    cachep->name, css_id(&memcg->css), dentry->d_name.name);
+	if (name == NULL)
+		return cachep;
+	/* Remove "dead" */
+	name[strlen(name) - 4] = '\0';
+
+	new_cachep = kmem_cache_create_memcg(cachep, name);
+
+	/*
+	 * Another CPU is creating the same cache?
+	 * We'll use it next time.
+	 */
+	if (new_cachep == NULL) {
+		kfree(name);
+		return cachep;
+	}
+
+	new_cachep->memcg_params.memcg = memcg;
+
+	/*
+	 * Make sure someone else hasn't created the new cache in the
+	 * meantime.
+	 * This should behave as a write barrier, so we should be fine
+	 * with RCU.
+	 */
+	if (cmpxchg(&memcg->slabs[idx], NULL, new_cachep) != NULL) {
+		kmem_cache_destroy(new_cachep);
+		return cachep;
+	}
+
+	return new_cachep;
+}
+
+struct create_work {
+	struct mem_cgroup *memcg;
+	struct kmem_cache *cachep;
+	struct list_head list;
+};
+
+static DEFINE_SPINLOCK(create_queue_lock);
+static LIST_HEAD(create_queue);
+
+/*
+ * Flush the queue of kmem_caches to create, because we're creating a cgroup.
+ *
+ * We might end up flushing other cgroups' creation requests as well, but
+ * they will just get queued again next time someone tries to make a slab
+ * allocation for them.
+ */
+void
+mem_cgroup_flush_cache_create_queue(void)
+{
+	struct create_work *cw, *tmp;
+	unsigned long flags;
+
+	spin_lock_irqsave(&create_queue_lock, flags);
+	list_for_each_entry_safe(cw, tmp, &create_queue, list) {
+		list_del(&cw->list);
+		kfree(cw);
+	}
+	spin_unlock_irqrestore(&create_queue_lock, flags);
+}
+
+static void
+memcg_create_cache_work_func(struct work_struct *w)
+{
+	struct kmem_cache *cachep;
+	struct create_work *cw;
+
+	spin_lock_irq(&create_queue_lock);
+	while (!list_empty(&create_queue)) {
+		cw = list_first_entry(&create_queue, struct create_work, list);
+		list_del(&cw->list);
+		spin_unlock_irq(&create_queue_lock);
+		cachep = memcg_create_kmem_cache(cw->memcg, cw->cachep);
+		if (cachep == NULL && printk_ratelimit())
+			printk(KERN_ALERT "%s: Couldn't create memcg-cache for"
+			    " %s memcg %s\n", __func__, cw->cachep->name,
+			    cw->memcg->css.cgroup->dentry->d_name.name);
+		/* Drop the reference gotten when we enqueued. */
+		css_put(&cw->memcg->css);
+		kfree(cw);
+		spin_lock_irq(&create_queue_lock);
+	}
+	spin_unlock_irq(&create_queue_lock);
+}
+
+static DECLARE_WORK(memcg_create_cache_work, memcg_create_cache_work_func);
+
+/*
+ * Enqueue the creation of a per-memcg kmem_cache.
+ * Called with rcu_read_lock.
+ */
+static void
+memcg_create_cache_enqueue(struct mem_cgroup *memcg, struct kmem_cache *cachep)
+{
+	struct create_work *cw;
+	unsigned long flags;
+
+	spin_lock_irqsave(&create_queue_lock, flags);
+	list_for_each_entry(cw, &create_queue, list) {
+		if (cw->memcg == memcg && cw->cachep == cachep) {
+			spin_unlock_irqrestore(&create_queue_lock, flags);
+			return;
+		}
+	}
+	spin_unlock_irqrestore(&create_queue_lock, flags);
+
+	/* The corresponding put will be done in the workqueue. */
+	if (!css_tryget(&memcg->css))
+		return;
+
+	cw = kmalloc(sizeof(struct create_work), GFP_NOWAIT);
+	if (cw == NULL) {
+		css_put(&memcg->css);
+		return;
+	}
+
+	cw->memcg = memcg;
+	cw->cachep = cachep;
+	spin_lock_irqsave(&create_queue_lock, flags);
+	list_add_tail(&cw->list, &create_queue);
+	spin_unlock_irqrestore(&create_queue_lock, flags);
+
+	schedule_work(&memcg_create_cache_work);
+}
+
+/*
+ * Return the kmem_cache we're supposed to use for a slab allocation.
+ * If we are in interrupt context or otherwise have an allocation that
+ * can't fail, we return the original cache.
+ * Otherwise, we will try to use the current memcg's version of the cache.
+ *
+ * If the cache does not exist yet, if we are the first user of it,
+ * we either create it immediately, if possible, or create it asynchronously
+ * in a workqueue.
+ * In the latter case, we will let the current allocation go through with
+ * the original cache.
+ *
+ * This function returns with rcu_read_lock() held.
+ */
+struct kmem_cache *
+mem_cgroup_get_kmem_cache(struct kmem_cache *cachep, gfp_t gfp)
+{
+	struct mem_cgroup *memcg;
+	int idx;
+
+	rcu_read_lock();
+
+	if (in_interrupt())
+		return cachep;
+	if (current == NULL)
+		return cachep;
+
+	if (!(cachep->flags & SLAB_MEMCG_ACCT))
+		return cachep;
+
+	gfp |= kmem_cache_gfp_flags(cachep);
+	if (gfp & __GFP_NOFAIL)
+		return cachep;
+
+	if (cachep->memcg_params.memcg)
+		return cachep;
+
+	memcg = mem_cgroup_from_task(current);
+	idx = cachep->memcg_params.id;
+
+	if (memcg == NULL || memcg == root_mem_cgroup)
+		return cachep;
+
+	VM_BUG_ON(idx == -1);
+
+	if (rcu_access_pointer(memcg->slabs[idx]) == NULL) {
+		memcg_create_cache_enqueue(memcg, cachep);
+		return cachep;
+	}
+
+	return rcu_dereference(memcg->slabs[idx]);
+}
+
+void
+mem_cgroup_remove_child_kmem_cache(struct kmem_cache *cachep, int id)
+{
+	rcu_assign_pointer(cachep->memcg_params.memcg->slabs[id], NULL);
+}
+
+bool
+mem_cgroup_charge_slab(struct kmem_cache *cachep, gfp_t gfp, size_t size)
+{
+	struct mem_cgroup *memcg;
+	int ret;
+
+	rcu_read_lock();
+	memcg = cachep->memcg_params.memcg;
+
+	if (memcg && !css_tryget(&memcg->css))
+		memcg = NULL;
+	rcu_read_unlock();
+
+	ret = memcg_charge_kmem(memcg, gfp, size);
+	if (memcg)
+		css_put(&memcg->css);
+
+	return ret == 0;
+}
+
+void
+mem_cgroup_uncharge_slab(struct kmem_cache *cachep, size_t size)
+{
+	struct mem_cgroup *memcg;
+
+	rcu_read_lock();
+	memcg = cachep->memcg_params.memcg;
+
+	if (memcg && !css_tryget(&memcg->css))
+		memcg = NULL;
+	rcu_read_unlock();
+
+	memcg_uncharge_kmem(memcg, size);
+	if (memcg)
+		css_put(&memcg->css);
+}
+
+static void
+memcg_slab_init(struct mem_cgroup *memcg)
+{
+	int i;
+
+	for (i = 0; i < MAX_KMEM_CACHE_TYPES; i++)
+		rcu_assign_pointer(memcg->slabs[i], NULL);
+}
+
+/*
+ * Mark all of this memcg's kmem_caches as dead and move them to the
+ * root.
+ *
+ * Assumes that the callers are synchronized (only one thread should be
+ * moving a cgroup's slab at the same time).
+ */
+static void
+memcg_slab_move(struct mem_cgroup *memcg)
+{
+	struct kmem_cache *cachep;
+	int i;
+
+	mem_cgroup_flush_cache_create_queue();
+
+	for (i = 0; i < MAX_KMEM_CACHE_TYPES; i++) {
+		cachep = rcu_access_pointer(memcg->slabs[i]);
+		if (cachep != NULL) {
+			rcu_assign_pointer(memcg->slabs[i], NULL);
+			cachep->memcg_params.memcg = NULL;
+
+			/* The space for this is already allocated */
+			strcat((char *)cachep->name, "dead");
+
+			/*
+			 * Drop the initial reference on the cache.
+			 * This means that from this point on, the cache will
+			 * get destroyed when it no longer has active objects.
+			 */
+			kmem_cache_drop_ref(cachep);
+		}
+	}
+}
+
 static void
 memcg_kmem_init(struct mem_cgroup *memcg, struct mem_cgroup *parent)
 {
@@ -5701,6 +5987,8 @@ memcg_kmem_init(struct mem_cgroup *memcg, struct mem_cgroup *parent)
 	if (parent && parent != root_mem_cgroup)
 		parent_res = &parent->kmem;
 	res_counter_init(&memcg->kmem, parent_res);
+
+	memcg_slab_init(memcg);
 }
 
 static void
@@ -5709,6 +5997,8 @@ memcg_kmem_move(struct mem_cgroup *memcg)
 	unsigned long flags;
 	long kmem;
 
+	memcg_slab_move(memcg);
+
 	spin_lock_irqsave(&memcg->kmem.lock, flags);
 	kmem = memcg->kmem.usage;
 	res_counter_uncharge_locked(&memcg->kmem, kmem);
diff --git a/mm/slab.c b/mm/slab.c
index f0bd785..95b024c 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -159,13 +159,15 @@
 			 SLAB_STORE_USER | \
 			 SLAB_RECLAIM_ACCOUNT | SLAB_PANIC | \
 			 SLAB_DESTROY_BY_RCU | SLAB_MEM_SPREAD | \
-			 SLAB_DEBUG_OBJECTS | SLAB_NOLEAKTRACE | SLAB_NOTRACK)
+			 SLAB_DEBUG_OBJECTS | SLAB_NOLEAKTRACE | \
+			 SLAB_NOTRACK | SLAB_MEMCG_ACCT)
 #else
 # define CREATE_MASK	(SLAB_HWCACHE_ALIGN | \
 			 SLAB_CACHE_DMA | \
 			 SLAB_RECLAIM_ACCOUNT | SLAB_PANIC | \
 			 SLAB_DESTROY_BY_RCU | SLAB_MEM_SPREAD | \
-			 SLAB_DEBUG_OBJECTS | SLAB_NOLEAKTRACE | SLAB_NOTRACK)
+			 SLAB_DEBUG_OBJECTS | SLAB_NOLEAKTRACE | \
+			 SLAB_NOTRACK | SLAB_MEMCG_ACCT)
 #endif
 
 /*
@@ -301,6 +303,8 @@ static void free_block(struct kmem_cache *cachep, void **objpp, int len,
 			int node);
 static int enable_cpucache(struct kmem_cache *cachep, gfp_t gfp);
 static void cache_reap(struct work_struct *unused);
+static int do_tune_cpucache(struct kmem_cache *cachep, int limit,
+    int batchcount, int shared, gfp_t gfp);
 
 /*
  * This function must be completely optimized away if a constant is passed to
@@ -326,6 +330,11 @@ static __always_inline int index_of(const size_t size)
 	return 0;
 }
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+/* Bitmap used for allocating the cache id numbers. */
+static DECLARE_BITMAP(cache_types, MAX_KMEM_CACHE_TYPES);
+#endif
+
 static int slab_early_init = 1;
 
 #define INDEX_AC index_of(sizeof(struct arraycache_init))
@@ -1756,17 +1765,23 @@ static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 	if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
 		flags |= __GFP_RECLAIMABLE;
 
+	nr_pages = (1 << cachep->gfporder);
+	if (!mem_cgroup_charge_slab(cachep, flags, nr_pages * PAGE_SIZE))
+		return NULL;
+
 	page = alloc_pages_exact_node(nodeid, flags | __GFP_NOTRACK, cachep->gfporder);
-	if (!page)
+	if (!page) {
+		mem_cgroup_uncharge_slab(cachep, nr_pages * PAGE_SIZE);
 		return NULL;
+	}
 
-	nr_pages = (1 << cachep->gfporder);
 	if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
 		add_zone_page_state(page_zone(page),
 			NR_SLAB_RECLAIMABLE, nr_pages);
 	else
 		add_zone_page_state(page_zone(page),
 			NR_SLAB_UNRECLAIMABLE, nr_pages);
+	kmem_cache_get_ref(cachep);
 	for (i = 0; i < nr_pages; i++)
 		__SetPageSlab(page + i);
 
@@ -1799,6 +1814,8 @@ static void kmem_freepages(struct kmem_cache *cachep, void *addr)
 	else
 		sub_zone_page_state(page_zone(page),
 				NR_SLAB_UNRECLAIMABLE, nr_freed);
+	mem_cgroup_uncharge_slab(cachep, i * PAGE_SIZE);
+	kmem_cache_drop_ref(cachep);
 	while (i--) {
 		BUG_ON(!PageSlab(page));
 		__ClearPageSlab(page);
@@ -2224,14 +2241,17 @@ static int __init_refok setup_cpu_cache(struct kmem_cache *cachep, gfp_t gfp)
  * cacheline.  This can be beneficial if you're counting cycles as closely
  * as davem.
  */
-struct kmem_cache *
-kmem_cache_create (const char *name, size_t size, size_t align,
-	unsigned long flags, void (*ctor)(void *))
+static struct kmem_cache *
+__kmem_cache_create(const char *name, size_t size, size_t align,
+    unsigned long flags, void (*ctor)(void *), bool memcg)
 {
-	size_t left_over, slab_size, ralign;
+	size_t left_over, orig_align, ralign, slab_size;
 	struct kmem_cache *cachep = NULL, *pc;
+	unsigned long orig_flags;
 	gfp_t gfp;
 
+	orig_align = align;
+	orig_flags = flags;
 	/*
 	 * Sanity checks... these are all serious usage bugs.
 	 */
@@ -2248,7 +2268,6 @@ kmem_cache_create (const char *name, size_t size, size_t align,
 	 */
 	if (slab_is_available()) {
 		get_online_cpus();
-		mutex_lock(&cache_chain_mutex);
 	}
 
 	list_for_each_entry(pc, &cache_chain, next) {
@@ -2269,10 +2288,12 @@ kmem_cache_create (const char *name, size_t size, size_t align,
 		}
 
 		if (!strcmp(pc->name, name)) {
-			printk(KERN_ERR
-			       "kmem_cache_create: duplicate cache %s\n", name);
-			dump_stack();
-			goto oops;
+			if (!memcg) {
+				printk(KERN_ERR "kmem_cache_create: duplicate"
+				    " cache %s\n", name);
+				dump_stack();
+				goto oops;
+			}
 		}
 	}
 
@@ -2369,6 +2390,13 @@ kmem_cache_create (const char *name, size_t size, size_t align,
 		goto oops;
 
 	cachep->nodelists = (struct kmem_list3 **)&cachep->array[nr_cpu_ids];
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+	cachep->memcg_params.obj_size = size;
+	cachep->memcg_params.orig_align = orig_align;
+	cachep->memcg_params.orig_flags = orig_flags;
+#endif
+
 #if DEBUG
 	cachep->obj_size = size;
 
@@ -2477,7 +2505,23 @@ kmem_cache_create (const char *name, size_t size, size_t align,
 		BUG_ON(ZERO_OR_NULL_PTR(cachep->slabp_cache));
 	}
 	cachep->ctor = ctor;
-	cachep->name = name;
+	cachep->name = (char *)name;
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+	cachep->memcg_params.orig_cache = NULL;
+	atomic_set(&cachep->refcnt, 1);
+	INIT_LIST_HEAD(&cachep->memcg_params.destroyed_list);
+
+	if (!memcg) {
+		int id;
+
+		id = find_first_zero_bit(cache_types, MAX_KMEM_CACHE_TYPES);
+		BUG_ON(id < 0 || id >= MAX_KMEM_CACHE_TYPES);
+		__set_bit(id, cache_types);
+		cachep->memcg_params.id = id;
+	} else
+		cachep->memcg_params.id = -1;
+#endif
 
 	if (setup_cpu_cache(cachep, gfp)) {
 		__kmem_cache_destroy(cachep);
@@ -2502,13 +2546,53 @@ oops:
 		panic("kmem_cache_create(): failed to create slab `%s'\n",
 		      name);
 	if (slab_is_available()) {
-		mutex_unlock(&cache_chain_mutex);
 		put_online_cpus();
 	}
 	return cachep;
 }
+
+struct kmem_cache *
+kmem_cache_create(const char *name, size_t size, size_t align,
+    unsigned long flags, void (*ctor)(void *))
+{
+	struct kmem_cache *cachep;
+
+	mutex_lock(&cache_chain_mutex);
+	cachep = __kmem_cache_create(name, size, align, flags, ctor, false);
+	mutex_unlock(&cache_chain_mutex);
+
+	return cachep;
+}
 EXPORT_SYMBOL(kmem_cache_create);
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+struct kmem_cache *
+kmem_cache_create_memcg(struct kmem_cache *cachep, char *name)
+{
+	struct kmem_cache *new;
+	int flags;
+
+	flags = cachep->memcg_params.orig_flags & ~SLAB_PANIC;
+	mutex_lock(&cache_chain_mutex);
+	new = __kmem_cache_create(name, cachep->memcg_params.obj_size,
+	    cachep->memcg_params.orig_align, flags, cachep->ctor, 1);
+	if (new == NULL) {
+		mutex_unlock(&cache_chain_mutex);
+		return NULL;
+	}
+	new->memcg_params.orig_cache = cachep;
+
+	if ((cachep->limit != new->limit) ||
+	    (cachep->batchcount != new->batchcount) ||
+	    (cachep->shared != new->shared))
+		do_tune_cpucache(new, cachep->limit, cachep->batchcount,
+		    cachep->shared, GFP_KERNEL);
+	mutex_unlock(&cache_chain_mutex);
+
+	return new;
+}
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
+
 #if DEBUG
 static void check_irq_off(void)
 {
@@ -2703,12 +2787,74 @@ void kmem_cache_destroy(struct kmem_cache *cachep)
 	if (unlikely(cachep->flags & SLAB_DESTROY_BY_RCU))
 		rcu_barrier();
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+	/* id != -1 means this is a parent cache, not a memcg cache. */
+	if (cachep->memcg_params.id != -1) {
+		__clear_bit(cachep->memcg_params.id, cache_types);
+		mem_cgroup_flush_cache_create_queue();
+	}
+#endif
 	__kmem_cache_destroy(cachep);
 	mutex_unlock(&cache_chain_mutex);
 	put_online_cpus();
 }
 EXPORT_SYMBOL(kmem_cache_destroy);
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+static DEFINE_SPINLOCK(destroy_lock);
+static LIST_HEAD(destroyed_caches);
+
+static void
+kmem_cache_destroy_work_func(struct work_struct *w)
+{
+	struct kmem_cache *cachep;
+	char *name;
+
+	spin_lock_irq(&destroy_lock);
+	while (!list_empty(&destroyed_caches)) {
+		cachep = container_of(list_first_entry(&destroyed_caches,
+		    struct mem_cgroup_cache_params, destroyed_list), struct
+		    kmem_cache, memcg_params);
+		name = (char *)cachep->name;
+		list_del(&cachep->memcg_params.destroyed_list);
+		spin_unlock_irq(&destroy_lock);
+		synchronize_rcu();
+		kmem_cache_destroy(cachep);
+		kfree(name);
+		spin_lock_irq(&destroy_lock);
+	}
+	spin_unlock_irq(&destroy_lock);
+}
+
+static DECLARE_WORK(kmem_cache_destroy_work, kmem_cache_destroy_work_func);
+
+static void
+kmem_cache_destroy_memcg(struct kmem_cache *cachep)
+{
+	unsigned long flags;
+
+	BUG_ON(cachep->memcg_params.id != -1);
+
+	/*
+	 * We have to defer the actual destroying to a workqueue, because
+	 * we might currently be in a context that cannot sleep.
+	 */
+	spin_lock_irqsave(&destroy_lock, flags);
+	list_add(&cachep->memcg_params.destroyed_list, &destroyed_caches);
+	spin_unlock_irqrestore(&destroy_lock, flags);
+
+	schedule_work(&kmem_cache_destroy_work);
+}
+
+void
+kmem_cache_drop_ref(struct kmem_cache *cachep)
+{
+	if (cachep->memcg_params.id == -1 &&
+	    unlikely(atomic_dec_and_test(&cachep->refcnt)))
+		kmem_cache_destroy_memcg(cachep);
+}
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
+
 /*
  * Get the memory for a slab management obj.
  * For a slab cache when the slab descriptor is off-slab, slab descriptors
@@ -2908,8 +3054,10 @@ static int cache_grow(struct kmem_cache *cachep,
 
 	offset *= cachep->colour_off;
 
-	if (local_flags & __GFP_WAIT)
+	if (local_flags & __GFP_WAIT) {
 		local_irq_enable();
+		mem_cgroup_kmem_cache_prepare_sleep(cachep);
+	}
 
 	/*
 	 * The test for missing atomic flag is performed here, rather than
@@ -2938,8 +3086,10 @@ static int cache_grow(struct kmem_cache *cachep,
 
 	cache_init_objs(cachep, slabp);
 
-	if (local_flags & __GFP_WAIT)
+	if (local_flags & __GFP_WAIT) {
 		local_irq_disable();
+		mem_cgroup_kmem_cache_finish_sleep(cachep);
+	}
 	check_irq_off();
 	spin_lock(&l3->list_lock);
 
@@ -2952,8 +3102,10 @@ static int cache_grow(struct kmem_cache *cachep,
 opps1:
 	kmem_freepages(cachep, objp);
 failed:
-	if (local_flags & __GFP_WAIT)
+	if (local_flags & __GFP_WAIT) {
 		local_irq_disable();
+		mem_cgroup_kmem_cache_finish_sleep(cachep);
+	}
 	return 0;
 }
 
@@ -3712,10 +3864,14 @@ static inline void __cache_free(struct kmem_cache *cachep, void *objp,
  */
 void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
 {
-	void *ret = __cache_alloc(cachep, flags, __builtin_return_address(0));
+	void *ret;
+
+	cachep = mem_cgroup_get_kmem_cache(cachep, flags);
+	ret = __cache_alloc(cachep, flags, __builtin_return_address(0));
 
 	trace_kmem_cache_alloc(_RET_IP_, ret,
 			       obj_size(cachep), cachep->buffer_size, flags);
+	mem_cgroup_put_kmem_cache(cachep);
 
 	return ret;
 }
@@ -3727,10 +3883,12 @@ kmem_cache_alloc_trace(size_t size, struct kmem_cache *cachep, gfp_t flags)
 {
 	void *ret;
 
+	cachep = mem_cgroup_get_kmem_cache(cachep, flags);
 	ret = __cache_alloc(cachep, flags, __builtin_return_address(0));
 
 	trace_kmalloc(_RET_IP_, ret,
 		      size, slab_buffer_size(cachep), flags);
+	mem_cgroup_put_kmem_cache(cachep);
 	return ret;
 }
 EXPORT_SYMBOL(kmem_cache_alloc_trace);
@@ -3739,12 +3897,16 @@ EXPORT_SYMBOL(kmem_cache_alloc_trace);
 #ifdef CONFIG_NUMA
 void *kmem_cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 {
-	void *ret = __cache_alloc_node(cachep, flags, nodeid,
+	void *ret;
+
+	cachep = mem_cgroup_get_kmem_cache(cachep, flags);
+	ret = __cache_alloc_node(cachep, flags, nodeid,
 				       __builtin_return_address(0));
 
 	trace_kmem_cache_alloc_node(_RET_IP_, ret,
 				    obj_size(cachep), cachep->buffer_size,
 				    flags, nodeid);
+	mem_cgroup_put_kmem_cache(cachep);
 
 	return ret;
 }
@@ -3758,11 +3920,13 @@ void *kmem_cache_alloc_node_trace(size_t size,
 {
 	void *ret;
 
+	cachep = mem_cgroup_get_kmem_cache(cachep, flags);
 	ret = __cache_alloc_node(cachep, flags, nodeid,
 				  __builtin_return_address(0));
 	trace_kmalloc_node(_RET_IP_, ret,
 			   size, slab_buffer_size(cachep),
 			   flags, nodeid);
+	mem_cgroup_put_kmem_cache(cachep);
 	return ret;
 }
 EXPORT_SYMBOL(kmem_cache_alloc_node_trace);
@@ -3866,9 +4030,35 @@ void kmem_cache_free(struct kmem_cache *cachep, void *objp)
 
 	local_irq_save(flags);
 	debug_check_no_locks_freed(objp, obj_size(cachep));
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+	{
+		struct kmem_cache *actual_cachep;
+
+		actual_cachep = virt_to_cache(objp);
+		if (actual_cachep != cachep) {
+			VM_BUG_ON(actual_cachep->memcg_params.id != -1);
+			VM_BUG_ON(actual_cachep->memcg_params.orig_cache !=
+			    cachep);
+			cachep = actual_cachep;
+		}
+		/*
+		 * Grab a reference so that the cache is guaranteed to stay
+		 * around.
+		 * If we are freeing the last object of a dead memcg cache,
+		 * the kmem_cache_drop_ref() at the end of this function
+		 * will end up freeing the cache.
+		 */
+		kmem_cache_get_ref(cachep);
+	}
+#endif
+
 	if (!(cachep->flags & SLAB_DEBUG_OBJECTS))
 		debug_check_no_obj_freed(objp, obj_size(cachep));
 	__cache_free(cachep, objp, __builtin_return_address(0));
+
+	kmem_cache_drop_ref(cachep);
+
 	local_irq_restore(flags);
 
 	trace_kmem_cache_free(_RET_IP_, objp);
@@ -3896,9 +4086,19 @@ void kfree(const void *objp)
 	local_irq_save(flags);
 	kfree_debugcheck(objp);
 	c = virt_to_cache(objp);
+
+	/*
+	 * Grab a reference so that the cache is guaranteed to stay around.
+	 * If we are freeing the last object of a dead memcg cache, the
+	 * kmem_cache_drop_ref() at the end of this function will end up
+	 * freeing the cache.
+	 */
+	kmem_cache_get_ref(c);
+
 	debug_check_no_locks_freed(objp, obj_size(c));
 	debug_check_no_obj_freed(objp, obj_size(c));
 	__cache_free(c, (void *)objp, __builtin_return_address(0));
+	kmem_cache_drop_ref(c);
 	local_irq_restore(flags);
 }
 EXPORT_SYMBOL(kfree);
@@ -4167,6 +4367,13 @@ static void cache_reap(struct work_struct *w)
 	list_for_each_entry(searchp, &cache_chain, next) {
 		check_irq_on();
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+		/* For memcg caches, make sure we only reap the active ones. */
+		if (searchp->memcg_params.id == -1 &&
+		    !atomic_add_unless(&searchp->refcnt, 1, 0))
+			continue;
+#endif
+
 		/*
 		 * We only take the l3 lock if absolutely necessary and we
 		 * have established with reasonable certainty that
@@ -4199,6 +4406,7 @@ static void cache_reap(struct work_struct *w)
 			STATS_ADD_REAPED(searchp, freed);
 		}
 next:
+		kmem_cache_drop_ref(searchp);
 		cond_resched();
 	}
 	check_irq_on();
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v2 08/13] memcg: Make dentry slab memory accounted in kernel memory accounting.
  2012-03-09 20:39 [PATCH v2 00/13] Memcg Kernel Memory Tracking Suleiman Souhlal
                   ` (6 preceding siblings ...)
  2012-03-09 20:39 ` [PATCH v2 07/13] memcg: Slab accounting Suleiman Souhlal
@ 2012-03-09 20:39 ` Suleiman Souhlal
  2012-03-09 20:39 ` [PATCH v2 09/13] memcg: Account for kmalloc " Suleiman Souhlal
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 45+ messages in thread
From: Suleiman Souhlal @ 2012-03-09 20:39 UTC (permalink / raw)
  To: cgroups
  Cc: suleiman, glommer, kamezawa.hiroyu, penberg, cl, yinghan, hughd,
	gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel, Suleiman Souhlal

Signed-off-by: Suleiman Souhlal <suleiman@google.com>
---
 fs/dcache.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index bcbdb33..cc6dd92 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -3022,8 +3022,8 @@ static void __init dcache_init(void)
 	 * but it is probably not worth it because of the cache nature
 	 * of the dcache. 
 	 */
-	dentry_cache = KMEM_CACHE(dentry,
-		SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD);
+	dentry_cache = KMEM_CACHE(dentry, SLAB_RECLAIM_ACCOUNT | SLAB_PANIC |
+	    SLAB_MEM_SPREAD | SLAB_MEMCG_ACCT);
 
 	/* Hash may have been set up in dcache_init_early */
 	if (!hashdist)
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v2 09/13] memcg: Account for kmalloc in kernel memory accounting.
  2012-03-09 20:39 [PATCH v2 00/13] Memcg Kernel Memory Tracking Suleiman Souhlal
                   ` (7 preceding siblings ...)
  2012-03-09 20:39 ` [PATCH v2 08/13] memcg: Make dentry slab memory accounted in kernel memory accounting Suleiman Souhlal
@ 2012-03-09 20:39 ` Suleiman Souhlal
  2012-03-11 12:21   ` Glauber Costa
  2012-03-09 20:39 ` [PATCH v2 10/13] memcg: Track all the memcg children of a kmem_cache Suleiman Souhlal
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 45+ messages in thread
From: Suleiman Souhlal @ 2012-03-09 20:39 UTC (permalink / raw)
  To: cgroups
  Cc: suleiman, glommer, kamezawa.hiroyu, penberg, cl, yinghan, hughd,
	gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel, Suleiman Souhlal

In order to do this, we have to create a kmalloc_no_account()
function for kmalloc allocations that we do not want to account:
the kernel memory accounting code has to make some kmalloc
allocations itself and must not recurse into the accounting path.
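
As a rough illustration of the intended call pattern (a sketch, not
part of the patch; the create_work caller is the one touched in the
memcontrol.c hunk below):

	/*
	 * Ordinary kernel code: the allocation is charged to the
	 * current cgroup through the accounted kmalloc caches.
	 */
	buf = kmalloc(size, GFP_KERNEL);

	/*
	 * Inside the accounting machinery itself, e.g. when queueing
	 * the creation of a per-cgroup cache, use the unaccounted
	 * variant so the accounting path cannot recurse into itself.
	 */
	cw = kmalloc_no_account(sizeof(struct create_work), GFP_NOWAIT);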

Signed-off-by: Suleiman Souhlal <suleiman@google.com>
---
 include/linux/slab.h |    8 ++++++++
 mm/memcontrol.c      |    2 +-
 mm/slab.c            |   42 +++++++++++++++++++++++++++++++++++++++---
 3 files changed, 48 insertions(+), 4 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index bc9f87f..906a015 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -377,6 +377,8 @@ struct kmem_cache *kmem_cache_create_memcg(struct kmem_cache *cachep,
     char *name);
 void kmem_cache_drop_ref(struct kmem_cache *cachep);
 
+void *kmalloc_no_account(size_t size, gfp_t flags);
+
 #else /* !CONFIG_CGROUP_MEM_RES_CTLR_KMEM || !CONFIG_SLAB */
 
 #define MAX_KMEM_CACHE_TYPES 0
@@ -392,6 +394,12 @@ static inline void
 kmem_cache_drop_ref(struct kmem_cache *cachep)
 {
 }
+
+static inline void *kmalloc_no_account(size_t size, gfp_t flags)
+{
+	return kmalloc(size, flags);
+}
+
 #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM && CONFIG_SLAB */
 
 #endif	/* _LINUX_SLAB_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a5593cf..72e83af 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5824,7 +5824,7 @@ memcg_create_cache_enqueue(struct mem_cgroup *memcg, struct kmem_cache *cachep)
 	if (!css_tryget(&memcg->css))
 		return;
 
-	cw = kmalloc(sizeof(struct create_work), GFP_NOWAIT);
+	cw = kmalloc_no_account(sizeof(struct create_work), GFP_NOWAIT);
 	if (cw == NULL) {
 		css_put(&memcg->css);
 		return;
diff --git a/mm/slab.c b/mm/slab.c
index 95b024c..7af7384 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1623,8 +1623,8 @@ void __init kmem_cache_init(void)
 			sizes->cs_cachep = kmem_cache_create(names->name,
 					sizes->cs_size,
 					ARCH_KMALLOC_MINALIGN,
-					ARCH_KMALLOC_FLAGS|SLAB_PANIC,
-					NULL);
+					ARCH_KMALLOC_FLAGS | SLAB_PANIC |
+					SLAB_MEMCG_ACCT, NULL);
 		}
 #ifdef CONFIG_ZONE_DMA
 		sizes->cs_dmacachep = kmem_cache_create(
@@ -3936,11 +3936,16 @@ static __always_inline void *
 __do_kmalloc_node(size_t size, gfp_t flags, int node, void *caller)
 {
 	struct kmem_cache *cachep;
+	void *ret;
 
 	cachep = kmem_find_general_cachep(size, flags);
 	if (unlikely(ZERO_OR_NULL_PTR(cachep)))
 		return cachep;
-	return kmem_cache_alloc_node_trace(size, cachep, flags, node);
+	cachep = mem_cgroup_get_kmem_cache(cachep, flags);
+	ret = kmem_cache_alloc_node_trace(size, cachep, flags, node);
+	mem_cgroup_put_kmem_cache(cachep);
+
+	return ret;
 }
 
 #if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_TRACING)
@@ -3986,14 +3991,31 @@ static __always_inline void *__do_kmalloc(size_t size, gfp_t flags,
 	cachep = __find_general_cachep(size, flags);
 	if (unlikely(ZERO_OR_NULL_PTR(cachep)))
 		return cachep;
+	cachep = mem_cgroup_get_kmem_cache(cachep, flags);
 	ret = __cache_alloc(cachep, flags, caller);
 
 	trace_kmalloc((unsigned long) caller, ret,
 		      size, cachep->buffer_size, flags);
+	mem_cgroup_put_kmem_cache(cachep);
 
 	return ret;
 }
 
+static __always_inline void *
+__do_kmalloc_no_account(size_t size, gfp_t flags, void *caller)
+{
+	struct kmem_cache *cachep;
+	void *ret;
+
+	cachep = __find_general_cachep(size, flags);
+	if (unlikely(ZERO_OR_NULL_PTR(cachep)))
+		return cachep;
+	ret = __cache_alloc(cachep, flags, caller);
+	trace_kmalloc((unsigned long)caller, ret, size, cachep->buffer_size,
+	    flags);
+
+	return ret;
+}
 
 #if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_TRACING)
 void *__kmalloc(size_t size, gfp_t flags)
@@ -4008,12 +4030,26 @@ void *__kmalloc_track_caller(size_t size, gfp_t flags, unsigned long caller)
 }
 EXPORT_SYMBOL(__kmalloc_track_caller);
 
+void *
+kmalloc_no_account(size_t size, gfp_t flags)
+{
+	return __do_kmalloc_no_account(size, flags,
+	    __builtin_return_address(0));
+}
+EXPORT_SYMBOL(kmalloc_no_account);
 #else
 void *__kmalloc(size_t size, gfp_t flags)
 {
 	return __do_kmalloc(size, flags, NULL);
 }
 EXPORT_SYMBOL(__kmalloc);
+
+void *
+kmalloc_no_account(size_t size, gfp_t flags)
+{
+	return __do_kmalloc_no_account(size, flags, NULL);
+}
+EXPORT_SYMBOL(kmalloc_no_account);
 #endif
 
 /**
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v2 10/13] memcg: Track all the memcg children of a kmem_cache.
  2012-03-09 20:39 [PATCH v2 00/13] Memcg Kernel Memory Tracking Suleiman Souhlal
                   ` (8 preceding siblings ...)
  2012-03-09 20:39 ` [PATCH v2 09/13] memcg: Account for kmalloc " Suleiman Souhlal
@ 2012-03-09 20:39 ` Suleiman Souhlal
  2012-03-09 20:39 ` [PATCH v2 11/13] memcg: Handle bypassed kernel memory charges Suleiman Souhlal
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 45+ messages in thread
From: Suleiman Souhlal @ 2012-03-09 20:39 UTC (permalink / raw)
  To: cgroups
  Cc: suleiman, glommer, kamezawa.hiroyu, penberg, cl, yinghan, hughd,
	gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel, Suleiman Souhlal

This enables us to remove all the children of a kmem_cache that is
being destroyed, for example when the kernel module it is used by
gets unloaded. Otherwise, the children would still point to the
destroyed parent.

We also use this to propagate /proc/slabinfo settings to all the
children of a cache, when, for example, changing its batchcount.
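
For illustration, the propagation is just a walk over the cache's
sibling list, re-tuning each child with the same values (a simplified
sketch of the slabinfo_write() hunk below):

	struct mem_cgroup_cache_params *p;
	struct kmem_cache *c;

	list_for_each_entry(p, &cachep->memcg_params.sibling_list,
	    sibling_list) {
		c = container_of(p, struct kmem_cache, memcg_params);
		do_tune_cpucache(c, limit, batchcount, shared, GFP_KERNEL);
	}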

Signed-off-by: Suleiman Souhlal <suleiman@google.com>
---
 include/linux/slab.h |    1 +
 mm/slab.c            |   53 ++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 50 insertions(+), 4 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 906a015..fd1d2ba 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -166,6 +166,7 @@ struct mem_cgroup_cache_params {
 	struct kmem_cache *orig_cache;
 
 	struct list_head destroyed_list; /* Used when deleting cpuset cache */
+	struct list_head sibling_list;
 };
 #endif
 
diff --git a/mm/slab.c b/mm/slab.c
index 7af7384..02239ed 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2511,6 +2511,7 @@ __kmem_cache_create(const char *name, size_t size, size_t align,
 	cachep->memcg_params.orig_cache = NULL;
 	atomic_set(&cachep->refcnt, 1);
 	INIT_LIST_HEAD(&cachep->memcg_params.destroyed_list);
+	INIT_LIST_HEAD(&cachep->memcg_params.sibling_list);
 
 	if (!memcg) {
 		int id;
@@ -2582,6 +2583,8 @@ kmem_cache_create_memcg(struct kmem_cache *cachep, char *name)
 	}
 	new->memcg_params.orig_cache = cachep;
 
+	list_add(&new->memcg_params.sibling_list,
+	    &cachep->memcg_params.sibling_list);
 	if ((cachep->limit != new->limit) ||
 	    (cachep->batchcount != new->batchcount) ||
 	    (cachep->shared != new->shared))
@@ -2769,6 +2772,29 @@ void kmem_cache_destroy(struct kmem_cache *cachep)
 {
 	BUG_ON(!cachep || in_interrupt());
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+	/* Destroy all the children caches if we aren't a memcg cache */
+	if (cachep->memcg_params.id != -1) {
+		struct kmem_cache *c;
+		struct mem_cgroup_cache_params *p, *tmp;
+
+		mutex_lock(&cache_chain_mutex);
+		list_for_each_entry_safe(p, tmp,
+		    &cachep->memcg_params.sibling_list, sibling_list) {
+			c = container_of(p, struct kmem_cache, memcg_params);
+			if (c == cachep)
+				continue;
+			mutex_unlock(&cache_chain_mutex);
+			BUG_ON(c->memcg_params.id != -1);
+			mem_cgroup_remove_child_kmem_cache(c,
+			    cachep->memcg_params.id);
+			kmem_cache_destroy(c);
+			mutex_lock(&cache_chain_mutex);
+		}
+		mutex_unlock(&cache_chain_mutex);
+	}
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
+
 	/* Find the cache in the chain of caches. */
 	get_online_cpus();
 	mutex_lock(&cache_chain_mutex);
@@ -2776,6 +2802,9 @@ void kmem_cache_destroy(struct kmem_cache *cachep)
 	 * the chain is never empty, cache_cache is never destroyed
 	 */
 	list_del(&cachep->next);
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+	list_del(&cachep->memcg_params.sibling_list);
+#endif
 	if (__cache_shrink(cachep)) {
 		slab_error(cachep, "Can't free all objects");
 		list_add(&cachep->next, &cache_chain);
@@ -4654,11 +4683,27 @@ static ssize_t slabinfo_write(struct file *file, const char __user *buffer,
 			if (limit < 1 || batchcount < 1 ||
 					batchcount > limit || shared < 0) {
 				res = 0;
-			} else {
-				res = do_tune_cpucache(cachep, limit,
-						       batchcount, shared,
-						       GFP_KERNEL);
+				break;
 			}
+
+			res = do_tune_cpucache(cachep, limit, batchcount,
+			    shared, GFP_KERNEL);
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+			{
+				struct kmem_cache *c;
+				struct mem_cgroup_cache_params *p;
+
+				list_for_each_entry(p,
+				    &cachep->memcg_params.sibling_list,
+				    sibling_list) {
+					c = container_of(p, struct kmem_cache,
+					    memcg_params);
+					do_tune_cpucache(c, limit, batchcount,
+					    shared, GFP_KERNEL);
+				}
+			}
+#endif
 			break;
 		}
 	}
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v2 11/13] memcg: Handle bypassed kernel memory charges.
  2012-03-09 20:39 [PATCH v2 00/13] Memcg Kernel Memory Tracking Suleiman Souhlal
                   ` (9 preceding siblings ...)
  2012-03-09 20:39 ` [PATCH v2 10/13] memcg: Track all the memcg children of a kmem_cache Suleiman Souhlal
@ 2012-03-09 20:39 ` Suleiman Souhlal
  2012-03-09 20:39 ` [PATCH v2 12/13] memcg: Per-memcg memory.kmem.slabinfo file Suleiman Souhlal
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 45+ messages in thread
From: Suleiman Souhlal @ 2012-03-09 20:39 UTC (permalink / raw)
  To: cgroups
  Cc: suleiman, glommer, kamezawa.hiroyu, penberg, cl, yinghan, hughd,
	gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel, Suleiman Souhlal

When __mem_cgroup_try_charge() decides to bypass a slab charge
(because we are getting OOM killed or have a fatal signal pending),
we may end up with a slab that belongs to a memcg, but wasn't
charged to it. When we free such a slab page, we end up uncharging
it from the memcg, even though it was never charged, which may
lead to res_counter underflows.

To avoid this, when a charge is bypassed, we force the charge,
without checking for the bypass conditions or doing any reclaim.
This may cause the cgroup's usage to temporarily go above its limit.
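
The resulting charge flow in memcg_charge_kmem() can be summarized as
follows (a condensed sketch of the hunk below):

	ret = __mem_cgroup_try_charge(NULL, gfp, delta / PAGE_SIZE,
	    &_memcg, may_oom);
	if (ret == -ENOMEM)
		return ret;	/* Hard failure: fail the allocation. */
	else if (ret == -EINTR) {
		/*
		 * The charge was bypassed (OOM kill or fatal signal):
		 * force it, possibly going above the limit.
		 */
		ret = res_counter_charge_nofail(&memcg->res, delta,
		    &fail_res);
	}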

Signed-off-by: Suleiman Souhlal <suleiman@google.com>
---
 mm/memcontrol.c |   15 +++++++++++++--
 1 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 72e83af..9f5e9d8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5672,16 +5672,27 @@ memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, long long delta)
 
 	ret = 0;
 
-	_memcg = memcg;
 	if (memcg && !mem_cgroup_test_flag(memcg,
 	    MEMCG_INDEPENDENT_KMEM_LIMIT)) {
+		_memcg = memcg;
 		ret = __mem_cgroup_try_charge(NULL, gfp, delta / PAGE_SIZE,
 		    &_memcg, may_oom);
 		if (ret == -ENOMEM)
 			return ret;
+		else if (ret == -EINTR) {
+			/*
+			 * __mem_cgroup_try_charge() chose to bypass to root due
+			 * to OOM kill or fatal signal.
+			 * Since our only options are to either fail the
+			 * allocation or charge it to this cgroup, force the
+			 * charge, going above the limit if needed.
+			 */
+			ret = res_counter_charge_nofail(&memcg->res, delta,
+			    &fail_res);
+		}
 	}
 
-	if (memcg && _memcg == memcg)
+	if (memcg)
 		ret = res_counter_charge(&memcg->kmem, delta, &fail_res);
 
 	return ret;
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v2 12/13] memcg: Per-memcg memory.kmem.slabinfo file.
  2012-03-09 20:39 [PATCH v2 00/13] Memcg Kernel Memory Tracking Suleiman Souhlal
                   ` (10 preceding siblings ...)
  2012-03-09 20:39 ` [PATCH v2 11/13] memcg: Handle bypassed kernel memory charges Suleiman Souhlal
@ 2012-03-09 20:39 ` Suleiman Souhlal
  2012-03-11 10:35   ` Glauber Costa
  2012-03-09 20:39 ` [PATCH v2 13/13] memcg: Document kernel memory accounting Suleiman Souhlal
  2012-03-10  6:25 ` [PATCH v2 00/13] Memcg Kernel Memory Tracking Suleiman Souhlal
  13 siblings, 1 reply; 45+ messages in thread
From: Suleiman Souhlal @ 2012-03-09 20:39 UTC (permalink / raw)
  To: cgroups
  Cc: suleiman, glommer, kamezawa.hiroyu, penberg, cl, yinghan, hughd,
	gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel, Suleiman Souhlal

This file shows all the kmem_caches used by a memcg.
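
For illustration, the output follows the seq_printf() format used in
the hunk below; with a cgroup named "foo" and made-up numbers it would
look roughly like this (per-cgroup caches are named with the cgroup's
name in parentheses, as described in patch 13):

	# name            <active_objs> <num_objs> <objsize>
	dentry(foo)           112    145    192
	size-64(foo)           40     59     64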

Signed-off-by: Suleiman Souhlal <suleiman@google.com>
---
 include/linux/slab.h     |    6 +++
 include/linux/slab_def.h |    1 +
 mm/memcontrol.c          |   18 +++++++++
 mm/slab.c                |   88 ++++++++++++++++++++++++++++++++++------------
 4 files changed, 90 insertions(+), 23 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index fd1d2ba..0ff5ee2 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -401,6 +401,12 @@ static inline void *kmalloc_no_account(size_t size, gfp_t flags)
 	return kmalloc(size, flags);
 }
 
+static inline int
+mem_cgroup_slabinfo(struct mem_cgroup *mem, struct seq_file *m)
+{
+	return 0;
+}
+
 #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM && CONFIG_SLAB */
 
 #endif	/* _LINUX_SLAB_H */
diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
index 248b8a9..fa6b272 100644
--- a/include/linux/slab_def.h
+++ b/include/linux/slab_def.h
@@ -227,6 +227,7 @@ found:
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 
 void kmem_cache_drop_ref(struct kmem_cache *cachep);
+int mem_cgroup_slabinfo(struct mem_cgroup *mem, struct seq_file *m);
 
 static inline void
 kmem_cache_get_ref(struct kmem_cache *cachep)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9f5e9d8..4a4fa48 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4672,6 +4672,20 @@ static int mem_cgroup_independent_kmem_limit_write(struct cgroup *cgrp,
 	return 0;
 }
 
+static int
+mem_cgroup_slabinfo_show(struct cgroup *cgroup, struct cftype *ctf,
+    struct seq_file *m)
+{
+	struct mem_cgroup *mem;
+
+	mem = mem_cgroup_from_cont(cgroup);
+
+	if (mem == root_mem_cgroup)
+		mem = NULL;
+
+	return mem_cgroup_slabinfo(mem, m);
+}
+
 static struct cftype kmem_cgroup_files[] = {
 	{
 		.name = "kmem.independent_kmem_limit",
@@ -4689,6 +4703,10 @@ static struct cftype kmem_cgroup_files[] = {
 		.private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
 		.read_u64 = mem_cgroup_read,
 	},
+	{
+		.name = "kmem.slabinfo",
+		.read_seq_string = mem_cgroup_slabinfo_show,
+	},
 };
 
 static int register_kmem_files(struct cgroup *cont, struct cgroup_subsys *ss)
diff --git a/mm/slab.c b/mm/slab.c
index 02239ed..1b35799 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -4528,21 +4528,26 @@ static void s_stop(struct seq_file *m, void *p)
 	mutex_unlock(&cache_chain_mutex);
 }
 
-static int s_show(struct seq_file *m, void *p)
-{
-	struct kmem_cache *cachep = list_entry(p, struct kmem_cache, next);
-	struct slab *slabp;
+struct slab_counts {
 	unsigned long active_objs;
+	unsigned long active_slabs;
+	unsigned long num_slabs;
+	unsigned long free_objects;
+	unsigned long shared_avail;
 	unsigned long num_objs;
-	unsigned long active_slabs = 0;
-	unsigned long num_slabs, free_objects = 0, shared_avail = 0;
-	const char *name;
-	char *error = NULL;
-	int node;
+};
+
+static char *
+get_slab_counts(struct kmem_cache *cachep, struct slab_counts *c)
+{
 	struct kmem_list3 *l3;
+	struct slab *slabp;
+	char *error;
+	int node;
+
+	error = NULL;
+	memset(c, 0, sizeof(struct slab_counts));
 
-	active_objs = 0;
-	num_slabs = 0;
 	for_each_online_node(node) {
 		l3 = cachep->nodelists[node];
 		if (!l3)
@@ -4554,31 +4559,43 @@ static int s_show(struct seq_file *m, void *p)
 		list_for_each_entry(slabp, &l3->slabs_full, list) {
 			if (slabp->inuse != cachep->num && !error)
 				error = "slabs_full accounting error";
-			active_objs += cachep->num;
-			active_slabs++;
+			c->active_objs += cachep->num;
+			c->active_slabs++;
 		}
 		list_for_each_entry(slabp, &l3->slabs_partial, list) {
 			if (slabp->inuse == cachep->num && !error)
 				error = "slabs_partial inuse accounting error";
 			if (!slabp->inuse && !error)
 				error = "slabs_partial/inuse accounting error";
-			active_objs += slabp->inuse;
-			active_slabs++;
+			c->active_objs += slabp->inuse;
+			c->active_slabs++;
 		}
 		list_for_each_entry(slabp, &l3->slabs_free, list) {
 			if (slabp->inuse && !error)
 				error = "slabs_free/inuse accounting error";
-			num_slabs++;
+			c->num_slabs++;
 		}
-		free_objects += l3->free_objects;
+		c->free_objects += l3->free_objects;
 		if (l3->shared)
-			shared_avail += l3->shared->avail;
+			c->shared_avail += l3->shared->avail;
 
 		spin_unlock_irq(&l3->list_lock);
 	}
-	num_slabs += active_slabs;
-	num_objs = num_slabs * cachep->num;
-	if (num_objs - active_objs != free_objects && !error)
+	c->num_slabs += c->active_slabs;
+	c->num_objs = c->num_slabs * cachep->num;
+
+	return error;
+}
+
+static int s_show(struct seq_file *m, void *p)
+{
+	struct kmem_cache *cachep = list_entry(p, struct kmem_cache, next);
+	struct slab_counts c;
+	const char *name;
+	char *error;
+
+	error = get_slab_counts(cachep, &c);
+	if (c.num_objs - c.active_objs != c.free_objects && !error)
 		error = "free_objects accounting error";
 
 	name = cachep->name;
@@ -4586,12 +4603,12 @@ static int s_show(struct seq_file *m, void *p)
 		printk(KERN_ERR "slab: cache %s error: %s\n", name, error);
 
 	seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d",
-		   name, active_objs, num_objs, cachep->buffer_size,
+		   name, c.active_objs, c.num_objs, cachep->buffer_size,
 		   cachep->num, (1 << cachep->gfporder));
 	seq_printf(m, " : tunables %4u %4u %4u",
 		   cachep->limit, cachep->batchcount, cachep->shared);
 	seq_printf(m, " : slabdata %6lu %6lu %6lu",
-		   active_slabs, num_slabs, shared_avail);
+		   c.active_slabs, c.num_slabs, c.shared_avail);
 #if STATS
 	{			/* list3 stats */
 		unsigned long high = cachep->high_mark;
@@ -4726,6 +4743,31 @@ static const struct file_operations proc_slabinfo_operations = {
 	.release	= seq_release,
 };
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+int
+mem_cgroup_slabinfo(struct mem_cgroup *mem, struct seq_file *m)
+{
+	struct kmem_cache *cachep;
+	struct slab_counts c;
+
+	seq_printf(m, "# name            <active_objs> <num_objs> <objsize>\n");
+
+	mutex_lock(&cache_chain_mutex);
+	list_for_each_entry(cachep, &cache_chain, next) {
+		if (cachep->memcg_params.memcg != mem)
+			continue;
+
+		get_slab_counts(cachep, &c);
+
+		seq_printf(m, "%-17s %6lu %6lu %6u\n", cachep->name,
+		   c.active_objs, c.num_objs, cachep->buffer_size);
+	}
+	mutex_unlock(&cache_chain_mutex);
+
+	return 0;
+}
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
+
 #ifdef CONFIG_DEBUG_SLAB_LEAK
 
 static void *leaks_start(struct seq_file *m, loff_t *pos)
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v2 13/13] memcg: Document kernel memory accounting.
  2012-03-09 20:39 [PATCH v2 00/13] Memcg Kernel Memory Tracking Suleiman Souhlal
                   ` (11 preceding siblings ...)
  2012-03-09 20:39 ` [PATCH v2 12/13] memcg: Per-memcg memory.kmem.slabinfo file Suleiman Souhlal
@ 2012-03-09 20:39 ` Suleiman Souhlal
  2012-03-11 10:42   ` Glauber Costa
  2012-03-10  6:25 ` [PATCH v2 00/13] Memcg Kernel Memory Tracking Suleiman Souhlal
  13 siblings, 1 reply; 45+ messages in thread
From: Suleiman Souhlal @ 2012-03-09 20:39 UTC (permalink / raw)
  To: cgroups
  Cc: suleiman, glommer, kamezawa.hiroyu, penberg, cl, yinghan, hughd,
	gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel, Suleiman Souhlal

Signed-off-by: Suleiman Souhlal <suleiman@google.com>
---
 Documentation/cgroups/memory.txt |   44 ++++++++++++++++++++++++++++++++++---
 1 files changed, 40 insertions(+), 4 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 4c95c00..73f2e38 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -74,6 +74,11 @@ Brief summary of control files.
 
  memory.kmem.tcp.limit_in_bytes  # set/show hard limit for tcp buf memory
  memory.kmem.tcp.usage_in_bytes  # show current tcp buf memory allocation
+ memory.kmem.usage_in_bytes	 # show current kernel memory usage
+ memory.kmem.limit_in_bytes	 # show/set limit of kernel memory usage
+ memory.kmem.independent_kmem_limit # show/set whether kernel memory is limited
+ 				    independently of user memory (See 2.7 for details)
+ memory.kmem.slabinfo		 # show cgroup's slabinfo
 
 1. History
 
@@ -265,11 +270,19 @@ the amount of kernel memory used by the system. Kernel memory is fundamentally
 different than user memory, since it can't be swapped out, which makes it
 possible to DoS the system by consuming too much of this precious resource.
 
-Kernel memory limits are not imposed for the root cgroup. Usage for the root
-cgroup may or may not be accounted.
+Kernel memory limits are not imposed for the root cgroup.
 
-Currently no soft limit is implemented for kernel memory. It is future work
-to trigger slab reclaim when those limits are reached.
+A cgroup's kernel memory is counted into its memory.kmem.usage_in_bytes.
+
+memory.kmem.independent_kmem_limit controls whether or not kernel memory
+should also be counted into the cgroup's memory.usage_in_bytes.
+If it is set, it is possible to specify a limit for kernel memory with
+memory.kmem.limit_in_bytes.
+
+Upon cgroup deletion, all the remaining kernel memory becomes unaccounted.
+
+An accounted kernel memory allocation may trigger reclaim in that cgroup,
+and may also OOM.
 
 2.7.1 Current Kernel Memory resources accounted
 
@@ -279,6 +292,29 @@ per cgroup, instead of globally.
 
 * tcp memory pressure: sockets memory pressure for the tcp protocol.
 
+* slab memory.
+
+2.7.1.1 Slab memory accounting
+
+Any slab type created with the SLAB_MEMCG_ACCT kmem_cache_create() flag
+is accounted.
+
+Slab gets accounted on a per-page basis, which is done by using per-cgroup
+kmem_caches. These per-cgroup kmem_caches get created on-demand, the first
+time a specific kmem_cache gets used by a cgroup.
+
+Only slab memory that can be attributed to a cgroup gets accounted in this
+fashion.
+
+A per-cgroup kmem_cache is named like the original, with the cgroup's name
+in parentheses.
+
+When a cgroup is destroyed, all its kmem_caches get migrated to the root
+cgroup, and "dead" is appended to their name, to indicate that they are not
+going to be used for new allocations.
+These dead caches automatically get removed once there are no more active
+slab objects in them.
+
 3. User Interface
 
 0. Configuration
-- 
1.7.7.3


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 00/13] Memcg Kernel Memory Tracking.
  2012-03-09 20:39 [PATCH v2 00/13] Memcg Kernel Memory Tracking Suleiman Souhlal
                   ` (12 preceding siblings ...)
  2012-03-09 20:39 ` [PATCH v2 13/13] memcg: Document kernel memory accounting Suleiman Souhlal
@ 2012-03-10  6:25 ` Suleiman Souhlal
  13 siblings, 0 replies; 45+ messages in thread
From: Suleiman Souhlal @ 2012-03-10  6:25 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: cgroups, glommer, kamezawa.hiroyu, penberg, cl, yinghan, hughd,
	gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel

On Fri, Mar 9, 2012 at 12:39 PM, Suleiman Souhlal <ssouhlal@freebsd.org> wrote:
> This is v2 of my kernel memory tracking patchset for memcg.

I just realized that I forgot to test without the config option
enabled, so that doesn't compile. :-(

I will make sure to fix this in v3 and test more thoroughly.

Hopefully this shouldn't impede discussion on the patchset.

Sorry about that.
-- Suleiman

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 01/13] memcg: Consolidate various flags into a single flags field.
  2012-03-09 20:39 ` [PATCH v2 01/13] memcg: Consolidate various flags into a single flags field Suleiman Souhlal
@ 2012-03-11  7:50   ` Glauber Costa
  0 siblings, 0 replies; 45+ messages in thread
From: Glauber Costa @ 2012-03-11  7:50 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: cgroups, suleiman, kamezawa.hiroyu, penberg, cl, yinghan, hughd,
	gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel

On 03/10/2012 12:39 AM, Suleiman Souhlal wrote:
> Since there is an ever-increasing number of flags in the memcg
> struct, consolidate them into a single flags field.
> The flags that we consolidate are:
>      - use_hierarchy
>      - memsw_is_minimum
>      - oom_kill_disable
>
> Signed-off-by: Suleiman Souhlal<suleiman@google.com>

> ---
>   mm/memcontrol.c |  112 ++++++++++++++++++++++++++++++++++++++-----------------
>   1 files changed, 78 insertions(+), 34 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 5585dc3..37ad2cb 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -213,6 +213,15 @@ struct mem_cgroup_eventfd_list {
>   static void mem_cgroup_threshold(struct mem_cgroup *memcg);
>   static void mem_cgroup_oom_notify(struct mem_cgroup *memcg);
>
> +enum memcg_flags {
> +	MEMCG_USE_HIERARCHY,	/*
> +				 * Should the accounting and control be
> +				 * hierarchical, per subtree?
> +				 */
Perhaps we should use the opportunity to make this comment shorter, so 
it can fit in a single line. How about:

/* per-subtree hierarchical accounting */ ?

>
> +static inline bool
> +mem_cgroup_test_flag(const struct mem_cgroup *memcg, enum memcg_flags flag)
> +{
> +	return test_bit(flag,&memcg->flags);
> +}
> +
> +static inline void
> +mem_cgroup_set_flag(struct mem_cgroup *memcg, enum memcg_flags flag)
> +{
> +	set_bit(flag,&memcg->flags);
> +}
> +
> +static inline void
> +mem_cgroup_clear_flag(struct mem_cgroup *memcg, enum memcg_flags flag)
> +{
> +	clear_bit(flag,&memcg->flags);
> +}
> +
>   /* Writing them here to avoid exposing memcg's inner layout */
>   #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>   #include<net/sock.h>
> @@ -876,7 +895,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>   	if (prev&&  prev != root)
>   		css_put(&prev->css);
>
> -	if (!root->use_hierarchy&&  root != root_mem_cgroup) {
> +	if (!mem_cgroup_test_flag(root, MEMCG_USE_HIERARCHY)&&  root !=
> +	    root_mem_cgroup) {
>   		if (prev)
>   			return NULL;
>   		return root;

Although I like this change in principle, the end result is that many of
the lines we are touching used to be single lines and now span two rows.
I know not everybody cares about that, but maybe we could provide
functions specific to each flag?

For the record, that is what cgroup.c does:

static int notify_on_release(const struct cgroup *cgrp)
{
         return test_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);
}

static int clone_children(const struct cgroup *cgrp)
{
         return test_bit(CGRP_CLONE_CHILDREN, &cgrp->flags);
}
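
Applied to the flags introduced in this patch, such per-flag helpers
could look like this (just a sketch of the suggestion, not something I
have tested):

static inline bool mem_cgroup_use_hierarchy(const struct mem_cgroup *memcg)
{
        return test_bit(MEMCG_USE_HIERARCHY, &memcg->flags);
}

static inline bool mem_cgroup_oom_kill_disabled(const struct mem_cgroup *memcg)
{
        return test_bit(MEMCG_OOM_KILL_DISABLE, &memcg->flags);
}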


> @@ -1126,8 +1146,8 @@ static bool mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
>   		struct mem_cgroup *memcg)
>   {
>   	if (root_memcg != memcg) {
> -		return (root_memcg->use_hierarchy&&
> -			css_is_ancestor(&memcg->css,&root_memcg->css));
> +		return mem_cgroup_test_flag(root_memcg, MEMCG_USE_HIERARCHY)&&
> +			css_is_ancestor(&memcg->css,&root_memcg->css);
>   	}
>
>   	return true;
> @@ -1460,7 +1480,8 @@ static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,
>
>   	if (flags&  MEM_CGROUP_RECLAIM_NOSWAP)
>   		noswap = true;
> -	if (!(flags&  MEM_CGROUP_RECLAIM_SHRINK)&&  memcg->memsw_is_minimum)
> +	if (!(flags&  MEM_CGROUP_RECLAIM_SHRINK)&&  mem_cgroup_test_flag(memcg,
> +	    MEMCG_MEMSW_IS_MINIMUM))
>   		noswap = true;
>
>   	for (loop = 0; loop<  MEM_CGROUP_MAX_RECLAIM_LOOPS; loop++) {
> @@ -1813,7 +1834,7 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
>   	 * under OOM is always welcomed, use TASK_KILLABLE here.
>   	 */
>   	prepare_to_wait(&memcg_oom_waitq,&owait.wait, TASK_KILLABLE);
> -	if (!locked || memcg->oom_kill_disable)
> +	if (!locked || mem_cgroup_test_flag(memcg, MEMCG_OOM_KILL_DISABLE))
>   		need_to_kill = false;
>   	if (locked)
>   		mem_cgroup_oom_notify(memcg);
> @@ -3416,9 +3437,11 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
>   		ret = res_counter_set_limit(&memcg->res, val);
>   		if (!ret) {
>   			if (memswlimit == val)
> -				memcg->memsw_is_minimum = true;
> +				mem_cgroup_set_flag(memcg,
> +				    MEMCG_MEMSW_IS_MINIMUM);
>   			else
> -				memcg->memsw_is_minimum = false;
> +				mem_cgroup_clear_flag(memcg,
> +				    MEMCG_MEMSW_IS_MINIMUM);
>   		}
>   		mutex_unlock(&set_limit_mutex);
>
> @@ -3475,9 +3498,11 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
>   		ret = res_counter_set_limit(&memcg->memsw, val);
>   		if (!ret) {
>   			if (memlimit == val)
> -				memcg->memsw_is_minimum = true;
> +				mem_cgroup_set_flag(memcg,
> +				    MEMCG_MEMSW_IS_MINIMUM);
>   			else
> -				memcg->memsw_is_minimum = false;
> +				mem_cgroup_clear_flag(memcg,
> +				    MEMCG_MEMSW_IS_MINIMUM);
>   		}
>   		mutex_unlock(&set_limit_mutex);
>
> @@ -3745,7 +3770,8 @@ int mem_cgroup_force_empty_write(struct cgroup *cont, unsigned int event)
>
>   static u64 mem_cgroup_hierarchy_read(struct cgroup *cont, struct cftype *cft)
>   {
> -	return mem_cgroup_from_cont(cont)->use_hierarchy;
> +	return mem_cgroup_test_flag(mem_cgroup_from_cont(cont),
> +	    MEMCG_USE_HIERARCHY);
>   }
>
>   static int mem_cgroup_hierarchy_write(struct cgroup *cont, struct cftype *cft,
> @@ -3768,10 +3794,14 @@ static int mem_cgroup_hierarchy_write(struct cgroup *cont, struct cftype *cft,
>   	 * For the root cgroup, parent_mem is NULL, we allow value to be
>   	 * set if there are no children.
>   	 */
> -	if ((!parent_memcg || !parent_memcg->use_hierarchy)&&
> -				(val == 1 || val == 0)) {
> +	if ((!parent_memcg || !mem_cgroup_test_flag(parent_memcg,
> +	    MEMCG_USE_HIERARCHY))&&  (val == 1 || val == 0)) {
>   		if (list_empty(&cont->children))
> -			memcg->use_hierarchy = val;
> +			if (val)
> +				mem_cgroup_set_flag(memcg, MEMCG_USE_HIERARCHY);
> +			else
> +				mem_cgroup_clear_flag(memcg,
> +				    MEMCG_USE_HIERARCHY);
>   		else
>   			retval = -EBUSY;
>   	} else
> @@ -3903,13 +3933,13 @@ static void memcg_get_hierarchical_limit(struct mem_cgroup *memcg,
>   	min_limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
>   	min_memsw_limit = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
>   	cgroup = memcg->css.cgroup;
> -	if (!memcg->use_hierarchy)
> +	if (!mem_cgroup_test_flag(memcg, MEMCG_USE_HIERARCHY))
>   		goto out;
>
>   	while (cgroup->parent) {
>   		cgroup = cgroup->parent;
>   		memcg = mem_cgroup_from_cont(cgroup);
> -		if (!memcg->use_hierarchy)
> +		if (!mem_cgroup_test_flag(memcg, MEMCG_USE_HIERARCHY))
>   			break;
>   		tmp = res_counter_read_u64(&memcg->res, RES_LIMIT);
>   		min_limit = min(min_limit, tmp);
> @@ -4206,8 +4236,9 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
>   	cgroup_lock();
>
>   	/* If under hierarchy, only empty-root can set this value */
> -	if ((parent->use_hierarchy) ||
> -	    (memcg->use_hierarchy&&  !list_empty(&cgrp->children))) {
> +	if (mem_cgroup_test_flag(parent, MEMCG_USE_HIERARCHY) ||
> +	    (mem_cgroup_test_flag(memcg, MEMCG_USE_HIERARCHY)&&
> +	    !list_empty(&cgrp->children))) {
>   		cgroup_unlock();
>   		return -EINVAL;
>   	}
> @@ -4518,7 +4549,8 @@ static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
>   {
>   	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
>
> -	cb->fill(cb, "oom_kill_disable", memcg->oom_kill_disable);
> +	cb->fill(cb, "oom_kill_disable", mem_cgroup_test_flag(memcg,
> +	    MEMCG_OOM_KILL_DISABLE));
>
>   	if (atomic_read(&memcg->under_oom))
>   		cb->fill(cb, "under_oom", 1);
> @@ -4541,14 +4573,18 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
>
>   	cgroup_lock();
>   	/* oom-kill-disable is a flag for subhierarchy. */
> -	if ((parent->use_hierarchy) ||
> -	    (memcg->use_hierarchy&&  !list_empty(&cgrp->children))) {
> +	if (mem_cgroup_test_flag(parent, MEMCG_USE_HIERARCHY) ||
> +	    (mem_cgroup_test_flag(memcg, MEMCG_USE_HIERARCHY)&&
> +	    !list_empty(&cgrp->children))) {
>   		cgroup_unlock();
>   		return -EINVAL;
>   	}
> -	memcg->oom_kill_disable = val;
> -	if (!val)
> +	if (val)
> +		mem_cgroup_set_flag(memcg, MEMCG_OOM_KILL_DISABLE);
> +	else {
> +		mem_cgroup_clear_flag(memcg, MEMCG_OOM_KILL_DISABLE);
>   		memcg_oom_recover(memcg);
> +	}
>   	cgroup_unlock();
>   	return 0;
>   }
> @@ -4916,11 +4952,19 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>   		hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
>   	} else {
>   		parent = mem_cgroup_from_cont(cont->parent);
> -		memcg->use_hierarchy = parent->use_hierarchy;
> -		memcg->oom_kill_disable = parent->oom_kill_disable;
> +
> +		if (mem_cgroup_test_flag(parent, MEMCG_USE_HIERARCHY))
> +			mem_cgroup_set_flag(memcg, MEMCG_USE_HIERARCHY);
> +		else
> +			mem_cgroup_clear_flag(memcg, MEMCG_USE_HIERARCHY);
> +
> +		if (mem_cgroup_test_flag(parent, MEMCG_OOM_KILL_DISABLE))
> +			mem_cgroup_set_flag(memcg, MEMCG_OOM_KILL_DISABLE);
> +		else
> +			mem_cgroup_clear_flag(memcg, MEMCG_OOM_KILL_DISABLE);
>   	}
>
> -	if (parent&&  parent->use_hierarchy) {
> +	if (parent&&  mem_cgroup_test_flag(parent, MEMCG_USE_HIERARCHY)) {
>   		res_counter_init(&memcg->res,&parent->res);
>   		res_counter_init(&memcg->memsw,&parent->memsw);
>   		/*


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 02/13] memcg: Kernel memory accounting infrastructure.
  2012-03-09 20:39 ` [PATCH v2 02/13] memcg: Kernel memory accounting infrastructure Suleiman Souhlal
@ 2012-03-11  8:12   ` Glauber Costa
  2012-03-13  6:24     ` KAMEZAWA Hiroyuki
  2012-03-12 12:38   ` Glauber Costa
  1 sibling, 1 reply; 45+ messages in thread
From: Glauber Costa @ 2012-03-11  8:12 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: cgroups, suleiman, kamezawa.hiroyu, penberg, cl, yinghan, hughd,
	gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel

On 03/10/2012 12:39 AM, Suleiman Souhlal wrote:
> Enabled with CONFIG_CGROUP_MEM_RES_CTLR_KMEM.
>
> Adds the following files:
>      - memory.kmem.independent_kmem_limit
>      - memory.kmem.usage_in_bytes
>      - memory.kmem.limit_in_bytes
>
> Signed-off-by: Suleiman Souhlal<suleiman@google.com>
> ---
>   mm/memcontrol.c |  136 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>   1 files changed, 135 insertions(+), 1 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 37ad2cb..e6fd558 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -220,6 +220,10 @@ enum memcg_flags {
>   				 */
>   	MEMCG_MEMSW_IS_MINIMUM,	/* Set when res.limit == memsw.limit */
>   	MEMCG_OOM_KILL_DISABLE,	/* OOM-Killer disable */
> +	MEMCG_INDEPENDENT_KMEM_LIMIT,	/*
> +					 * kernel memory is not counted in
> +					 * memory.usage_in_bytes
> +					 */
>   };
>
>   /*
> @@ -244,6 +248,10 @@ struct mem_cgroup {
>   	 */
>   	struct res_counter memsw;
>   	/*
> +	 * the counter to account for kernel memory usage.
> +	 */
> +	struct res_counter kmem;
> +	/*
>   	 * Per cgroup active and inactive list, similar to the
>   	 * per zone LRU lists.
>   	 */
> @@ -355,6 +363,7 @@ enum charge_type {
>   #define _MEM			(0)
>   #define _MEMSWAP		(1)
>   #define _OOM_TYPE		(2)
> +#define _KMEM			(3)
>   #define MEMFILE_PRIVATE(x, val)	(((x)<<  16) | (val))
>   #define MEMFILE_TYPE(val)	(((val)>>  16)&  0xffff)
>   #define MEMFILE_ATTR(val)	((val)&  0xffff)
> @@ -371,6 +380,8 @@ enum charge_type {
>
>   static void mem_cgroup_get(struct mem_cgroup *memcg);
>   static void mem_cgroup_put(struct mem_cgroup *memcg);
> +static void memcg_kmem_init(struct mem_cgroup *memcg,
> +    struct mem_cgroup *parent);
>
>   static inline bool
>   mem_cgroup_test_flag(const struct mem_cgroup *memcg, enum memcg_flags flag)
> @@ -1435,6 +1446,10 @@ done:
>   		res_counter_read_u64(&memcg->memsw, RES_USAGE)>>  10,
>   		res_counter_read_u64(&memcg->memsw, RES_LIMIT)>>  10,
>   		res_counter_read_u64(&memcg->memsw, RES_FAILCNT));
> +	printk(KERN_INFO "kmem: usage %llukB, limit %llukB, failcnt %llu\n",
> +		res_counter_read_u64(&memcg->kmem, RES_USAGE)>>  10,
> +		res_counter_read_u64(&memcg->kmem, RES_LIMIT)>>  10,
> +		res_counter_read_u64(&memcg->kmem, RES_FAILCNT));
>   }
>
>   /*
> @@ -3868,6 +3883,9 @@ static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
>   		else
>   			val = res_counter_read_u64(&memcg->memsw, name);
>   		break;
> +	case _KMEM:
> +		val = res_counter_read_u64(&memcg->kmem, name);
> +		break;
>   	default:
>   		BUG();
>   		break;
> @@ -3900,8 +3918,15 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
>   			break;
>   		if (type == _MEM)
>   			ret = mem_cgroup_resize_limit(memcg, val);
> -		else
> +		else if (type == _MEMSWAP)
>   			ret = mem_cgroup_resize_memsw_limit(memcg, val);
> +		else if (type == _KMEM) {
> +			if (!mem_cgroup_test_flag(memcg,
> +			    MEMCG_INDEPENDENT_KMEM_LIMIT))
> +				return -EINVAL;
> +			ret = res_counter_set_limit(&memcg->kmem, val);
> +		} else
> +			return -EINVAL;
>   		break;
>   	case RES_SOFT_LIMIT:
>   		ret = res_counter_memparse_write_strategy(buffer,&val);
> @@ -4606,8 +4631,56 @@ static int mem_control_numa_stat_open(struct inode *unused, struct file *file)
>   #endif /* CONFIG_NUMA */
>
>   #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> +static u64
> +mem_cgroup_independent_kmem_limit_read(struct cgroup *cgrp, struct cftype *cft)
> +{
> +	return mem_cgroup_test_flag(mem_cgroup_from_cont(cgrp),
> +	    MEMCG_INDEPENDENT_KMEM_LIMIT);
> +}
> +
> +static int mem_cgroup_independent_kmem_limit_write(struct cgroup *cgrp,
> +    struct cftype *cft, u64 val)
> +{
> +	struct mem_cgroup *memcg;
> +
> +	memcg = mem_cgroup_from_cont(cgrp);
> +	if (val)
> +		mem_cgroup_set_flag(memcg, MEMCG_INDEPENDENT_KMEM_LIMIT);
> +	else {
> +		mem_cgroup_clear_flag(memcg, MEMCG_INDEPENDENT_KMEM_LIMIT);
> +		res_counter_set_limit(&memcg->kmem, RESOURCE_MAX);
> +	}
> +
> +	return 0;
> +}

We need this test to be a bit more strict.
This is what I have in my current version:

struct mem_cgroup *memcg = mem_cgroup_from_cont(cgroup);
struct mem_cgroup *parent = parent_mem_cgroup(memcg);

val = !!val;

if (!parent || !parent->use_hierarchy || mem_cgroup_is_root(parent)) {
     if (list_empty(&cgroup->children))
         memcg->kmem_independent_accounting = val;
     else
         return -EBUSY;
} else
     return -EINVAL;

return 0;

Compared to yours, it basically:
  * makes sure this has no effect on the root cgroup;
  * disallows changing it when the cgroup already has children.

Also, I have a TODO item on that: We need to make sure that no memory
was already accounted. It was a bit tricky for me, because I was not
charging kmem when !independent. But since you are always charging kmem,
and you are so far facing no opposition on that particular point,
maybe you can also test if RES_USAGE == 0 for memcg->kmem.
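
Something along these lines in the independent_kmem_limit write
handler, I suppose (untested sketch):

	/* Refuse to switch modes once kernel memory has been charged. */
	if (res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0)
		return -EBUSY;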

> +static struct cftype kmem_cgroup_files[] = {
> +	{
> +		.name = "kmem.independent_kmem_limit",
> +		.write_u64 = mem_cgroup_independent_kmem_limit_write,
> +		.read_u64 = mem_cgroup_independent_kmem_limit_read,
> +	},
> +	{
> +		.name = "kmem.limit_in_bytes",
> +		.private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
> +		.write_string = mem_cgroup_write,
> +		.read_u64 = mem_cgroup_read,
> +	},
> +	{
> +		.name = "kmem.usage_in_bytes",
> +		.private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
> +		.read_u64 = mem_cgroup_read,
> +	},
> +};
> +
>   static int register_kmem_files(struct cgroup *cont, struct cgroup_subsys *ss)
>   {
> +	int ret;
> +
> +	ret = cgroup_add_files(cont, ss, kmem_cgroup_files,
> +	    ARRAY_SIZE(kmem_cgroup_files));
> +	if (ret)
> +		return ret;
>   	/*
>   	 * Part of this would be better living in a separate allocation
>   	 * function, leaving us with just the cgroup tree population work.
> @@ -4621,6 +4694,10 @@ static int register_kmem_files(struct cgroup *cont, struct cgroup_subsys *ss)
>   static void kmem_cgroup_destroy(struct cgroup_subsys *ss,
>   				struct cgroup *cont)
>   {
> +	struct mem_cgroup *memcg;
> +
> +	memcg = mem_cgroup_from_cont(cont);
> +	BUG_ON(res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0);
>   	mem_cgroup_sockets_destroy(cont, ss);
>   }
>   #else
> @@ -4980,6 +5057,8 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>   	}
>   	memcg->last_scanned_node = MAX_NUMNODES;
>   	INIT_LIST_HEAD(&memcg->oom_notify);
> +	memcg_kmem_init(memcg, parent&&  mem_cgroup_test_flag(parent,
> +	    MEMCG_USE_HIERARCHY) ? parent : NULL);
>
>   	if (parent)
>   		memcg->swappiness = mem_cgroup_swappiness(parent);
> @@ -5561,3 +5640,58 @@ static int __init enable_swap_account(char *s)
>   __setup("swapaccount=", enable_swap_account);
>
>   #endif
> +
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> +int
> +memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, long long delta)
> +{
> +	struct res_counter *fail_res;
> +	struct mem_cgroup *_memcg;
> +	int may_oom, ret;
> +
> +	may_oom = (gfp&  __GFP_WAIT)&&  (gfp&  __GFP_FS)&&
> +	    !(gfp&  __GFP_NORETRY);
> +
> +	ret = 0;
> +
> +	_memcg = memcg;
> +	if (memcg&&  !mem_cgroup_test_flag(memcg,
> +	    MEMCG_INDEPENDENT_KMEM_LIMIT)) {
> +		ret = __mem_cgroup_try_charge(NULL, gfp, delta / PAGE_SIZE,
> +		&_memcg, may_oom);
> +		if (ret == -ENOMEM)
> +			return ret;
> +	}
> +
> +	if (memcg&&  _memcg == memcg)
> +		ret = res_counter_charge(&memcg->kmem, delta,&fail_res);
> +
I don't really follow this "if (memcg && ...)" logic.

If you are planning to call this unconditionally from the slab code, I 
think we should at least exit early if !memcg.

Like this:

int may_oom, ret

if (mem_cgroup_disabled() || !memcg)
    return 0;

(And you may need the mem_cgroup_disabled() in your code anyway)

> +	return ret;
> +}
> +
> +void
> +memcg_uncharge_kmem(struct mem_cgroup *memcg, long long delta)
> +{
> +	if (memcg)
> +		res_counter_uncharge(&memcg->kmem, delta);
> +
> +	if (memcg&&  !mem_cgroup_test_flag(memcg, MEMCG_INDEPENDENT_KMEM_LIMIT))
> +		res_counter_uncharge(&memcg->res, delta);
> +}
mem_cgroup_disabled() here too?

(Actually, I just grep'd and noticed you wrap some code around it in
patch 7. It'd make more sense not to call this function at all when
memcg == NULL, then?)

mem_cgroup_disabled() goes on that function, before the rcu lock.
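
I.e., something like this at the top of mem_cgroup_get_kmem_cache(),
if I'm reading patch 7 right (sketch only):

struct kmem_cache *
mem_cgroup_get_kmem_cache(struct kmem_cache *cachep, gfp_t gfp)
{
        if (mem_cgroup_disabled())
                return cachep;

        /* ... the existing lookup, under rcu_read_lock(), goes here ... */
        return cachep;
}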

> +
> +static void
> +memcg_kmem_init(struct mem_cgroup *memcg, struct mem_cgroup *parent)
> +{
> +	struct res_counter *parent_res;
> +
> +	parent_res = NULL;
> +	if (parent&&  parent != root_mem_cgroup)
> +		parent_res =&parent->kmem;
> +	res_counter_init(&memcg->kmem, parent_res);
> +}
> +#else /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
> +static void
> +memcg_kmem_init(struct mem_cgroup *memcg, struct mem_cgroup *parent)
> +{
> +}
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 03/13] memcg: Uncharge all kmem when deleting a cgroup.
  2012-03-09 20:39 ` [PATCH v2 03/13] memcg: Uncharge all kmem when deleting a cgroup Suleiman Souhlal
@ 2012-03-11  8:19   ` Glauber Costa
  2012-03-13 23:16     ` Suleiman Souhlal
  2012-03-13  6:27   ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 45+ messages in thread
From: Glauber Costa @ 2012-03-11  8:19 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: cgroups, suleiman, kamezawa.hiroyu, penberg, cl, yinghan, hughd,
	gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel

On 03/10/2012 12:39 AM, Suleiman Souhlal wrote:
> Signed-off-by: Suleiman Souhlal<suleiman@google.com>
> ---
>   mm/memcontrol.c |   31 ++++++++++++++++++++++++++++++-
>   1 files changed, 30 insertions(+), 1 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e6fd558..6fbb438 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -382,6 +382,7 @@ static void mem_cgroup_get(struct mem_cgroup *memcg);
>   static void mem_cgroup_put(struct mem_cgroup *memcg);
>   static void memcg_kmem_init(struct mem_cgroup *memcg,
>       struct mem_cgroup *parent);
> +static void memcg_kmem_move(struct mem_cgroup *memcg);
>
>   static inline bool
>   mem_cgroup_test_flag(const struct mem_cgroup *memcg, enum memcg_flags flag)
> @@ -3700,6 +3701,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg, bool free_all)
>   	int ret;
>   	int node, zid, shrink;
>   	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> +	unsigned long usage;
>   	struct cgroup *cgrp = memcg->css.cgroup;
>
>   	css_get(&memcg->css);
> @@ -3719,6 +3721,8 @@ move_account:
>   		/* This is for making all *used* pages to be on LRU. */
>   		lru_add_drain_all();
>   		drain_all_stock_sync(memcg);
> +		if (!free_all)
> +			memcg_kmem_move(memcg);
Any reason we're not moving kmem charges when free_all is set as well?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 07/13] memcg: Slab accounting.
  2012-03-09 20:39 ` [PATCH v2 07/13] memcg: Slab accounting Suleiman Souhlal
@ 2012-03-11 10:25   ` Glauber Costa
  2012-03-13 22:50     ` Suleiman Souhlal
  0 siblings, 1 reply; 45+ messages in thread
From: Glauber Costa @ 2012-03-11 10:25 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: cgroups, suleiman, kamezawa.hiroyu, penberg, cl, yinghan, hughd,
	gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel, Pekka Enberg

On 03/10/2012 12:39 AM, Suleiman Souhlal wrote:
> Introduce per-cgroup kmem_caches for memcg slab accounting, that
> get created asynchronously the first time we do an allocation of
> that type in the cgroup.
> The cgroup cache gets used in subsequent allocations, and permits
> accounting of slab on a per-page basis.
>
> For a slab type to get accounted, the SLAB_MEMCG_ACCT flag has to be
> passed to kmem_cache_create().
>
> The per-cgroup kmem_caches get looked up at slab allocation time,
> in a MAX_KMEM_CACHE_TYPES-sized array in the memcg structure, based
> on the original kmem_cache's id, which gets allocated when the original
> cache gets created.
>
> Only allocations that can be attributed to a cgroup get charged.
>
> Each cgroup kmem_cache has a refcount that dictates the lifetime
> of the cache: We destroy a cgroup cache when its cgroup has been
> destroyed and there are no more active objects in the cache.
>
> Signed-off-by: Suleiman Souhlal<suleiman@google.com>
> ---
>   include/linux/memcontrol.h |   30 +++++-
>   include/linux/slab.h       |   41 ++++++
>   include/linux/slab_def.h   |   72 +++++++++++-
>   include/linux/slub_def.h   |    3 +
>   init/Kconfig               |    2 +-
>   mm/memcontrol.c            |  290 ++++++++++++++++++++++++++++++++++++++++++++
>   mm/slab.c                  |  248 ++++++++++++++++++++++++++++++++++---
>   7 files changed, 663 insertions(+), 23 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index b80de52..b6b8388 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -416,13 +416,41 @@ struct sock;
>   #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>   void sock_update_memcg(struct sock *sk);
>   void sock_release_memcg(struct sock *sk);
> -#else
> +struct kmem_cache *mem_cgroup_get_kmem_cache(struct kmem_cache *cachep,
> +    gfp_t gfp);
> +bool mem_cgroup_charge_slab(struct kmem_cache *cachep, gfp_t gfp, size_t size);
> +void mem_cgroup_uncharge_slab(struct kmem_cache *cachep, size_t size);
> +void mem_cgroup_flush_cache_create_queue(void);
> +void mem_cgroup_remove_child_kmem_cache(struct kmem_cache *cachep, int id);
> +#else /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
>   static inline void sock_update_memcg(struct sock *sk)
>   {
>   }
>   static inline void sock_release_memcg(struct sock *sk)
>   {
>   }
> +
> +static inline bool
> +mem_cgroup_charge_slab(struct kmem_cache *cachep, gfp_t gfp, size_t size)
> +{
> +	return true;
> +}
> +
> +static inline void
> +mem_cgroup_uncharge_slab(struct kmem_cache *cachep, size_t size)
> +{
> +}
> +
> +static inline struct kmem_cache *
> +mem_cgroup_get_kmem_cache(struct kmem_cache *cachep, gfp_t gfp)
> +{
> +	return cachep;
> +}
> +
> +static inline void
> +mem_cgroup_flush_cache_create_queue(void)
> +{
> +}
>   #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
>   #endif /* _LINUX_MEMCONTROL_H */
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 573c809..bc9f87f 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -21,6 +21,7 @@
>   #define SLAB_POISON		0x00000800UL	/* DEBUG: Poison objects */
>   #define SLAB_HWCACHE_ALIGN	0x00002000UL	/* Align objs on cache lines */
>   #define SLAB_CACHE_DMA		0x00004000UL	/* Use GFP_DMA memory */
> +#define SLAB_MEMCG_ACCT		0x00008000UL	/* Accounted by memcg */
>   #define SLAB_STORE_USER		0x00010000UL	/* DEBUG: Store the last owner for bug hunting */
>   #define SLAB_PANIC		0x00040000UL	/* Panic if kmem_cache_create() fails */
>   /*
> @@ -153,6 +154,21 @@ unsigned int kmem_cache_size(struct kmem_cache *);
>   #define ARCH_SLAB_MINALIGN __alignof__(unsigned long long)
>   #endif
>
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> +struct mem_cgroup_cache_params {
> +	struct mem_cgroup *memcg;
> +	int id;
> +	int obj_size;
> +
> +	/* Original cache parameters, used when creating a memcg cache */
> +	size_t orig_align;
> +	unsigned long orig_flags;
> +	struct kmem_cache *orig_cache;
> +
> +	struct list_head destroyed_list; /* Used when deleting cpuset cache */
> +};
> +#endif
> +
>   /*
>    * Common kmalloc functions provided by all allocators
>    */
> @@ -353,4 +369,29 @@ static inline void *kzalloc_node(size_t size, gfp_t flags, int node)
>
>   void __init kmem_cache_init_late(void);
>
> +#if defined(CONFIG_CGROUP_MEM_RES_CTLR_KMEM)&&  defined(CONFIG_SLAB)
> +
> +#define MAX_KMEM_CACHE_TYPES 400
> +
> +struct kmem_cache *kmem_cache_create_memcg(struct kmem_cache *cachep,
> +    char *name);
> +void kmem_cache_drop_ref(struct kmem_cache *cachep);
> +
> +#else /* !CONFIG_CGROUP_MEM_RES_CTLR_KMEM || !CONFIG_SLAB */


> +#define MAX_KMEM_CACHE_TYPES 0
> +
> +static inline struct kmem_cache *
> +kmem_cache_create_memcg(struct kmem_cache *cachep,
> +    char *name)
> +{
> +	return NULL;
> +}
> +
> +static inline void
> +kmem_cache_drop_ref(struct kmem_cache *cachep)
> +{
> +}
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM&&  CONFIG_SLAB */
> +
>   #endif	/* _LINUX_SLAB_H */
> diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
> index 25f9a6a..248b8a9 100644
> --- a/include/linux/slab_def.h
> +++ b/include/linux/slab_def.h
> @@ -51,7 +51,7 @@ struct kmem_cache {
>   	void (*ctor)(void *obj);
>
>   /* 4) cache creation/removal */
> -	const char *name;
> +	char *name;
>   	struct list_head next;
>
>   /* 5) statistics */
> @@ -81,6 +81,12 @@ struct kmem_cache {
>   	int obj_size;
>   #endif /* CONFIG_DEBUG_SLAB */
>
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> +	struct mem_cgroup_cache_params memcg_params;
> +
Can't we have memcg pointer + id here, and then get
the rest of this data from memcg->slab_params[id] ?

I don't think we'll ever access this on hot paths - when we do access
it, we're allocating pages anyway - and this way we get less bloat in
the slab structure.

Sure, we're just moving the bloat away somewhere else, since this data
has to exist anyway, but at least it's not here bothering the slab when
memcg is disabled...

But I don't really have the final word here, this is just my preference
- could any of the slab maintainers comment on which of these two
approaches you'd prefer?
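
Something along these lines is what I have in mind (just a sketch, and
names like memcg_id / memcg_cache_params are made up):

	/* in struct kmem_cache: only what the allocator itself needs */
	#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
		struct mem_cgroup *memcg;	/* NULL for root caches */
		int memcg_id;			/* index into the memcg array */
	#endif

	/* in memcontrol.c: the bulky creation parameters, one per cache id */
	struct memcg_cache_params {
		int obj_size;
		size_t orig_align;
		unsigned long orig_flags;
		struct kmem_cache *orig_cache;
		struct list_head destroyed_list;
	};

	/* looked up as memcg->slab_params[cachep->memcg_id] when needed */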

> +	atomic_t refcnt;
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
> +
>   /* 6) per-cpu/per-node data, touched during every alloc/free */
>   	/*
>   	 * We put array[] at the end of kmem_cache, because we want to size
> @@ -218,4 +224,68 @@ found:
>
>   #endif	/* CONFIG_NUMA */
>
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> +
> +void kmem_cache_drop_ref(struct kmem_cache *cachep);
> +
> +static inline void
> +kmem_cache_get_ref(struct kmem_cache *cachep)
> +{
> +	if (cachep->memcg_params.id == -1&&
> +	    unlikely(!atomic_add_unless(&cachep->refcnt, 1, 0)))
> +		BUG();
> +}
> +
> +static inline void
> +mem_cgroup_put_kmem_cache(struct kmem_cache *cachep)
> +{
> +	rcu_read_unlock();
> +}
> +
> +static inline void
> +mem_cgroup_kmem_cache_prepare_sleep(struct kmem_cache *cachep)
> +{
> +	/*
> +	 * Make sure the cache doesn't get freed while we have interrupts
> +	 * enabled.
> +	 */
> +	kmem_cache_get_ref(cachep);
> +	rcu_read_unlock();
> +}

Is this really needed? After this function call in slab.c, the slab
code itself accesses cachep a thousand times. If it could be freed,
wouldn't it already explode today for other reasons?
Am I missing something here?

> +static inline void
> +mem_cgroup_kmem_cache_finish_sleep(struct kmem_cache *cachep)
> +{
> +	rcu_read_lock();
> +	kmem_cache_drop_ref(cachep);
> +}
> +
> +#else /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
> +
> +static inline void
> +kmem_cache_get_ref(struct kmem_cache *cachep)
> +{
> +}
> +
> +static inline void
> +kmem_cache_drop_ref(struct kmem_cache *cachep)
> +{
> +}
> +
> +static inline void
> +mem_cgroup_put_kmem_cache(struct kmem_cache *cachep)
> +{
> +}
> +
> +static inline void
> +mem_cgroup_kmem_cache_prepare_sleep(struct kmem_cache *cachep)
> +{
> +}
> +
> +static inline void
> +mem_cgroup_kmem_cache_finish_sleep(struct kmem_cache *cachep)
> +{
> +}
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
> +
>   #endif	/* _LINUX_SLAB_DEF_H */
> diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
> index 5911d81..09b602d 100644
> --- a/include/linux/slub_def.h
> +++ b/include/linux/slub_def.h
> @@ -99,6 +99,9 @@ struct kmem_cache {
>   #ifdef CONFIG_SYSFS
>   	struct kobject kobj;	/* For sysfs */
>   #endif
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> +	struct mem_cgroup_cache_params memcg_params;
> +#endif
>
>   #ifdef CONFIG_NUMA
>   	/*
> diff --git a/init/Kconfig b/init/Kconfig
> index 3f42cd6..e7eb652 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -705,7 +705,7 @@ config CGROUP_MEM_RES_CTLR_SWAP_ENABLED
>   	  then swapaccount=0 does the trick).
>   config CGROUP_MEM_RES_CTLR_KMEM
>   	bool "Memory Resource Controller Kernel Memory accounting (EXPERIMENTAL)"
> -	depends on CGROUP_MEM_RES_CTLR&&  EXPERIMENTAL
> +	depends on CGROUP_MEM_RES_CTLR&&  EXPERIMENTAL&&  !SLOB

Orthogonal question: will we ever want this for SLOB?

>   	default n
>   	help
>   	  The Kernel Memory extension for Memory Resource Controller can limit
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2576a2b..a5593cf 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -302,6 +302,11 @@ struct mem_cgroup {
>   #ifdef CONFIG_INET
>   	struct tcp_memcontrol tcp_mem;
>   #endif
> +
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> +	/* Slab accounting */
> +	struct kmem_cache *slabs[MAX_KMEM_CACHE_TYPES];
> +#endif
>   };
>
>   /* Stuffs for move charges at task migration. */
> @@ -5692,6 +5697,287 @@ memcg_uncharge_kmem(struct mem_cgroup *memcg, long long delta)
>   		res_counter_uncharge(&memcg->res, delta);
>   }
>
> +static struct kmem_cache *
> +memcg_create_kmem_cache(struct mem_cgroup *memcg, struct kmem_cache *cachep)
> +{
> +	struct kmem_cache *new_cachep;
> +	struct dentry *dentry;
> +	char *name;
> +	int idx;
> +
> +	idx = cachep->memcg_params.id;
> +
> +	dentry = memcg->css.cgroup->dentry;
> +	BUG_ON(dentry == NULL);
> +
> +	/* Preallocate the space for "dead" at the end */
> +	name = kasprintf(GFP_KERNEL, "%s(%d:%s)dead",
> +	    cachep->name, css_id(&memcg->css), dentry->d_name.name);
> +	if (name == NULL)
> +		return cachep;
> +	/* Remove "dead" */
> +	name[strlen(name) - 4] = '\0';
> +
> +	new_cachep = kmem_cache_create_memcg(cachep, name);
> +
> +	/*
> +	 * Another CPU is creating the same cache?
> +	 * We'll use it next time.
> +	 */
This comment is a bit misleading. Is it really the only reason
it can fail?

The impression I got is that it can also fail under the normal 
conditions in which kmem_cache_create() fails.

> +	if (new_cachep == NULL) {
> +		kfree(name);
> +		return cachep;
> +	}
> +
> +	new_cachep->memcg_params.memcg = memcg;
> +
> +	/*
> +	 * Make sure someone else hasn't created the new cache in the
> +	 * meantime.
> +	 * This should behave as a write barrier, so we should be fine
> +	 * with RCU.
> +	 */
> +	if (cmpxchg(&memcg->slabs[idx], NULL, new_cachep) != NULL) {
> +		kmem_cache_destroy(new_cachep);
> +		return cachep;
> +	}
> +
> +	return new_cachep;
> +}
> +
> +struct create_work {
> +	struct mem_cgroup *memcg;
> +	struct kmem_cache *cachep;
> +	struct list_head list;
> +};
> +
> +static DEFINE_SPINLOCK(create_queue_lock);
> +static LIST_HEAD(create_queue);

If we move memcg_params to somewhere inside memcontrol.c, we can put 
this worker inside that struct, and save the mallocs here.

We can have it separated from the main mem_cgroup, like the thresholds.
Another way is to have a zero-sized array living at the end of struct 
mem_cgroup. Then we allocate sizeof(*memcg) for the root mem cgroup,
and sizeof(*memcg) + sizeof(max_cache_size) for the rest.
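
Rough sketch of the zero-sized array version (illustrative only):

	/* at the very end of struct mem_cgroup: */
	struct kmem_cache *slabs[0];	/* must stay the last member */

	/* and at allocation time, roughly: */
	size = sizeof(struct mem_cgroup);
	if (parent)	/* the root cgroup keeps the small version */
		size += MAX_KMEM_CACHE_TYPES * sizeof(struct kmem_cache *);
	memcg = kzalloc(size, GFP_KERNEL);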

> +/*
> + * Flush the queue of kmem_caches to create, because we're creating a cgroup.
> + *
> + * We might end up flushing other cgroups' creation requests as well, but
> + * they will just get queued again next time someone tries to make a slab
> + * allocation for them.
> + */
> +void
> +mem_cgroup_flush_cache_create_queue(void)
> +{
> +	struct create_work *cw, *tmp;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&create_queue_lock, flags);
> +	list_for_each_entry_safe(cw, tmp,&create_queue, list) {
> +		list_del(&cw->list);
> +		kfree(cw);
> +	}
> +	spin_unlock_irqrestore(&create_queue_lock, flags);
> +}
> +
> +static void
> +memcg_create_cache_work_func(struct work_struct *w)
> +{
> +	struct kmem_cache *cachep;
> +	struct create_work *cw;
> +
> +	spin_lock_irq(&create_queue_lock);
> +	while (!list_empty(&create_queue)) {
> +		cw = list_first_entry(&create_queue, struct create_work, list);
> +		list_del(&cw->list);
> +		spin_unlock_irq(&create_queue_lock);
> +		cachep = memcg_create_kmem_cache(cw->memcg, cw->cachep);
> +		if (cachep == NULL&&  printk_ratelimit())
> +			printk(KERN_ALERT "%s: Couldn't create memcg-cache for"
> +			    " %s memcg %s\n", __func__, cw->cachep->name,
> +			    cw->memcg->css.cgroup->dentry->d_name.name);
> +		/* Drop the reference gotten when we enqueued. */
> +		css_put(&cw->memcg->css);
> +		kfree(cw);
> +		spin_lock_irq(&create_queue_lock);
> +	}
> +	spin_unlock_irq(&create_queue_lock);
> +}
> +
> +static DECLARE_WORK(memcg_create_cache_work, memcg_create_cache_work_func);
> +
> +/*
> + * Enqueue the creation of a per-memcg kmem_cache.
> + * Called with rcu_read_lock.
> + */
> +static void
> +memcg_create_cache_enqueue(struct mem_cgroup *memcg, struct kmem_cache *cachep)
> +{
> +	struct create_work *cw;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&create_queue_lock, flags);
If we can sleep, why not just create the cache now?

Maybe it would be better to split this in two: create the cache right
away when possible, and defer to a worker only when it's not. Then we
wouldn't pay the deferral cost in the common case (rough sketch below).
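
Rough sketch, reusing the patch's names (and ignoring for a moment
that mem_cgroup_get_kmem_cache() is holding rcu_read_lock at the point
where it would call this):

	if ((gfp & __GFP_WAIT) && !in_interrupt()) {
		/* we may sleep: create the cache right away */
		memcg_create_kmem_cache(memcg, cachep);
		return;
	}

	/* atomic context: defer to the workqueue as the patch does */
	memcg_create_cache_enqueue(memcg, cachep);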

> +	list_for_each_entry(cw,&create_queue, list) {
> +		if (cw->memcg == memcg&&  cw->cachep == cachep) {
> +			spin_unlock_irqrestore(&create_queue_lock, flags);
> +			return;
> +		}
> +	}
> +	spin_unlock_irqrestore(&create_queue_lock, flags);
> +
> +	/* The corresponding put will be done in the workqueue. */
> +	if (!css_tryget(&memcg->css))
> +		return;
> +	cw = kmalloc(sizeof(struct create_work), GFP_NOWAIT);
> +	if (cw == NULL) {
> +		css_put(&memcg->css);
> +		return;
> +	}
> +
> +	cw->memcg = memcg;
> +	cw->cachep = cachep;
> +	spin_lock_irqsave(&create_queue_lock, flags);
> +	list_add_tail(&cw->list,&create_queue);
> +	spin_unlock_irqrestore(&create_queue_lock, flags);
> +
> +	schedule_work(&memcg_create_cache_work);
> +}
> +
> +/*
> + * Return the kmem_cache we're supposed to use for a slab allocation.
> + * If we are in interrupt context or otherwise have an allocation that
> + * can't fail, we return the original cache.
> + * Otherwise, we will try to use the current memcg's version of the cache.
> + *
> + * If the cache does not exist yet, if we are the first user of it,
> + * we either create it immediately, if possible, or create it asynchronously
> + * in a workqueue.
> + * In the latter case, we will let the current allocation go through with
> + * the original cache.
> + *
> + * This function returns with rcu_read_lock() held.
> + */
> +struct kmem_cache *
> +mem_cgroup_get_kmem_cache(struct kmem_cache *cachep, gfp_t gfp)
> +{
> +	struct mem_cgroup *memcg;
> +	int idx;
> +
> +	rcu_read_lock();
> +
> +	if (in_interrupt())
> +		return cachep;
> +	if (current == NULL)
> +		return cachep;
> +	if (!(cachep->flags&  SLAB_MEMCG_ACCT))
> +		return cachep;
> +
> +	gfp |= kmem_cache_gfp_flags(cachep);
> +	if (gfp&  __GFP_NOFAIL)
> +		return cachep;
> +
> +	if (cachep->memcg_params.memcg)
> +		return cachep;
> +
> +	memcg = mem_cgroup_from_task(current);
> +	idx = cachep->memcg_params.id;
> +
> +	if (memcg == NULL || memcg == root_mem_cgroup)
nitpick: mem_cgroup_is_root(memcg)

It would also be better to have this test done as early as possible,
to reduce the penalty for the root memcg case.

We also need to test mem_cgroup_disabled() and return cachep in
this case.
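
Putting those together, the top of the function could look like this
(sketch, reusing the patch's names and keeping the "returns with
rcu_read_lock() held" convention):

struct kmem_cache *
mem_cgroup_get_kmem_cache(struct kmem_cache *cachep, gfp_t gfp)
{
	struct mem_cgroup *memcg;
	int idx;

	rcu_read_lock();

	/* Cheapest bail-outs first, so root/disabled pays almost nothing. */
	if (mem_cgroup_disabled() || in_interrupt() || current == NULL)
		return cachep;
	if (!(cachep->flags & SLAB_MEMCG_ACCT))
		return cachep;
	if (cachep->memcg_params.memcg)
		return cachep;

	memcg = mem_cgroup_from_task(current);
	if (memcg == NULL || mem_cgroup_is_root(memcg))
		return cachep;

	gfp |= kmem_cache_gfp_flags(cachep);
	if (gfp & __GFP_NOFAIL)
		return cachep;

	idx = cachep->memcg_params.id;
	VM_BUG_ON(idx == -1);

	if (unlikely(rcu_access_pointer(memcg->slabs[idx]) == NULL)) {
		memcg_create_cache_enqueue(memcg, cachep);
		return cachep;
	}

	return rcu_dereference(memcg->slabs[idx]);
}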

> +		return cachep;
> +
> +	VM_BUG_ON(idx == -1);
> +
> +	if (rcu_access_pointer(memcg->slabs[idx]) == NULL) {
if (unlikely(rcu_access_pointer(...

> +		memcg_create_cache_enqueue(memcg, cachep);
> +		return cachep;
> +	}
> +
> +	return rcu_dereference(memcg->slabs[idx]);
> +}
> +
> +void
> +mem_cgroup_remove_child_kmem_cache(struct kmem_cache *cachep, int id)
> +{
> +	rcu_assign_pointer(cachep->memcg_params.memcg->slabs[id], NULL);
> +}
> +
> +bool
> +mem_cgroup_charge_slab(struct kmem_cache *cachep, gfp_t gfp, size_t size)
> +{
> +	struct mem_cgroup *memcg;
> +	int ret;
> +
> +	rcu_read_lock();
> +	memcg = cachep->memcg_params.memcg;

         if (!memcg) {
             rcu_read_unlock();
             return 0;
         }

> +	if (memcg&&  !css_tryget(&memcg->css))
> +		memcg = NULL;
> +	rcu_read_unlock();
> +
         if (!memcg)
             return...
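
I.e., the whole thing could look something like this (sketch only, but
it keeps the patch's semantics: no memcg means there is nothing to
charge, so we return success):

bool
mem_cgroup_charge_slab(struct kmem_cache *cachep, gfp_t gfp, size_t size)
{
	struct mem_cgroup *memcg;
	int ret;

	rcu_read_lock();
	memcg = cachep->memcg_params.memcg;
	if (memcg == NULL || !css_tryget(&memcg->css)) {
		rcu_read_unlock();
		return true;
	}
	rcu_read_unlock();

	ret = memcg_charge_kmem(memcg, gfp, size);
	css_put(&memcg->css);

	return ret == 0;
}

That way we never fall through to memcg_charge_kmem(NULL, ...) at all.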

> +	ret = memcg_charge_kmem(memcg, gfp, size);
> +	if (memcg)
> +		css_put(&memcg->css);
> +
> +	return ret == 0;
> +}
> +
> +void
> +mem_cgroup_uncharge_slab(struct kmem_cache *cachep, size_t size)
> +{
> +	struct mem_cgroup *memcg;
> +
> +	rcu_read_lock();
> +	memcg = cachep->memcg_params.memcg;
> +
> +	if (memcg&&  !css_tryget(&memcg->css))
> +		memcg = NULL;
> +	rcu_read_unlock();
> +
> +	memcg_uncharge_kmem(memcg, size);
> +	if (memcg)
> +		css_put(&memcg->css);
> +}
> +
> +static void
> +memcg_slab_init(struct mem_cgroup *memcg)
> +{
> +	int i;
> +
> +	for (i = 0; i<  MAX_KMEM_CACHE_TYPES; i++)
> +		rcu_assign_pointer(memcg->slabs[i], NULL);
> +}
> +
> +/*
> + * Mark all of this memcg's kmem_caches as dead and move them to the
> + * root.
> + *
> + * Assumes that the callers are synchronized (only one thread should be
> + * moving a cgroup's slab at the same time).
> + */
> +static void
> +memcg_slab_move(struct mem_cgroup *memcg)
> +{
> +	struct kmem_cache *cachep;
> +	int i;
> +
> +	mem_cgroup_flush_cache_create_queue();
> +
> +	for (i = 0; i<  MAX_KMEM_CACHE_TYPES; i++) {
> +		cachep = rcu_access_pointer(memcg->slabs[i]);
> +		if (cachep != NULL) {
> +			rcu_assign_pointer(memcg->slabs[i], NULL);
> +			cachep->memcg_params.memcg = NULL;
> +
> +			/* The space for this is already allocated */
> +			strcat((char *)cachep->name, "dead");
> +
> +			/*
> +			 * Drop the initial reference on the cache.
> +			 * This means that from this point on, the cache will
> +			 * get destroyed when it no longer has active objects.
> +			 */
> +			kmem_cache_drop_ref(cachep);
> +		}
> +	}
> +}
> +
>   static void
>   memcg_kmem_init(struct mem_cgroup *memcg, struct mem_cgroup *parent)
>   {
> @@ -5701,6 +5987,8 @@ memcg_kmem_init(struct mem_cgroup *memcg, struct mem_cgroup *parent)
>   	if (parent&&  parent != root_mem_cgroup)
>   		parent_res =&parent->kmem;
>   	res_counter_init(&memcg->kmem, parent_res);
> +
> +	memcg_slab_init(memcg);
>   }
>
>   static void
> @@ -5709,6 +5997,8 @@ memcg_kmem_move(struct mem_cgroup *memcg)
>   	unsigned long flags;
>   	long kmem;
>
> +	memcg_slab_move(memcg);
> +
>   	spin_lock_irqsave(&memcg->kmem.lock, flags);
>   	kmem = memcg->kmem.usage;
>   	res_counter_uncharge_locked(&memcg->kmem, kmem);
> diff --git a/mm/slab.c b/mm/slab.c
> index f0bd785..95b024c 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -159,13 +159,15 @@
>   			 SLAB_STORE_USER | \
>   			 SLAB_RECLAIM_ACCOUNT | SLAB_PANIC | \
>   			 SLAB_DESTROY_BY_RCU | SLAB_MEM_SPREAD | \
> -			 SLAB_DEBUG_OBJECTS | SLAB_NOLEAKTRACE | SLAB_NOTRACK)
> +			 SLAB_DEBUG_OBJECTS | SLAB_NOLEAKTRACE | \
> +			 SLAB_NOTRACK | SLAB_MEMCG_ACCT)
>   #else
>   # define CREATE_MASK	(SLAB_HWCACHE_ALIGN | \
>   			 SLAB_CACHE_DMA | \
>   			 SLAB_RECLAIM_ACCOUNT | SLAB_PANIC | \
>   			 SLAB_DESTROY_BY_RCU | SLAB_MEM_SPREAD | \
> -			 SLAB_DEBUG_OBJECTS | SLAB_NOLEAKTRACE | SLAB_NOTRACK)
> +			 SLAB_DEBUG_OBJECTS | SLAB_NOLEAKTRACE | \
> +			 SLAB_NOTRACK | SLAB_MEMCG_ACCT)
>   #endif
>
>   /*
> @@ -301,6 +303,8 @@ static void free_block(struct kmem_cache *cachep, void **objpp, int len,
>   			int node);
>   static int enable_cpucache(struct kmem_cache *cachep, gfp_t gfp);
>   static void cache_reap(struct work_struct *unused);
> +static int do_tune_cpucache(struct kmem_cache *cachep, int limit,
> +    int batchcount, int shared, gfp_t gfp);
>
>   /*
>    * This function must be completely optimized away if a constant is passed to
> @@ -326,6 +330,11 @@ static __always_inline int index_of(const size_t size)
>   	return 0;
>   }
>
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> +/* Bitmap used for allocating the cache id numbers. */
> +static DECLARE_BITMAP(cache_types, MAX_KMEM_CACHE_TYPES);
> +#endif
> +
>   static int slab_early_init = 1;
>
>   #define INDEX_AC index_of(sizeof(struct arraycache_init))
> @@ -1756,17 +1765,23 @@ static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
>   	if (cachep->flags&  SLAB_RECLAIM_ACCOUNT)
>   		flags |= __GFP_RECLAIMABLE;
>
> +	nr_pages = (1<<  cachep->gfporder);
> +	if (!mem_cgroup_charge_slab(cachep, flags, nr_pages * PAGE_SIZE))
> +		return NULL;
> +
>   	page = alloc_pages_exact_node(nodeid, flags | __GFP_NOTRACK, cachep->gfporder);
> -	if (!page)
> +	if (!page) {
> +		mem_cgroup_uncharge_slab(cachep, nr_pages * PAGE_SIZE);
>   		return NULL;
> +	}


Can't the following happen:

  *) mem_cgroup_charge_slab() is the first one to touch the slab.
     Therefore, this first one is billed to root.
  *) A slab is queued for creation.
  *) alloc_pages sleep.
  *) our workers run, and create the cache, therefore filling
    cachep->memcg_params.memcg
 *) alloc_pages still can't allocate.
 *) uncharge tries to uncharge from cachep->memcg_params.memcg,
    which doesn't have any charges...

Unless you have a strong opposition to this, to avoid this kind of
corner case, we could do what I was doing in the slub:
Allocate the page first, and then account it.
(freeing the page if it fails).

I know it is not the way it is done for the user pages, but I believe it 
to be better suited for the slab.
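
A sketch of that ordering inside kmem_getpages(), with the same calls
the patch already uses:

	nr_pages = (1 << cachep->gfporder);

	page = alloc_pages_exact_node(nodeid, flags | __GFP_NOTRACK,
	    cachep->gfporder);
	if (!page)
		return NULL;

	if (!mem_cgroup_charge_slab(cachep, flags, nr_pages * PAGE_SIZE)) {
		__free_pages(page, cachep->gfporder);
		return NULL;
	}

That way a failed page allocation never needs to uncharge at all, so
the corner case above simply can't happen.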

> -	nr_pages = (1<<  cachep->gfporder);
>   	if (cachep->flags&  SLAB_RECLAIM_ACCOUNT)
>   		add_zone_page_state(page_zone(page),
>   			NR_SLAB_RECLAIMABLE, nr_pages);
>   	else
>   		add_zone_page_state(page_zone(page),
>   			NR_SLAB_UNRECLAIMABLE, nr_pages);
> +	kmem_cache_get_ref(cachep);
>   	for (i = 0; i<  nr_pages; i++)
>   		__SetPageSlab(page + i);
>
> @@ -1799,6 +1814,8 @@ static void kmem_freepages(struct kmem_cache *cachep, void *addr)
>   	else
>   		sub_zone_page_state(page_zone(page),
>   				NR_SLAB_UNRECLAIMABLE, nr_freed);
> +	mem_cgroup_uncharge_slab(cachep, i * PAGE_SIZE);
> +	kmem_cache_drop_ref(cachep);
>   	while (i--) {
>   		BUG_ON(!PageSlab(page));
>   		__ClearPageSlab(page);
> @@ -2224,14 +2241,17 @@ static int __init_refok setup_cpu_cache(struct kmem_cache *cachep, gfp_t gfp)
>    * cacheline.  This can be beneficial if you're counting cycles as closely
>    * as davem.
>    */
> -struct kmem_cache *
> -kmem_cache_create (const char *name, size_t size, size_t align,
> -	unsigned long flags, void (*ctor)(void *))
> +static struct kmem_cache *
> +__kmem_cache_create(const char *name, size_t size, size_t align,
> +    unsigned long flags, void (*ctor)(void *), bool memcg)
>   {
> -	size_t left_over, slab_size, ralign;
> +	size_t left_over, orig_align, ralign, slab_size;
>   	struct kmem_cache *cachep = NULL, *pc;
> +	unsigned long orig_flags;
>   	gfp_t gfp;
>
> +	orig_align = align;
> +	orig_flags = flags;
>   	/*
>   	 * Sanity checks... these are all serious usage bugs.
>   	 */
> @@ -2248,7 +2268,6 @@ kmem_cache_create (const char *name, size_t size, size_t align,
>   	 */
>   	if (slab_is_available()) {
>   		get_online_cpus();
> -		mutex_lock(&cache_chain_mutex);
>   	}
>
>   	list_for_each_entry(pc,&cache_chain, next) {
> @@ -2269,10 +2288,12 @@ kmem_cache_create (const char *name, size_t size, size_t align,
>   		}
>
>   		if (!strcmp(pc->name, name)) {
> -			printk(KERN_ERR
> -			       "kmem_cache_create: duplicate cache %s\n", name);
> -			dump_stack();
> -			goto oops;
> +			if (!memcg) {
> +				printk(KERN_ERR "kmem_cache_create: duplicate"
> +				    " cache %s\n", name);
> +				dump_stack();
> +				goto oops;
> +			}
Why? Since we are appending the memcg name at the end anyway, duplicates
still aren't expected.

>   		}
>   	}
>
> @@ -2369,6 +2390,13 @@ kmem_cache_create (const char *name, size_t size, size_t align,
>   		goto oops;
>
>   	cachep->nodelists = (struct kmem_list3 **)&cachep->array[nr_cpu_ids];
> +
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> +	cachep->memcg_params.obj_size = size;
> +	cachep->memcg_params.orig_align = orig_align;
> +	cachep->memcg_params.orig_flags = orig_flags;
> +#endif
> +
>   #if DEBUG
>   	cachep->obj_size = size;
>
> @@ -2477,7 +2505,23 @@ kmem_cache_create (const char *name, size_t size, size_t align,
>   		BUG_ON(ZERO_OR_NULL_PTR(cachep->slabp_cache));
>   	}
>   	cachep->ctor = ctor;
> -	cachep->name = name;
> +	cachep->name = (char *)name;
> +
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> +	cachep->memcg_params.orig_cache = NULL;
> +	atomic_set(&cachep->refcnt, 1);
> +	INIT_LIST_HEAD(&cachep->memcg_params.destroyed_list);
> +
> +	if (!memcg) {
> +		int id;
> +
> +		id = find_first_zero_bit(cache_types, MAX_KMEM_CACHE_TYPES);
> +		BUG_ON(id<  0 || id>= MAX_KMEM_CACHE_TYPES);
> +		__set_bit(id, cache_types);
> +		cachep->memcg_params.id = id;
> +	} else
> +		cachep->memcg_params.id = -1;
> +#endif
>
>   	if (setup_cpu_cache(cachep, gfp)) {
>   		__kmem_cache_destroy(cachep);
> @@ -2502,13 +2546,53 @@ oops:
>   		panic("kmem_cache_create(): failed to create slab `%s'\n",
>   		      name);
>   	if (slab_is_available()) {
> -		mutex_unlock(&cache_chain_mutex);
>   		put_online_cpus();
>   	}
>   	return cachep;
>   }
> +
> +struct kmem_cache *
> +kmem_cache_create(const char *name, size_t size, size_t align,
> +    unsigned long flags, void (*ctor)(void *))
> +{
> +	struct kmem_cache *cachep;
> +
> +	mutex_lock(&cache_chain_mutex);
> +	cachep = __kmem_cache_create(name, size, align, flags, ctor, false);
> +	mutex_unlock(&cache_chain_mutex);
> +
> +	return cachep;
> +}
>   EXPORT_SYMBOL(kmem_cache_create);
>
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> +struct kmem_cache *
> +kmem_cache_create_memcg(struct kmem_cache *cachep, char *name)
> +{
> +	struct kmem_cache *new;
> +	int flags;
> +
> +	flags = cachep->memcg_params.orig_flags&  ~SLAB_PANIC;
> +	mutex_lock(&cache_chain_mutex);
> +	new = __kmem_cache_create(name, cachep->memcg_params.obj_size,
> +	    cachep->memcg_params.orig_align, flags, cachep->ctor, 1);
> +	if (new == NULL) {
> +		mutex_unlock(&cache_chain_mutex);
> +		return NULL;
> +	}
> +	new->memcg_params.orig_cache = cachep;
> +
> +	if ((cachep->limit != new->limit) ||
> +	    (cachep->batchcount != new->batchcount) ||
> +	    (cachep->shared != new->shared))
> +		do_tune_cpucache(new, cachep->limit, cachep->batchcount,
> +		    cachep->shared, GFP_KERNEL);
> +	mutex_unlock(&cache_chain_mutex);
> +
> +	return new;
> +}
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
> +
>   #if DEBUG
>   static void check_irq_off(void)
>   {
> @@ -2703,12 +2787,74 @@ void kmem_cache_destroy(struct kmem_cache *cachep)
>   	if (unlikely(cachep->flags&  SLAB_DESTROY_BY_RCU))
>   		rcu_barrier();
>
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> +	/* Not a memcg cache */
> +	if (cachep->memcg_params.id != -1) {
> +		__clear_bit(cachep->memcg_params.id, cache_types);
> +		mem_cgroup_flush_cache_create_queue();
> +	}
> +#endif

This will clear the id when a leaf cache is destroyed. It seems it is 
not what we want, right? We want this id to be cleared only when
the parent cache is gone.

>   	__kmem_cache_destroy(cachep);
>   	mutex_unlock(&cache_chain_mutex);
>   	put_online_cpus();
>   }
>   EXPORT_SYMBOL(kmem_cache_destroy);
>
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> +static DEFINE_SPINLOCK(destroy_lock);
> +static LIST_HEAD(destroyed_caches);
> +
> +static void
> +kmem_cache_destroy_work_func(struct work_struct *w)
> +{
> +	struct kmem_cache *cachep;
> +	char *name;
> +
> +	spin_lock_irq(&destroy_lock);
> +	while (!list_empty(&destroyed_caches)) {
> +		cachep = container_of(list_first_entry(&destroyed_caches,
> +		    struct mem_cgroup_cache_params, destroyed_list), struct
> +		    kmem_cache, memcg_params);
> +		name = (char *)cachep->name;
> +		list_del(&cachep->memcg_params.destroyed_list);
> +		spin_unlock_irq(&destroy_lock);
> +		synchronize_rcu();
> +		kmem_cache_destroy(cachep);
                 ^^^^^^
                 will destroy the id.

> +		kfree(name);
> +		spin_lock_irq(&destroy_lock);
> +	}
> +	spin_unlock_irq(&destroy_lock);
> +}
> +
> +static DECLARE_WORK(kmem_cache_destroy_work, kmem_cache_destroy_work_func);
> +
> +static void
> +kmem_cache_destroy_memcg(struct kmem_cache *cachep)
> +{
> +	unsigned long flags;
> +
> +	BUG_ON(cachep->memcg_params.id != -1);
> +
> +	/*
> +	 * We have to defer the actual destroying to a workqueue, because
> +	 * we might currently be in a context that cannot sleep.
> +	 */
> +	spin_lock_irqsave(&destroy_lock, flags);
> +	list_add(&cachep->memcg_params.destroyed_list,&destroyed_caches);
> +	spin_unlock_irqrestore(&destroy_lock, flags);
> +
> +	schedule_work(&kmem_cache_destroy_work);
> +}
> +
> +void
> +kmem_cache_drop_ref(struct kmem_cache *cachep)
> +{
> +	if (cachep->memcg_params.id == -1&&
> +	    unlikely(atomic_dec_and_test(&cachep->refcnt)))
> +		kmem_cache_destroy_memcg(cachep);
> +}
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
> +
>   /*
>    * Get the memory for a slab management obj.
>    * For a slab cache when the slab descriptor is off-slab, slab descriptors
> @@ -2908,8 +3054,10 @@ static int cache_grow(struct kmem_cache *cachep,
>
>   	offset *= cachep->colour_off;
>
> -	if (local_flags&  __GFP_WAIT)
> +	if (local_flags&  __GFP_WAIT) {
>   		local_irq_enable();
> +		mem_cgroup_kmem_cache_prepare_sleep(cachep);
> +	}
>
>   	/*
>   	 * The test for missing atomic flag is performed here, rather than
> @@ -2938,8 +3086,10 @@ static int cache_grow(struct kmem_cache *cachep,
>
>   	cache_init_objs(cachep, slabp);
>
> -	if (local_flags&  __GFP_WAIT)
> +	if (local_flags&  __GFP_WAIT) {
>   		local_irq_disable();
> +		mem_cgroup_kmem_cache_finish_sleep(cachep);
> +	}
>   	check_irq_off();
>   	spin_lock(&l3->list_lock);
>
> @@ -2952,8 +3102,10 @@ static int cache_grow(struct kmem_cache *cachep,
>   opps1:
>   	kmem_freepages(cachep, objp);
>   failed:
> -	if (local_flags&  __GFP_WAIT)
> +	if (local_flags&  __GFP_WAIT) {
>   		local_irq_disable();
> +		mem_cgroup_kmem_cache_finish_sleep(cachep);
> +	}
>   	return 0;
>   }
>
> @@ -3712,10 +3864,14 @@ static inline void __cache_free(struct kmem_cache *cachep, void *objp,
>    */
>   void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
>   {
> -	void *ret = __cache_alloc(cachep, flags, __builtin_return_address(0));
> +	void *ret;
> +
> +	cachep = mem_cgroup_get_kmem_cache(cachep, flags);
> +	ret = __cache_alloc(cachep, flags, __builtin_return_address(0));
>
>   	trace_kmem_cache_alloc(_RET_IP_, ret,
>   			       obj_size(cachep), cachep->buffer_size, flags);
> +	mem_cgroup_put_kmem_cache(cachep);
>
>   	return ret;
>   }
> @@ -3727,10 +3883,12 @@ kmem_cache_alloc_trace(size_t size, struct kmem_cache *cachep, gfp_t flags)
>   {
>   	void *ret;
>
> +	cachep = mem_cgroup_get_kmem_cache(cachep, flags);
>   	ret = __cache_alloc(cachep, flags, __builtin_return_address(0));
>
>   	trace_kmalloc(_RET_IP_, ret,
>   		      size, slab_buffer_size(cachep), flags);
> +	mem_cgroup_put_kmem_cache(cachep);
>   	return ret;

We really need to bypass these when the root cgroup is the only user.
We could use static_branch() for that; a rough sketch follows...
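
Something like this at the call sites (sketch; memcg_kmem_enabled_key
is a made-up name, and the exact jump label API - static_branch() vs.
static_key_false() - depends on the tree this ends up being based on):

	void *ret;
	bool accounted;

	/* evaluate once, so the get/put pair stays balanced even if the
	 * key gets enabled in the middle of an allocation */
	accounted = static_branch(&memcg_kmem_enabled_key);

	if (accounted)
		cachep = mem_cgroup_get_kmem_cache(cachep, flags);
	ret = __cache_alloc(cachep, flags, __builtin_return_address(0));
	if (accounted)
		mem_cgroup_put_kmem_cache(cachep);

with the key bumped the first time a non-root memcg starts doing kmem
accounting.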

>   }
>   EXPORT_SYMBOL(kmem_cache_alloc_trace);
> @@ -3739,12 +3897,16 @@ EXPORT_SYMBOL(kmem_cache_alloc_trace);
>   #ifdef CONFIG_NUMA
>   void *kmem_cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid)
>   {
> -	void *ret = __cache_alloc_node(cachep, flags, nodeid,
> +	void *ret;
> +
> +	cachep = mem_cgroup_get_kmem_cache(cachep, flags);
> +	ret  = __cache_alloc_node(cachep, flags, nodeid,
>   				       __builtin_return_address(0));
>
>   	trace_kmem_cache_alloc_node(_RET_IP_, ret,
>   				    obj_size(cachep), cachep->buffer_size,
>   				    flags, nodeid);
> +	mem_cgroup_put_kmem_cache(cachep);
>
>   	return ret;
>   }
> @@ -3758,11 +3920,13 @@ void *kmem_cache_alloc_node_trace(size_t size,
>   {
>   	void *ret;
>
> +	cachep = mem_cgroup_get_kmem_cache(cachep, flags);
>   	ret = __cache_alloc_node(cachep, flags, nodeid,
>   				  __builtin_return_address(0));
>   	trace_kmalloc_node(_RET_IP_, ret,
>   			   size, slab_buffer_size(cachep),
>   			   flags, nodeid);
> +	mem_cgroup_put_kmem_cache(cachep);
>   	return ret;
>   }
>   EXPORT_SYMBOL(kmem_cache_alloc_node_trace);
> @@ -3866,9 +4030,35 @@ void kmem_cache_free(struct kmem_cache *cachep, void *objp)
>
>   	local_irq_save(flags);
>   	debug_check_no_locks_freed(objp, obj_size(cachep));
> +
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> +	{
> +		struct kmem_cache *actual_cachep;
> +
> +		actual_cachep = virt_to_cache(objp);
> +		if (actual_cachep != cachep) {
> +			VM_BUG_ON(actual_cachep->memcg_params.id != -1);
> +			VM_BUG_ON(actual_cachep->memcg_params.orig_cache !=
> +			    cachep);
> +			cachep = actual_cachep;
> +		}
> +		/*
> +		 * Grab a reference so that the cache is guaranteed to stay
> +		 * around.
> +		 * If we are freeing the last object of a dead memcg cache,
> +		 * the kmem_cache_drop_ref() at the end of this function
> +		 * will end up freeing the cache.
> +		 */
> +		kmem_cache_get_ref(cachep);
1) Another obvious candidate to be wrapped by static_branch()...
2) I don't truly follow why we need those references here. Can you
    give us an example of a situation in which the cache can go away?

Also note that we are making a function that used to operate mostly on
local data now issue two atomic operations.

> +	}
> +#endif
> +
>   	if (!(cachep->flags&  SLAB_DEBUG_OBJECTS))
>   		debug_check_no_obj_freed(objp, obj_size(cachep));
>   	__cache_free(cachep, objp, __builtin_return_address(0));
> +
> +	kmem_cache_drop_ref(cachep);
> +
>   	local_irq_restore(flags);
>
>   	trace_kmem_cache_free(_RET_IP_, objp);
> @@ -3896,9 +4086,19 @@ void kfree(const void *objp)
>   	local_irq_save(flags);
>   	kfree_debugcheck(objp);
>   	c = virt_to_cache(objp);
> +
> +	/*
> +	 * Grab a reference so that the cache is guaranteed to stay around.
> +	 * If we are freeing the last object of a dead memcg cache, the
> +	 * kmem_cache_drop_ref() at the end of this function will end up
> +	 * freeing the cache.
> +	 */
> +	kmem_cache_get_ref(c);
> +
>   	debug_check_no_locks_freed(objp, obj_size(c));
>   	debug_check_no_obj_freed(objp, obj_size(c));
>   	__cache_free(c, (void *)objp, __builtin_return_address(0));
> +	kmem_cache_drop_ref(c);
>   	local_irq_restore(flags);
>   }
>   EXPORT_SYMBOL(kfree);
> @@ -4167,6 +4367,13 @@ static void cache_reap(struct work_struct *w)
>   	list_for_each_entry(searchp,&cache_chain, next) {
>   		check_irq_on();
>
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> +		/* For memcg caches, make sure we only reap the active ones. */
> +		if (searchp->memcg_params.id == -1&&
> +		    !atomic_add_unless(&searchp->refcnt, 1, 0))
> +			continue;
> +#endif
> +
>   		/*
>   		 * We only take the l3 lock if absolutely necessary and we
>   		 * have established with reasonable certainty that
> @@ -4199,6 +4406,7 @@ static void cache_reap(struct work_struct *w)
>   			STATS_ADD_REAPED(searchp, freed);
>   		}
>   next:
> +		kmem_cache_drop_ref(searchp);
>   		cond_resched();
>   	}
>   	check_irq_on();


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 12/13] memcg: Per-memcg memory.kmem.slabinfo file.
  2012-03-09 20:39 ` [PATCH v2 12/13] memcg: Per-memcg memory.kmem.slabinfo file Suleiman Souhlal
@ 2012-03-11 10:35   ` Glauber Costa
  0 siblings, 0 replies; 45+ messages in thread
From: Glauber Costa @ 2012-03-11 10:35 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: cgroups, suleiman, kamezawa.hiroyu, penberg, cl, yinghan, hughd,
	gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel

On 03/10/2012 12:39 AM, Suleiman Souhlal wrote:
> This file shows all the kmem_caches used by a memcg.
>
> Signed-off-by: Suleiman Souhlal<suleiman@google.com>
Reviewed-by: Glauber Costa <glommer@parallels.com>

> ---
>   include/linux/slab.h     |    6 +++
>   include/linux/slab_def.h |    1 +
>   mm/memcontrol.c          |   18 +++++++++
>   mm/slab.c                |   88 ++++++++++++++++++++++++++++++++++------------
>   4 files changed, 90 insertions(+), 23 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index fd1d2ba..0ff5ee2 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -401,6 +401,12 @@ static inline void *kmalloc_no_account(size_t size, gfp_t flags)
>   	return kmalloc(size, flags);
>   }
>
> +static inline int
> +mem_cgroup_slabinfo(struct mem_cgroup *mem, struct seq_file *m)
> +{
> +	return 0;
> +}
> +
>   #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM&&  CONFIG_SLAB */
>
>   #endif	/* _LINUX_SLAB_H */
> diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
> index 248b8a9..fa6b272 100644
> --- a/include/linux/slab_def.h
> +++ b/include/linux/slab_def.h
> @@ -227,6 +227,7 @@ found:
>   #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>
>   void kmem_cache_drop_ref(struct kmem_cache *cachep);
> +int mem_cgroup_slabinfo(struct mem_cgroup *mem, struct seq_file *m);
>
>   static inline void
>   kmem_cache_get_ref(struct kmem_cache *cachep)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 9f5e9d8..4a4fa48 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4672,6 +4672,20 @@ static int mem_cgroup_independent_kmem_limit_write(struct cgroup *cgrp,
>   	return 0;
>   }
>
> +static int
> +mem_cgroup_slabinfo_show(struct cgroup *cgroup, struct cftype *ctf,
> +    struct seq_file *m)
> +{
> +	struct mem_cgroup *mem;
> +
> +	mem  = mem_cgroup_from_cont(cgroup);
> +
> +	if (mem == root_mem_cgroup)
> +		mem = NULL;
> +
> +	return mem_cgroup_slabinfo(mem, m);
> +}
> +
>   static struct cftype kmem_cgroup_files[] = {
>   	{
>   		.name = "kmem.independent_kmem_limit",
> @@ -4689,6 +4703,10 @@ static struct cftype kmem_cgroup_files[] = {
>   		.private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
>   		.read_u64 = mem_cgroup_read,
>   	},
> +	{
> +		.name = "kmem.slabinfo",
> +		.read_seq_string = mem_cgroup_slabinfo_show,
> +	},
>   };
>
>   static int register_kmem_files(struct cgroup *cont, struct cgroup_subsys *ss)
> diff --git a/mm/slab.c b/mm/slab.c
> index 02239ed..1b35799 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -4528,21 +4528,26 @@ static void s_stop(struct seq_file *m, void *p)
>   	mutex_unlock(&cache_chain_mutex);
>   }
>
> -static int s_show(struct seq_file *m, void *p)
> -{
> -	struct kmem_cache *cachep = list_entry(p, struct kmem_cache, next);
> -	struct slab *slabp;
> +struct slab_counts {
>   	unsigned long active_objs;
> +	unsigned long active_slabs;
> +	unsigned long num_slabs;
> +	unsigned long free_objects;
> +	unsigned long shared_avail;
>   	unsigned long num_objs;
> -	unsigned long active_slabs = 0;
> -	unsigned long num_slabs, free_objects = 0, shared_avail = 0;
> -	const char *name;
> -	char *error = NULL;
> -	int node;
> +};
> +
> +static char *
> +get_slab_counts(struct kmem_cache *cachep, struct slab_counts *c)
> +{
>   	struct kmem_list3 *l3;
> +	struct slab *slabp;
> +	char *error;
> +	int node;
> +
> +	error = NULL;
> +	memset(c, 0, sizeof(struct slab_counts));
>
> -	active_objs = 0;
> -	num_slabs = 0;
>   	for_each_online_node(node) {
>   		l3 = cachep->nodelists[node];
>   		if (!l3)
> @@ -4554,31 +4559,43 @@ static int s_show(struct seq_file *m, void *p)
>   		list_for_each_entry(slabp,&l3->slabs_full, list) {
>   			if (slabp->inuse != cachep->num&&  !error)
>   				error = "slabs_full accounting error";
> -			active_objs += cachep->num;
> -			active_slabs++;
> +			c->active_objs += cachep->num;
> +			c->active_slabs++;
>   		}
>   		list_for_each_entry(slabp,&l3->slabs_partial, list) {
>   			if (slabp->inuse == cachep->num&&  !error)
>   				error = "slabs_partial inuse accounting error";
>   			if (!slabp->inuse&&  !error)
>   				error = "slabs_partial/inuse accounting error";
> -			active_objs += slabp->inuse;
> -			active_slabs++;
> +			c->active_objs += slabp->inuse;
> +			c->active_slabs++;
>   		}
>   		list_for_each_entry(slabp,&l3->slabs_free, list) {
>   			if (slabp->inuse&&  !error)
>   				error = "slabs_free/inuse accounting error";
> -			num_slabs++;
> +			c->num_slabs++;
>   		}
> -		free_objects += l3->free_objects;
> +		c->free_objects += l3->free_objects;
>   		if (l3->shared)
> -			shared_avail += l3->shared->avail;
> +			c->shared_avail += l3->shared->avail;
>
>   		spin_unlock_irq(&l3->list_lock);
>   	}
> -	num_slabs += active_slabs;
> -	num_objs = num_slabs * cachep->num;
> -	if (num_objs - active_objs != free_objects&&  !error)
> +	c->num_slabs += c->active_slabs;
> +	c->num_objs = c->num_slabs * cachep->num;
> +
> +	return error;
> +}
> +
> +static int s_show(struct seq_file *m, void *p)
> +{
> +	struct kmem_cache *cachep = list_entry(p, struct kmem_cache, next);
> +	struct slab_counts c;
> +	const char *name;
> +	char *error;
> +
> +	error = get_slab_counts(cachep,&c);
> +	if (c.num_objs - c.active_objs != c.free_objects&&  !error)
>   		error = "free_objects accounting error";
>
>   	name = cachep->name;
> @@ -4586,12 +4603,12 @@ static int s_show(struct seq_file *m, void *p)
>   		printk(KERN_ERR "slab: cache %s error: %s\n", name, error);
>
>   	seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d",
> -		   name, active_objs, num_objs, cachep->buffer_size,
> +		   name, c.active_objs, c.num_objs, cachep->buffer_size,
>   		   cachep->num, (1<<  cachep->gfporder));
>   	seq_printf(m, " : tunables %4u %4u %4u",
>   		   cachep->limit, cachep->batchcount, cachep->shared);
>   	seq_printf(m, " : slabdata %6lu %6lu %6lu",
> -		   active_slabs, num_slabs, shared_avail);
> +		   c.active_slabs, c.num_slabs, c.shared_avail);
>   #if STATS
>   	{			/* list3 stats */
>   		unsigned long high = cachep->high_mark;
> @@ -4726,6 +4743,31 @@ static const struct file_operations proc_slabinfo_operations = {
>   	.release	= seq_release,
>   };
>
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> +int
> +mem_cgroup_slabinfo(struct mem_cgroup *mem, struct seq_file *m)
> +{
> +	struct kmem_cache *cachep;
> +	struct slab_counts c;
> +
> +	seq_printf(m, "# name<active_objs>  <num_objs>  <objsize>\n");
> +
> +	mutex_lock(&cache_chain_mutex);
> +	list_for_each_entry(cachep,&cache_chain, next) {
> +		if (cachep->memcg_params.memcg != mem)
> +			continue;
> +
> +		get_slab_counts(cachep,&c);
> +
> +		seq_printf(m, "%-17s %6lu %6lu %6u\n", cachep->name,
> +		   c.active_objs, c.num_objs, cachep->buffer_size);
> +	}
> +	mutex_unlock(&cache_chain_mutex);
> +
> +	return 0;
> +}
> +#endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
> +
>   #ifdef CONFIG_DEBUG_SLAB_LEAK
>
>   static void *leaks_start(struct seq_file *m, loff_t *pos)


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 13/13] memcg: Document kernel memory accounting.
  2012-03-09 20:39 ` [PATCH v2 13/13] memcg: Document kernel memory accounting Suleiman Souhlal
@ 2012-03-11 10:42   ` Glauber Costa
  0 siblings, 0 replies; 45+ messages in thread
From: Glauber Costa @ 2012-03-11 10:42 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: cgroups, suleiman, kamezawa.hiroyu, penberg, cl, yinghan, hughd,
	gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel

On 03/10/2012 12:39 AM, Suleiman Souhlal wrote:
> Signed-off-by: Suleiman Souhlal<suleiman@google.com>
> ---
>   Documentation/cgroups/memory.txt |   44 ++++++++++++++++++++++++++++++++++---
>   1 files changed, 40 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index 4c95c00..73f2e38 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -74,6 +74,11 @@ Brief summary of control files.
>
>    memory.kmem.tcp.limit_in_bytes  # set/show hard limit for tcp buf memory
>    memory.kmem.tcp.usage_in_bytes  # show current tcp buf memory allocation
> + memory.kmem.usage_in_bytes	 # show current kernel memory usage
> + memory.kmem.limit_in_bytes	 # show/set limit of kernel memory usage
> + memory.kmem.independent_kmem_limit # show/set control of kernel memory limit
> + 				    (See 2.7 for details)
> + memory.kmem.slabinfo		 # show cgroup's slabinfo
>
>   1. History
>
> @@ -265,11 +270,19 @@ the amount of kernel memory used by the system. Kernel memory is fundamentally
>   different than user memory, since it can't be swapped out, which makes it
>   possible to DoS the system by consuming too much of this precious resource.
>
> -Kernel memory limits are not imposed for the root cgroup. Usage for the root
> -cgroup may or may not be accounted.
> +Kernel memory limits are not imposed for the root cgroup.
>
> -Currently no soft limit is implemented for kernel memory. It is future work
> -to trigger slab reclaim when those limits are reached.
> +A cgroup's kernel memory is counted into its memory.kmem.usage_in_bytes.
> +
> +memory.kmem.independent_kmem_limit controls whether or not kernel memory
> +should also be counted into the cgroup's memory.usage_in_bytes.
> +If it is set, it is possible to specify a limit for kernel memory with
> +memory.kmem.limit_in_bytes.
> +
> +Upon cgroup deletion, all the remaining kernel memory becomes unaccounted.
> +
> +An accounted kernel memory allocation may trigger reclaim in that cgroup,
> +and may also OOM.
Why delete the softlimit bit? Since we're not shrinking, at least for 
the independent kmem case, we effectively don't do softlimits here. The 
file for it does not even exist...

>
>   2.7.1 Current Kernel Memory resources accounted
>
> @@ -279,6 +292,29 @@ per cgroup, instead of globally.
>
>   * tcp memory pressure: sockets memory pressure for the tcp protocol.
>
> +* slab memory.
> +
> +2.7.1.1 Slab memory accounting
> +
> +Any slab type created with the SLAB_MEMCG_ACCT kmem_cache_create() flag
> +is accounted.
> +
> +Slab gets accounted on a per-page basis, which is done by using per-cgroup
> +kmem_caches. These per-cgroup kmem_caches get created on-demand, the first
> +time a specific kmem_cache gets used by a cgroup.
> +
> +Only slab memory that can be attributed to a cgroup gets accounted in this
> +fashion.
> +
> +A per-cgroup kmem_cache is named like the original, with the cgroup's name
> +in parentheses.
> +
> +When a cgroup is destroyed, all its kmem_caches get migrated to the root
> +cgroup, and "dead" is appended to their name, to indicate that they are not
> +going to be used for new allocations.
> +These dead caches automatically get removed once there are no more active
> +slab objects in them.
> +
>   3. User Interface
>
>   0. Configuration


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 04/13] memcg: Make it possible to use the stock for more than one page.
  2012-03-09 20:39 ` [PATCH v2 04/13] memcg: Make it possible to use the stock for more than one page Suleiman Souhlal
@ 2012-03-11 10:49   ` Glauber Costa
  0 siblings, 0 replies; 45+ messages in thread
From: Glauber Costa @ 2012-03-11 10:49 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: cgroups, suleiman, kamezawa.hiroyu, penberg, cl, yinghan, hughd,
	gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel

On 03/10/2012 12:39 AM, Suleiman Souhlal wrote:
> Signed-off-by: Suleiman Souhlal<suleiman@google.com>
> ---
>   mm/memcontrol.c |   18 +++++++++---------
>   1 files changed, 9 insertions(+), 9 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6fbb438..f605100 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1965,19 +1965,19 @@ static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
>   static DEFINE_MUTEX(percpu_charge_mutex);
>
>   /*
> - * Try to consume stocked charge on this cpu. If success, one page is consumed
> - * from local stock and true is returned. If the stock is 0 or charges from a
> - * cgroup which is not current target, returns false. This stock will be
> - * refilled.
> + * Try to consume stocked charge on this cpu. If success, nr_pages pages are
> + * consumed from local stock and true is returned. If the stock is 0 or
> + * charges from a cgroup which is not current target, returns false.
> + * This stock will be refilled.
>    */
> -static bool consume_stock(struct mem_cgroup *memcg)
> +static bool consume_stock(struct mem_cgroup *memcg, int nr_pages)
>   {
>   	struct memcg_stock_pcp *stock;
>   	bool ret = true;
>
>   	stock =&get_cpu_var(memcg_stock);
> -	if (memcg == stock->cached&&  stock->nr_pages)
> -		stock->nr_pages--;
> +	if (memcg == stock->cached&&  stock->nr_pages>= nr_pages)
> +		stock->nr_pages -= nr_pages;
>   	else /* need to call res_counter_charge */
>   		ret = false;
>   	put_cpu_var(memcg_stock);
> @@ -2290,7 +2290,7 @@ again:
>   		VM_BUG_ON(css_is_removed(&memcg->css));
>   		if (mem_cgroup_is_root(memcg))
>   			goto done;
> -		if (nr_pages == 1&&  consume_stock(memcg))
> +		if (consume_stock(memcg, nr_pages))
>   			goto done;
>   		css_get(&memcg->css);
>   	} else {
> @@ -2315,7 +2315,7 @@ again:
>   			rcu_read_unlock();
>   			goto done;
>   		}
> -		if (nr_pages == 1&&  consume_stock(memcg)) {
> +		if (consume_stock(memcg, nr_pages)) {
>   			/*
>   			 * It seems dagerous to access memcg without css_get().
>   			 * But considering how consume_stok works, it's not

This patch itself is fine in what it wants to achieve.
But it made me think:

We'll jump into the stock code, which makes user allocations faster,
but we're not getting the benefit of it when we're accounting kmem.
Since we're charging both res_counters, we're actually defeating it
altogether, because we now have to go to the global pool *every time*
(for memcg->kmem).

It would make a whole lot more sense to have the stock code moved to the
res_counter. We're now starting to have more users of that anyway, so
a common implementation makes sense.
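
Rough sketch of what I mean (nothing like this exists today; all names
here are invented):

/* kernel/res_counter.c: a per-cpu stock shared by every res_counter */
struct res_counter_stock {
	struct res_counter *cached;
	unsigned long nr_bytes;
};
static DEFINE_PER_CPU(struct res_counter_stock, rc_stock);

static bool res_counter_consume_stock(struct res_counter *rc,
    unsigned long bytes)
{
	struct res_counter_stock *stock;
	bool ret = true;

	stock = &get_cpu_var(rc_stock);
	if (stock->cached == rc && stock->nr_bytes >= bytes)
		stock->nr_bytes -= bytes;
	else
		ret = false;
	put_cpu_var(rc_stock);

	return ret;
}

plus a matching refill on the charge slow path, so memcg->res,
memcg->memsw and memcg->kmem would all get the fast path for free.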


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 06/13] slab: Add kmem_cache_gfp_flags() helper function.
  2012-03-09 20:39 ` [PATCH v2 06/13] slab: Add kmem_cache_gfp_flags() helper function Suleiman Souhlal
@ 2012-03-11 10:53   ` Glauber Costa
  2012-03-13 23:21     ` Suleiman Souhlal
  0 siblings, 1 reply; 45+ messages in thread
From: Glauber Costa @ 2012-03-11 10:53 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: cgroups, suleiman, kamezawa.hiroyu, penberg, cl, yinghan, hughd,
	gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel

On 03/10/2012 12:39 AM, Suleiman Souhlal wrote:
> This function returns the gfp flags that are always applied to
> allocations of a kmem_cache.
>
> Signed-off-by: Suleiman Souhlal<suleiman@google.com>
> ---
>   include/linux/slab_def.h |    6 ++++++
>   include/linux/slob_def.h |    6 ++++++
>   include/linux/slub_def.h |    6 ++++++
>   3 files changed, 18 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
> index fbd1117..25f9a6a 100644
> --- a/include/linux/slab_def.h
> +++ b/include/linux/slab_def.h
> @@ -159,6 +159,12 @@ found:
>   	return __kmalloc(size, flags);
>   }
>
> +static inline gfp_t
> +kmem_cache_gfp_flags(struct kmem_cache *cachep)
> +{
> +	return cachep->gfpflags;
> +}
> +
>   #ifdef CONFIG_NUMA
>   extern void *__kmalloc_node(size_t size, gfp_t flags, int node);
>   extern void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node);
> diff --git a/include/linux/slob_def.h b/include/linux/slob_def.h
> index 0ec00b3..3fa527d 100644
> --- a/include/linux/slob_def.h
> +++ b/include/linux/slob_def.h
> @@ -34,4 +34,10 @@ static __always_inline void *__kmalloc(size_t size, gfp_t flags)
>   	return kmalloc(size, flags);
>   }
>
> +static inline gfp_t
> +kmem_cache_gfp_flags(struct kmem_cache *cachep)
> +{
> +	return 0;
> +}
> +
>   #endif /* __LINUX_SLOB_DEF_H */
> diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
> index a32bcfd..5911d81 100644
> --- a/include/linux/slub_def.h
> +++ b/include/linux/slub_def.h
> @@ -313,4 +313,10 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
>   }
>   #endif
>
> +static inline gfp_t
> +kmem_cache_gfp_flags(struct kmem_cache *cachep)
> +{
> +	return cachep->allocflags;
> +}
> +

Why is this needed? Can't the caller just call 
mem_cgroup_get_kmem_cache(cachep, flags | cachep->allocflags) ?

>   #endif /* _LINUX_SLUB_DEF_H */


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 09/13] memcg: Account for kmalloc in kernel memory accounting.
  2012-03-09 20:39 ` [PATCH v2 09/13] memcg: Account for kmalloc " Suleiman Souhlal
@ 2012-03-11 12:21   ` Glauber Costa
  0 siblings, 0 replies; 45+ messages in thread
From: Glauber Costa @ 2012-03-11 12:21 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: cgroups, suleiman, kamezawa.hiroyu, penberg, cl, yinghan, hughd,
	gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel

On 03/10/2012 12:39 AM, Suleiman Souhlal wrote:
> In order to do this, we have to create a kmalloc_no_account()
> function that is used for kmalloc allocations that we do not
> want to account, because the kernel memory accounting code has
> to make some kmalloc allocations and is not allowed to recurse.
>
> Signed-off-by: Suleiman Souhlal<suleiman@google.com>
> ---
>   include/linux/slab.h |    8 ++++++++
>   mm/memcontrol.c      |    2 +-
>   mm/slab.c            |   42 +++++++++++++++++++++++++++++++++++++++---
>   3 files changed, 48 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index bc9f87f..906a015 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -377,6 +377,8 @@ struct kmem_cache *kmem_cache_create_memcg(struct kmem_cache *cachep,
>       char *name);
>   void kmem_cache_drop_ref(struct kmem_cache *cachep);
>
> +void *kmalloc_no_account(size_t size, gfp_t flags);
> +
>   #else /* !CONFIG_CGROUP_MEM_RES_CTLR_KMEM || !CONFIG_SLAB */
>
>   #define MAX_KMEM_CACHE_TYPES 0
> @@ -392,6 +394,12 @@ static inline void
>   kmem_cache_drop_ref(struct kmem_cache *cachep)
>   {
>   }
> +
> +static inline void *kmalloc_no_account(size_t size, gfp_t flags)
> +{
> +	return kmalloc(size, flags);
> +}
> +
>   #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM&&  CONFIG_SLAB */
>
>   #endif	/* _LINUX_SLAB_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a5593cf..72e83af 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5824,7 +5824,7 @@ memcg_create_cache_enqueue(struct mem_cgroup *memcg, struct kmem_cache *cachep)
>   	if (!css_tryget(&memcg->css))
>   		return;
>
> -	cw = kmalloc(sizeof(struct create_work), GFP_NOWAIT);
> +	cw = kmalloc_no_account(sizeof(struct create_work), GFP_NOWAIT);
>   	if (cw == NULL) {
>   		css_put(&memcg->css);
>   		return;
> diff --git a/mm/slab.c b/mm/slab.c
> index 95b024c..7af7384 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -1623,8 +1623,8 @@ void __init kmem_cache_init(void)
>   			sizes->cs_cachep = kmem_cache_create(names->name,
>   					sizes->cs_size,
>   					ARCH_KMALLOC_MINALIGN,
> -					ARCH_KMALLOC_FLAGS|SLAB_PANIC,
> -					NULL);
> +					ARCH_KMALLOC_FLAGS | SLAB_PANIC |
> +					SLAB_MEMCG_ACCT, NULL);
>   		}
>   #ifdef CONFIG_ZONE_DMA
>   		sizes->cs_dmacachep = kmem_cache_create(
> @@ -3936,11 +3936,16 @@ static __always_inline void *
>   __do_kmalloc_node(size_t size, gfp_t flags, int node, void *caller)
>   {
>   	struct kmem_cache *cachep;
> +	void *ret;
>
>   	cachep = kmem_find_general_cachep(size, flags);
>   	if (unlikely(ZERO_OR_NULL_PTR(cachep)))
>   		return cachep;
> -	return kmem_cache_alloc_node_trace(size, cachep, flags, node);
> +	cachep = mem_cgroup_get_kmem_cache(cachep, flags);
> +	ret = kmem_cache_alloc_node_trace(size, cachep, flags, node);
> +	mem_cgroup_put_kmem_cache(cachep);
> +
> +	return ret;
>   }
>
>   #if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_TRACING)
> @@ -3986,14 +3991,31 @@ static __always_inline void *__do_kmalloc(size_t size, gfp_t flags,
>   	cachep = __find_general_cachep(size, flags);
>   	if (unlikely(ZERO_OR_NULL_PTR(cachep)))
>   		return cachep;
> +	cachep = mem_cgroup_get_kmem_cache(cachep, flags);
>   	ret = __cache_alloc(cachep, flags, caller);
>
>   	trace_kmalloc((unsigned long) caller, ret,
>   		      size, cachep->buffer_size, flags);
> +	mem_cgroup_put_kmem_cache(cachep);
>
>   	return ret;
>   }
>
> +static __always_inline void *
> +__do_kmalloc_no_account(size_t size, gfp_t flags, void *caller)
> +{
> +	struct kmem_cache *cachep;
> +	void *ret;
> +
> +	cachep = __find_general_cachep(size, flags);
> +	if (unlikely(ZERO_OR_NULL_PTR(cachep)))
> +		return cachep;
> +	ret = __cache_alloc(cachep, flags, caller);
> +	trace_kmalloc((unsigned long)caller, ret, size, cachep->buffer_size,
> +	    flags);
> +
> +	return ret;
> +}
>
>   #if defined(CONFIG_DEBUG_SLAB) || defined(CONFIG_TRACING)
>   void *__kmalloc(size_t size, gfp_t flags)
> @@ -4008,12 +4030,26 @@ void *__kmalloc_track_caller(size_t size, gfp_t flags, unsigned long caller)
>   }
>   EXPORT_SYMBOL(__kmalloc_track_caller);
Isn't this the one that is used in skbuffs?  We'd need them as noaccount 
as well, right?

>
> +void *
> +kmalloc_no_account(size_t size, gfp_t flags)
> +{
> +	return __do_kmalloc_no_account(size, flags,
> +	    __builtin_return_address(0));
> +}
> +EXPORT_SYMBOL(kmalloc_no_account);
>   #else
>   void *__kmalloc(size_t size, gfp_t flags)
>   {
>   	return __do_kmalloc(size, flags, NULL);
>   }
>   EXPORT_SYMBOL(__kmalloc);
> +
> +void *
> +kmalloc_no_account(size_t size, gfp_t flags)
> +{
> +	return __do_kmalloc_no_account(size, flags, NULL);
> +}
> +EXPORT_SYMBOL(kmalloc_no_account);
>   #endif
>
>   /**


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 02/13] memcg: Kernel memory accounting infrastructure.
  2012-03-09 20:39 ` [PATCH v2 02/13] memcg: Kernel memory accounting infrastructure Suleiman Souhlal
  2012-03-11  8:12   ` Glauber Costa
@ 2012-03-12 12:38   ` Glauber Costa
  1 sibling, 0 replies; 45+ messages in thread
From: Glauber Costa @ 2012-03-12 12:38 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: cgroups, suleiman, kamezawa.hiroyu, penberg, cl, yinghan, hughd,
	gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel

On 03/10/2012 12:39 AM, Suleiman Souhlal wrote:
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> +int
> +memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, long long delta)
> +{
> +	struct res_counter *fail_res;
> +	struct mem_cgroup *_memcg;
> +	int may_oom, ret;
> +
> +	may_oom = (gfp & __GFP_WAIT) && (gfp & __GFP_FS) &&
> +	    !(gfp & __GFP_NORETRY);
> +
> +	ret = 0;
> +
> +	_memcg = memcg;
> +	if (memcg && !mem_cgroup_test_flag(memcg,
> +	    MEMCG_INDEPENDENT_KMEM_LIMIT)) {
> +		ret = __mem_cgroup_try_charge(NULL, gfp, delta / PAGE_SIZE,
> +		    &_memcg, may_oom);
> +		if (ret == -ENOMEM)
> +			return ret;
> +	}
> +
> +	if (memcg && _memcg == memcg)
> +		ret = res_counter_charge(&memcg->kmem, delta, &fail_res);
> +
> +	return ret;
> +}
> +
> +void
Ok.

So I've spent most of the day today trying to come up with a way to avoid
killing the performance we gain from consume_stock() with the
res_counter_charge() to kmem that follows it...

You mentioned you still want to be able to bill to memcg->kmem, mostly
for debugging/display purposes. So we're surely not using all of the
res_counter infrastructure (limiting, soft limits, etc.).

I was thinking: can't we have a percpu_counter that we use for this
purpose when !kmem_independent?

We may not even need to bloat the struct, since we can fold it into a
union with struct res_counter kmem (which is bigger than a percpu
counter anyway).

We just need to be a bit more careful not to allow kmem_independent to
change when we already have charges to either of them (but we need to do
that anyway).
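
Something like this, maybe (just a sketch to illustrate the idea; the field
names are made up here, not taken from the posted patches):

struct mem_cgroup {
	struct res_counter res;		/* user + (non-independent) kernel memory */
	struct res_counter memsw;	/* memory + swap */
	union {
		/* used when an independent kmem limit is set */
		struct res_counter kmem;
		/* otherwise: cheap, display-only kmem accounting */
		struct percpu_counter kmem_bytes;
	};
	/* ... */
};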



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 02/13] memcg: Kernel memory accounting infrastructure.
  2012-03-11  8:12   ` Glauber Costa
@ 2012-03-13  6:24     ` KAMEZAWA Hiroyuki
  2012-03-13 10:37       ` Glauber Costa
  0 siblings, 1 reply; 45+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-03-13  6:24 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Suleiman Souhlal, cgroups, suleiman, penberg, cl, yinghan, hughd,
	gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel

On Sun, 11 Mar 2012 12:12:04 +0400
Glauber Costa <glommer@parallels.com> wrote:

> On 03/10/2012 12:39 AM, Suleiman Souhlal wrote:
> > Enabled with CONFIG_CGROUP_MEM_RES_CTLR_KMEM.
> >
> > Adds the following files:
> >      - memory.kmem.independent_kmem_limit
> >      - memory.kmem.usage_in_bytes
> >      - memory.kmem.limit_in_bytes
> >
> > Signed-off-by: Suleiman Souhlal<suleiman@google.com>
> > ---
> >   mm/memcontrol.c |  136 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
> >   1 files changed, 135 insertions(+), 1 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 37ad2cb..e6fd558 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -220,6 +220,10 @@ enum memcg_flags {
> >   				 */
> >   	MEMCG_MEMSW_IS_MINIMUM,	/* Set when res.limit == memsw.limit */
> >   	MEMCG_OOM_KILL_DISABLE,	/* OOM-Killer disable */
> > +	MEMCG_INDEPENDENT_KMEM_LIMIT,	/*
> > +					 * kernel memory is not counted in
> > +					 * memory.usage_in_bytes
> > +					 */
> >   };


After looking at the code, I think we need to reconsider
whether independent_kmem_limit is good or not....

How about adding a MEMCG_KMEM_ACCOUNT flag instead of this, and using only
memcg->res/memcg->memsw rather than adding a new counter, memcg->kmem?

if MEMCG_KMEM_ACCOUNT is set     -> slab is accounted to mem->res/memsw.
if MEMCG_KMEM_ACCOUNT is not set -> slab is never accounted.

(I think an on/off switch is required..)
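
Roughly, what I have in mind at charge time (just a sketch; the flag name is
from the proposal above, mem_cgroup_test_flag() and __mem_cgroup_try_charge()
are reused from the quoted patch, and everything else is illustrative):

static int
memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, long long delta)
{
	struct mem_cgroup *_memcg = memcg;

	/* kmem accounting switched off: slab is never accounted */
	if (!memcg || !mem_cgroup_test_flag(memcg, MEMCG_KMEM_ACCOUNT))
		return 0;

	/* accounted to the regular res/memsw counters; no separate memcg->kmem */
	return __mem_cgroup_try_charge(NULL, gfp, delta / PAGE_SIZE,
	    &_memcg, false);
}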

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 03/13] memcg: Uncharge all kmem when deleting a cgroup.
  2012-03-09 20:39 ` [PATCH v2 03/13] memcg: Uncharge all kmem when deleting a cgroup Suleiman Souhlal
  2012-03-11  8:19   ` Glauber Costa
@ 2012-03-13  6:27   ` KAMEZAWA Hiroyuki
  1 sibling, 0 replies; 45+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-03-13  6:27 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: cgroups, suleiman, glommer, penberg, cl, yinghan, hughd, gthelen,
	peterz, dan.magenheimer, hannes, mgorman, James.Bottomley,
	linux-mm, devel, linux-kernel

On Fri,  9 Mar 2012 12:39:06 -0800
Suleiman Souhlal <ssouhlal@FreeBSD.org> wrote:

> Signed-off-by: Suleiman Souhlal <suleiman@google.com>
> ---
>  mm/memcontrol.c |   31 ++++++++++++++++++++++++++++++-
>  1 files changed, 30 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e6fd558..6fbb438 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -382,6 +382,7 @@ static void mem_cgroup_get(struct mem_cgroup *memcg);
>  static void mem_cgroup_put(struct mem_cgroup *memcg);
>  static void memcg_kmem_init(struct mem_cgroup *memcg,
>      struct mem_cgroup *parent);
> +static void memcg_kmem_move(struct mem_cgroup *memcg);
>  
>  static inline bool
>  mem_cgroup_test_flag(const struct mem_cgroup *memcg, enum memcg_flags flag)
> @@ -3700,6 +3701,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg, bool free_all)
>  	int ret;
>  	int node, zid, shrink;
>  	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> +	unsigned long usage;
>  	struct cgroup *cgrp = memcg->css.cgroup;
>  
>  	css_get(&memcg->css);
> @@ -3719,6 +3721,8 @@ move_account:
>  		/* This is for making all *used* pages to be on LRU. */
>  		lru_add_drain_all();
>  		drain_all_stock_sync(memcg);
> +		if (!free_all)
> +			memcg_kmem_move(memcg);
>  		ret = 0;
>  		mem_cgroup_start_move(memcg);
>  		for_each_node_state(node, N_HIGH_MEMORY) {
> @@ -3740,8 +3744,14 @@ move_account:
>  		if (ret == -ENOMEM)
>  			goto try_to_free;
>  		cond_resched();
> +		usage = memcg->res.usage;
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
> +		if (free_all && !mem_cgroup_test_flag(memcg,
> +		    MEMCG_INDEPENDENT_KMEM_LIMIT))
> +			usage -= memcg->kmem.usage;
> +#endif
>  	/* "ret" should also be checked to ensure all lists are empty. */
> -	} while (memcg->res.usage > 0 || ret);
> +	} while (usage > 0 || ret);
>  out:
>  	css_put(&memcg->css);
>  	return ret;
> @@ -5689,9 +5699,28 @@ memcg_kmem_init(struct mem_cgroup *memcg, struct mem_cgroup *parent)
>  		parent_res = &parent->kmem;
>  	res_counter_init(&memcg->kmem, parent_res);
>  }
> +
> +static void
> +memcg_kmem_move(struct mem_cgroup *memcg)

the function name says 'move', but the code seems to just 'forget'
or 'leak' the charges...


> +{
> +	unsigned long flags;
> +	long kmem;
> +
> +	spin_lock_irqsave(&memcg->kmem.lock, flags);
> +	kmem = memcg->kmem.usage;
> +	res_counter_uncharge_locked(&memcg->kmem, kmem);
> +	spin_unlock_irqrestore(&memcg->kmem.lock, flags);
> +	if (!mem_cgroup_test_flag(memcg, MEMCG_INDEPENDENT_KMEM_LIMIT))
> +		res_counter_uncharge(&memcg->res, kmem);
> +}

please update memcg->memsw, too.
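
Something along these lines, perhaps (just a sketch; it assumes the memsw
uncharge belongs under the same condition as the res uncharge, and it may
also need a do_swap_account check):

static void
memcg_kmem_move(struct mem_cgroup *memcg)
{
	unsigned long flags;
	long kmem;

	spin_lock_irqsave(&memcg->kmem.lock, flags);
	kmem = memcg->kmem.usage;
	res_counter_uncharge_locked(&memcg->kmem, kmem);
	spin_unlock_irqrestore(&memcg->kmem.lock, flags);
	if (!mem_cgroup_test_flag(memcg, MEMCG_INDEPENDENT_KMEM_LIMIT)) {
		res_counter_uncharge(&memcg->res, kmem);
		res_counter_uncharge(&memcg->memsw, kmem);
	}
}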

Thanks,
-Kame



>  #else /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
>  static void
>  memcg_kmem_init(struct mem_cgroup *memcg, struct mem_cgroup *parent)
>  {
>  }
> +
> +static void
> +memcg_kmem_move(struct mem_cgroup *memcg)
> +{
> +}
>  #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
> -- 
> 1.7.7.3
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 02/13] memcg: Kernel memory accounting infrastructure.
  2012-03-13  6:24     ` KAMEZAWA Hiroyuki
@ 2012-03-13 10:37       ` Glauber Costa
  2012-03-13 17:00         ` Greg Thelen
  2012-03-14  0:15         ` KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 45+ messages in thread
From: Glauber Costa @ 2012-03-13 10:37 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Suleiman Souhlal, cgroups, suleiman, penberg, cl, yinghan, hughd,
	gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel

> After looking codes, I think we need to think
> whether independent_kmem_limit is good or not....
>
> How about adding MEMCG_KMEM_ACCOUNT flag instead of this and use only
> memcg->res/memcg->memsw rather than adding a new counter, memcg->kmem ?
>
> if MEMCG_KMEM_ACCOUNT is set     -> slab is accounted to mem->res/memsw.
> if MEMCG_KMEM_ACCOUNT is not set ->  slab is never accounted.
>
> (I think On/Off switch is required..)
>
> Thanks,
> -Kame
>

This has been discussed before; I can probably find it in the archives
if you want to go back and see it.

But in a nutshell:

1) Supposing the independent knob disappears (I will explain in item 2 why I
don't want it to), I don't think a flag makes sense either. *If* we are
planning to enable/disable this, it might make more sense to put some
work into it, and allow particular slabs to be enabled/disabled by writing
to memory.kmem.slabinfo (-* would disable all, +* enable all, +kmalloc*
enable all kmalloc, etc).

Alternatively, what we could do instead is something similar to what
ended up being done for tcp, at the request of the network people: if you
never touch the limit file, don't bother with it at all, and simply do
not account. With Suleiman's lazy allocation infrastructure, that should
actually be trivial. And then again, a flag is not necessary, because
writing to the limit file does the job, and also conveys the meaning well
enough.

2) For the kernel itself, we are mostly concerned that a malicious 
container may pin into memory big amounts of kernel memory which is, 
ultimately, unreclaimable. In particular, with overcommit allowed 
scenarios, you can fill the whole physical memory (or at least a 
significant part) with those objects, well beyond your softlimit 
allowance, making the creation of further containers impossible.
With user memory, you can reclaim the cgroup back to its place. With 
kernel memory, you can't.

In the particular example of 32-bit boxes, you can easily fill up a 
large part of the available 1gb kernel memory with pinned memory and 
render the whole system unresponsive.

Never allowing the kernel memory to go beyond the soft limit was one of 
the proposed alternatives. However, it may force you to establish a soft
limit where one was not previously needed. Or, establish a low soft 
limit when you really need a bigger one.

All that said, while reading your message, thinking a bit, the following 
crossed my mind:

- We can account the slabs to memcg->res normally, and just store the
   information that this is kernel memory into a percpu counter, as
   I proposed recently.
- The knob goes away, and becomes implicit: if you ever write anything
   to memory.kmem.limit_in_bytes, we transfer that memory to a separate
   kmem res_counter, and proceed from there. We can keep accounting to
   memcg->res anyway, just that kernel memory will now have a separate
   limit.
- With this scheme, it may not be necessary to ever have a file
   memory.kmem.soft_limit_in_bytes. Reclaim is always part of the normal
   memcg reclaim.

What's outlined above would work for us, and would make the whole scheme
simpler, I believe.
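
To make the second point concrete, here is a rough sketch of the implicit
switch (illustrative only; the handler name and the kmem_limited/kmem_bytes
fields are assumptions, not code from any posted patch):

static int
memcg_set_kmem_limit(struct mem_cgroup *memcg, u64 limit)
{
	if (!memcg->kmem_limited) {
		struct res_counter *fail_res;
		u64 kmem_bytes;

		/* seed the dedicated counter with what is already charged */
		kmem_bytes = percpu_counter_sum(&memcg->kmem_bytes);
		res_counter_charge_nofail(&memcg->kmem, kmem_bytes, &fail_res);
		memcg->kmem_limited = true;
	}
	/* kernel memory keeps being charged to memcg->res as well */
	return res_counter_set_limit(&memcg->kmem, limit);
}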

What do you think ?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 02/13] memcg: Kernel memory accounting infrastructure.
  2012-03-13 10:37       ` Glauber Costa
@ 2012-03-13 17:00         ` Greg Thelen
  2012-03-13 17:31           ` Glauber Costa
  2012-03-14  0:15         ` KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 45+ messages in thread
From: Greg Thelen @ 2012-03-13 17:00 UTC (permalink / raw)
  To: Glauber Costa
  Cc: KAMEZAWA Hiroyuki, Suleiman Souhlal, cgroups, suleiman, penberg,
	cl, yinghan, hughd, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel, rientjes

Glauber Costa <glommer@parallels.com> writes:
> 2) For the kernel itself, we are mostly concerned that a malicious container may
> pin into memory big amounts of kernel memory which is, ultimately,
> unreclaimable. In particular, with overcommit allowed scenarios, you can fill
> the whole physical memory (or at least a significant part) with those objects,
> well beyond your softlimit allowance, making the creation of further containers
> impossible.
> With user memory, you can reclaim the cgroup back to its place. With kernel
> memory, you can't.

In overcommit situations the page allocator starts failing even though
the memcg can still charge pages.  When page allocations fail, the oom killer
plays a role.  Page allocations can fail even without malicious usage of
kernel memory (e.g. lots of mlock or anon without swap can fill a
machine).  I assume that the kernel memory pinned by the malicious
containers will be freed, or at least become reclaimable, once the
processes in those containers are killed (oom or otherwise).  We
have been making use of the oom killer to save a system from
irreconcilable overcommit situations.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 02/13] memcg: Kernel memory accounting infrastructure.
  2012-03-13 17:00         ` Greg Thelen
@ 2012-03-13 17:31           ` Glauber Costa
  0 siblings, 0 replies; 45+ messages in thread
From: Glauber Costa @ 2012-03-13 17:31 UTC (permalink / raw)
  To: Greg Thelen
  Cc: KAMEZAWA Hiroyuki, Suleiman Souhlal, cgroups, suleiman, penberg,
	cl, yinghan, hughd, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel, rientjes

On 03/13/2012 09:00 PM, Greg Thelen wrote:
> Glauber Costa<glommer@parallels.com>  writes:
>> 2) For the kernel itself, we are mostly concerned that a malicious container may
>> pin into memory big amounts of kernel memory which is, ultimately,
>> unreclaimable. In particular, with overcommit allowed scenarios, you can fill
>> the whole physical memory (or at least a significant part) with those objects,
>> well beyond your softlimit allowance, making the creation of further containers
>> impossible.
>> With user memory, you can reclaim the cgroup back to its place. With kernel
>> memory, you can't.
>
> In overcommit situations the page allocator starts failing even though
> the memcg can still charge pages.
If you overcommit mem+swap, yes. If you overcommit mem, no: reclaim
happens first. And we don't have that option with pinned kernel memory.

Of course you *can* run your system without swap, but the whole thing
exists exactly because there are enough people who want to be able to
overcommit their physical memory without failing allocations.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 07/13] memcg: Slab accounting.
  2012-03-11 10:25   ` Glauber Costa
@ 2012-03-13 22:50     ` Suleiman Souhlal
  2012-03-14 10:47       ` Glauber Costa
  0 siblings, 1 reply; 45+ messages in thread
From: Suleiman Souhlal @ 2012-03-13 22:50 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Suleiman Souhlal, cgroups, kamezawa.hiroyu, cl, yinghan, hughd,
	gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel, Pekka Enberg

On Sun, Mar 11, 2012 at 3:25 AM, Glauber Costa <glommer@parallels.com> wrote:
> On 03/10/2012 12:39 AM, Suleiman Souhlal wrote:
>> +static inline void
>> +mem_cgroup_kmem_cache_prepare_sleep(struct kmem_cache *cachep)
>> +{
>> +       /*
>> +        * Make sure the cache doesn't get freed while we have interrupts
>> +        * enabled.
>> +        */
>> +       kmem_cache_get_ref(cachep);
>> +       rcu_read_unlock();
>> +}
>
>
> Is this really needed ? After this function call in slab.c, the slab code
> itself accesses cachep a thousand times. If it could be freed, it would
> already explode today for other reasons?
> Am I missing something here?

We need this because once we drop the rcu_read_lock and go to sleep,
the memcg could get deleted, which could lead to the cachep from
getting deleted as well.

So, we need to grab a reference to the cache, to make sure that the
cache doesn't disappear from under us.

>> diff --git a/init/Kconfig b/init/Kconfig
>> index 3f42cd6..e7eb652 100644
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -705,7 +705,7 @@ config CGROUP_MEM_RES_CTLR_SWAP_ENABLED
>>          then swapaccount=0 does the trick).
>>  config CGROUP_MEM_RES_CTLR_KMEM
>>        bool "Memory Resource Controller Kernel Memory accounting
>> (EXPERIMENTAL)"
>> -       depends on CGROUP_MEM_RES_CTLR && EXPERIMENTAL
>> +       depends on CGROUP_MEM_RES_CTLR && EXPERIMENTAL && !SLOB
>
> Orthogonal question: Will we ever want this (SLOB) ?

I honestly don't know why someone would want to use this and slob at
the same time.
It really doesn't seem like a required feature, in my opinion.
Especially at first.

>> +static struct kmem_cache *
>> +memcg_create_kmem_cache(struct mem_cgroup *memcg, struct kmem_cache
>> *cachep)
>> +{
>> +       struct kmem_cache *new_cachep;
>> +       struct dentry *dentry;
>> +       char *name;
>> +       int idx;
>> +
>> +       idx = cachep->memcg_params.id;
>> +
>> +       dentry = memcg->css.cgroup->dentry;
>> +       BUG_ON(dentry == NULL);
>> +
>> +       /* Preallocate the space for "dead" at the end */
>> +       name = kasprintf(GFP_KERNEL, "%s(%d:%s)dead",
>> +           cachep->name, css_id(&memcg->css), dentry->d_name.name);
>> +       if (name == NULL)
>> +               return cachep;
>> +       /* Remove "dead" */
>> +       name[strlen(name) - 4] = '\0';
>> +
>> +       new_cachep = kmem_cache_create_memcg(cachep, name);
>> +
>> +       /*
>> +        * Another CPU is creating the same cache?
>> +        * We'll use it next time.
>> +        */
>
> This comment is a bit misleading. Is it really the only reason
> it can fail?
>
> The impression I got is that it can also fail under the normal conditions in
> which kmem_cache_create() fails.

kmem_cache_create() isn't expected to fail often.
I wasn't making an exhaustive list of why this condition can happen,
just noting what I think the most common cause is.

>> +/*
>> + * Enqueue the creation of a per-memcg kmem_cache.
>> + * Called with rcu_read_lock.
>> + */
>> +static void
>> +memcg_create_cache_enqueue(struct mem_cgroup *memcg, struct kmem_cache
>> *cachep)
>> +{
>> +       struct create_work *cw;
>> +       unsigned long flags;
>> +
>> +       spin_lock_irqsave(&create_queue_lock, flags);
>
> If we can sleep, why not just create the cache now?
>
> Maybe it would be better to split this in two, and create the cache if
> possible, and a worker if not possible. Then w

That's how I had it in my initial patch, but I was under the
impression that you preferred if we always kicked off the creation to
the workqueue?

Which way do you prefer?

>> @@ -1756,17 +1765,23 @@ static void *kmem_getpages(struct kmem_cache
>> *cachep, gfp_t flags, int nodeid)
>>        if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
>>                flags |= __GFP_RECLAIMABLE;
>>
>> +       nr_pages = (1 << cachep->gfporder);
>> +       if (!mem_cgroup_charge_slab(cachep, flags, nr_pages * PAGE_SIZE))
>> +               return NULL;
>> +
>>        page = alloc_pages_exact_node(nodeid, flags | __GFP_NOTRACK,
>> cachep->gfporder);
>> -       if (!page)
>> +       if (!page) {
>> +               mem_cgroup_uncharge_slab(cachep, nr_pages * PAGE_SIZE);
>>                return NULL;
>> +       }
>
>
>
> Can't the following happen:
>
>  *) mem_cgroup_charge_slab() is the first one to touch the slab.
>    Therefore, this first one is billed to root.
>  *) A slab is queued for creation.
>  *) alloc_pages sleep.
>  *) our workers run, and create the cache, therefore filling
>    cachep->memcg_param.memcg
>  *) alloc_pages still can't allocate.
>  *) uncharge tries to uncharge from cachep->memcg_param.memcg,
>    which doesn't have any charges...
>
> Unless you have a strong oposition to this, to avoid this kind of
> corner cases, we could do what I was doing in the slub:
> Allocate the page first, and then account it.
> (freeing the page if it fails).
>
> I know it is not the way it is done for the user pages, but I believe it to
> be better suited for the slab.

I don't think the situation you're describing can happen, because the
memcg caches get created and selected at the beginning of the slab
allocation, in mem_cgroup_get_kmem_cache() and not in
mem_cgroup_charge_slab(), which is much later.

Once we are in mem_cgroup_charge_slab() we know that the allocation
will be charged to the cgroup.
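
For reference, the ordering as it reads from the quoted diffs (a simplified
sketch in comment form; intermediate slab internals are elided):

/*
 * kmem_cache_alloc(cachep, flags)
 *   cachep = mem_cgroup_get_kmem_cache(cachep, flags);  // per-memcg cache
 *                                                       // selected here
 *   __cache_alloc(cachep, flags, ...)
 *     cache_grow(...)                          // only if a new slab page
 *                                              // is needed
 *       kmem_getpages(cachep, flags, nodeid)
 *         mem_cgroup_charge_slab(cachep, flags, ...);   // charged to that
 *                                                       // cache's memcg
 *         alloc_pages_exact_node(...);
 */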

>> @@ -2269,10 +2288,12 @@ kmem_cache_create (const char *name, size_t size,
>> size_t align,
>>                }
>>
>>                if (!strcmp(pc->name, name)) {
>> -                       printk(KERN_ERR
>> -                              "kmem_cache_create: duplicate cache %s\n",
>> name);
>> -                       dump_stack();
>> -                       goto oops;
>> +                       if (!memcg) {
>> +                               printk(KERN_ERR "kmem_cache_create:
>> duplicate"
>> +                                   " cache %s\n", name);
>> +                               dump_stack();
>> +                               goto oops;
>> +                       }
>
> Why? Since we are apending the memcg name at the end anyway, duplicates
> still aren't expected.

Duplicates can happen if you have hierarchies, because we're only
appending the basename of the cgroup.

>> @@ -2703,12 +2787,74 @@ void kmem_cache_destroy(struct kmem_cache *cachep)
>>        if (unlikely(cachep->flags & SLAB_DESTROY_BY_RCU))
>>                rcu_barrier();
>>
>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>> +       /* Not a memcg cache */
>> +       if (cachep->memcg_params.id != -1) {
>> +               __clear_bit(cachep->memcg_params.id, cache_types);
>> +               mem_cgroup_flush_cache_create_queue();
>> +       }
>> +#endif
>
>
> This will clear the id when a leaf cache is destroyed. It seems it is not
> what we want, right? We want this id to be cleared only when
> the parent cache is gone.

id != -1, for parent caches (that's what the comment is trying to point out).
I will improve the comment.

>> +static void
>> +kmem_cache_destroy_work_func(struct work_struct *w)
>> +{
>> +       struct kmem_cache *cachep;
>> +       char *name;
>> +
>> +       spin_lock_irq(&destroy_lock);
>> +       while (!list_empty(&destroyed_caches)) {
>> +               cachep = container_of(list_first_entry(&destroyed_caches,
>> +                   struct mem_cgroup_cache_params, destroyed_list),
>> struct
>> +                   kmem_cache, memcg_params);
>> +               name = (char *)cachep->name;
>> +               list_del(&cachep->memcg_params.destroyed_list);
>> +               spin_unlock_irq(&destroy_lock);
>> +               synchronize_rcu();
>> +               kmem_cache_destroy(cachep);
>
>                ^^^^^^
>                will destroy the id.

See my previous comment.

>> @@ -3866,9 +4030,35 @@ void kmem_cache_free(struct kmem_cache *cachep,
>> void *objp)
>>
>>        local_irq_save(flags);
>>        debug_check_no_locks_freed(objp, obj_size(cachep));
>> +
>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>> +       {
>> +               struct kmem_cache *actual_cachep;
>> +
>> +               actual_cachep = virt_to_cache(objp);
>> +               if (actual_cachep != cachep) {
>> +                       VM_BUG_ON(actual_cachep->memcg_params.id != -1);
>> +                       VM_BUG_ON(actual_cachep->memcg_params.orig_cache
>> !=
>> +                           cachep);
>> +                       cachep = actual_cachep;
>> +               }
>> +               /*
>> +                * Grab a reference so that the cache is guaranteed to
>> stay
>> +                * around.
>> +                * If we are freeing the last object of a dead memcg
>> cache,
>> +                * the kmem_cache_drop_ref() at the end of this function
>> +                * will end up freeing the cache.
>> +                */
>> +               kmem_cache_get_ref(cachep);
>
> 1) Another obvious candidate to be wrapped by static_branch()...
> 2) I don't trully follow why we need those references here. Can you
>   give us an example of a situation in which the cache can go away?
>
> Also note that we are making a function that used to operate mostly on
> local data now issue two atomic operations.

Yes, improving this is in my v3 TODO already.

The situation is very simple, and will happen every time we are
freeing the last object of a dead cache.
When we free the last object, kmem_freepages() will drop the last
reference, which will cause the kmem_cache to be destroyed right
there.
Grabbing an additional reference before freeing the page is just a
hack to avoid this situation.
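
In call-sequence form, that looks roughly like this (only the reference
counting is shown):

/*
 * kmem_cache_free(cachep, objp)        // last object of a dead memcg cache
 *   kmem_cache_get_ref(cachep);        // the "hack": pin the cache
 *   __cache_free(cachep, objp)
 *     ...
 *       kmem_freepages(cachep, addr)   // drops what would otherwise be the
 *                                      // last reference, destroying cachep
 *                                      // while kmem_cache_free still uses it
 *   kmem_cache_drop_ref(cachep);       // destruction can now happen here,
 *                                      // safely, at the very end
 */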

It might be possible to just wrap the free path in rcu_read_lock(), or
if that isn't enough, to delay the destruction until the end. I still
have to think about this a bit more, to be sure.

Thanks for the detailed review,
-- Suleiman

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 03/13] memcg: Uncharge all kmem when deleting a cgroup.
  2012-03-11  8:19   ` Glauber Costa
@ 2012-03-13 23:16     ` Suleiman Souhlal
  2012-03-14 11:59       ` Glauber Costa
  0 siblings, 1 reply; 45+ messages in thread
From: Suleiman Souhlal @ 2012-03-13 23:16 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Suleiman Souhlal, cgroups, kamezawa.hiroyu, penberg, cl, yinghan,
	hughd, gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel

On Sun, Mar 11, 2012 at 12:19 AM, Glauber Costa <glommer@parallels.com> wrote:
> On 03/10/2012 12:39 AM, Suleiman Souhlal wrote:
>>
>> Signed-off-by: Suleiman Souhlal<suleiman@google.com>
>> ---
>>  mm/memcontrol.c |   31 ++++++++++++++++++++++++++++++-
>>  1 files changed, 30 insertions(+), 1 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index e6fd558..6fbb438 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -382,6 +382,7 @@ static void mem_cgroup_get(struct mem_cgroup *memcg);
>>  static void mem_cgroup_put(struct mem_cgroup *memcg);
>>  static void memcg_kmem_init(struct mem_cgroup *memcg,
>>      struct mem_cgroup *parent);
>> +static void memcg_kmem_move(struct mem_cgroup *memcg);
>>
>>  static inline bool
>>  mem_cgroup_test_flag(const struct mem_cgroup *memcg, enum memcg_flags
>> flag)
>> @@ -3700,6 +3701,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup
>> *memcg, bool free_all)
>>        int ret;
>>        int node, zid, shrink;
>>        int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
>> +       unsigned long usage;
>>        struct cgroup *cgrp = memcg->css.cgroup;
>>
>>        css_get(&memcg->css);
>> @@ -3719,6 +3721,8 @@ move_account:
>>                /* This is for making all *used* pages to be on LRU. */
>>                lru_add_drain_all();
>>                drain_all_stock_sync(memcg);
>> +               if (!free_all)
>> +                       memcg_kmem_move(memcg);
>
> Any reason we're not moving kmem charges when free_all is set as well?

Because the slab moving code expects to be synchronized with
allocations (and itself). We can't call it when there are still tasks
in the cgroup.

-- Suleiman

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 06/13] slab: Add kmem_cache_gfp_flags() helper function.
  2012-03-11 10:53   ` Glauber Costa
@ 2012-03-13 23:21     ` Suleiman Souhlal
  2012-03-14 11:48       ` Glauber Costa
  0 siblings, 1 reply; 45+ messages in thread
From: Suleiman Souhlal @ 2012-03-13 23:21 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Suleiman Souhlal, cgroups, kamezawa.hiroyu, penberg, cl, yinghan,
	hughd, gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel

On Sun, Mar 11, 2012 at 3:53 AM, Glauber Costa <glommer@parallels.com> wrote:
> On 03/10/2012 12:39 AM, Suleiman Souhlal wrote:
>>
>> This function returns the gfp flags that are always applied to
>> allocations of a kmem_cache.
>>
>> Signed-off-by: Suleiman Souhlal<suleiman@google.com>
>> ---
>>  include/linux/slab_def.h |    6 ++++++
>>  include/linux/slob_def.h |    6 ++++++
>>  include/linux/slub_def.h |    6 ++++++
>>  3 files changed, 18 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
>> index fbd1117..25f9a6a 100644
>> --- a/include/linux/slab_def.h
>> +++ b/include/linux/slab_def.h
>> @@ -159,6 +159,12 @@ found:
>>        return __kmalloc(size, flags);
>>  }
>>
>> +static inline gfp_t
>> +kmem_cache_gfp_flags(struct kmem_cache *cachep)
>> +{
>> +       return cachep->gfpflags;
>> +}
>> +
>>  #ifdef CONFIG_NUMA
>>  extern void *__kmalloc_node(size_t size, gfp_t flags, int node);
>>  extern void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int
>> node);
>> diff --git a/include/linux/slob_def.h b/include/linux/slob_def.h
>> index 0ec00b3..3fa527d 100644
>> --- a/include/linux/slob_def.h
>> +++ b/include/linux/slob_def.h
>> @@ -34,4 +34,10 @@ static __always_inline void *__kmalloc(size_t size,
>> gfp_t flags)
>>        return kmalloc(size, flags);
>>  }
>>
>> +static inline gfp_t
>> +kmem_cache_gfp_flags(struct kmem_cache *cachep)
>> +{
>> +       return 0;
>> +}
>> +
>>  #endif /* __LINUX_SLOB_DEF_H */
>> diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
>> index a32bcfd..5911d81 100644
>> --- a/include/linux/slub_def.h
>> +++ b/include/linux/slub_def.h
>> @@ -313,4 +313,10 @@ static __always_inline void *kmalloc_node(size_t
>> size, gfp_t flags, int node)
>>  }
>>  #endif
>>
>> +static inline gfp_t
>> +kmem_cache_gfp_flags(struct kmem_cache *cachep)
>> +{
>> +       return cachep->allocflags;
>> +}
>> +
>
>
> Why is this needed? Can't the caller just call
> mem_cgroup_get_kmem_cache(cachep, flags | cachep->allocflags) ?

Because slub calls this cachep->allocflags, while slab calls it
cachep->gfpflags.

I'll look into renaming one of them to match the other.

-- Suleiman

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 02/13] memcg: Kernel memory accounting infrastructure.
  2012-03-13 10:37       ` Glauber Costa
  2012-03-13 17:00         ` Greg Thelen
@ 2012-03-14  0:15         ` KAMEZAWA Hiroyuki
  2012-03-14 12:29           ` Glauber Costa
  1 sibling, 1 reply; 45+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-03-14  0:15 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Suleiman Souhlal, cgroups, suleiman, penberg, cl, yinghan, hughd,
	gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel

On Tue, 13 Mar 2012 14:37:30 +0400
Glauber Costa <glommer@parallels.com> wrote:

> > After looking codes, I think we need to think
> > whether independent_kmem_limit is good or not....
> >
> > How about adding MEMCG_KMEM_ACCOUNT flag instead of this and use only
> > memcg->res/memcg->memsw rather than adding a new counter, memcg->kmem ?
> >
> > if MEMCG_KMEM_ACCOUNT is set     -> slab is accounted to mem->res/memsw.
> > if MEMCG_KMEM_ACCOUNT is not set ->  slab is never accounted.
> >
> > (I think On/Off switch is required..)
> >
> > Thanks,
> > -Kame
> >
> 
> This has been discussed before, I can probably find it in the archives 
> if you want to go back and see it.
> 

Yes. IIUC, we agreed to have an independent kmem limit. I just want to think
about it again because there are so many proposals that I'm getting confused.

As far as I can see, the ongoing works are:
 - kmem limit (by 2 guys).
 - hugetlb limit
 - per-LRU locking (by 2 guys)
 - page cgroup diet (by me, but stopped for now)
 - dirty-ratio and writeback
 - Tejun's proposal to remove pre_destroy()
 - moving of shared resources

I'm thinking about what a simple plan and implementation would look like.
Most of the series consist of 10+ patches...

Thank you for helping to clarify things.



> But in a nutshell:
> 
> 1) Supposing independent knob disappear (I will explain in item 2 why I 
> > don't want it to), I don't think a flag makes sense either. *If* we are
> planning to enable/disable this, it might make more sense to put some 
> work on it, and allow particular slabs to be enabled/disabled by writing 
> to memory.kmem.slabinfo (-* would disable all, +* enable all, +kmalloc* 
> enable all kmalloc, etc).
> 
seems interesting.

> Alternatively, what we could do instead, is something similar to what 
> ended up being done for tcp, by request of the network people: if you 
> never touch the limit file, don't bother with it at all, and simply does 
> not account. With Suleiman's lazy allocation infrastructure, that should 
> actually be trivial. And then again, a flag is not necessary, because 
> writing to the limit file does the job, and also convey the meaning well 
> enough.
> 

Hm.

> 2) For the kernel itself, we are mostly concerned that a malicious 
> container may pin into memory big amounts of kernel memory which is, 
> ultimately, unreclaimable. 

Yes. This is a big problem both for memcg and for the whole system.

In my experience, 2000 processes sharing 10GB of shared memory can eat up
a lot of memory ;(



> In particular, with overcommit allowed 
> scenarios, you can fill the whole physical memory (or at least a 
> significant part) with those objects, well beyond your softlimit 
> allowance, making the creation of further containers impossible.
> With user memory, you can reclaim the cgroup back to its place. With 
> kernel memory, you can't.
> 
Agreed.

> In the particular example of 32-bit boxes, you can easily fill up a 
> large part of the available 1gb kernel memory with pinned memory and 
> render the whole system unresponsive.
> 
> Never allowing the kernel memory to go beyond the soft limit was one of 
> the proposed alternatives. However, it may force you to establish a soft
> limit where one was not previously needed. Or, establish a low soft 
> limit when you really need a bigger one.
> 
> All that said, while reading your message, thinking a bit, the following 
> crossed my mind:
> 
> - We can account the slabs to memcg->res normally, and just store the
>    information that this is kernel memory into a percpu counter, as
>    I proposed recently.

Ok, then user can see the amount of kernel memory.


> - The knob goes away, and becomes implicit: if you ever write anything
>    to memory.kmem.limit_in_bytes, we transfer that memory to a separate
>    kmem res_counter, and proceed from there. We can keep accounting to
>    memcg->res anyway, just that kernel memory will now have a separate
>    limit.

Okay, then,

	kmem_limit < memory.limit < memsw.limit

...seems reasonable to me.
This means, user can specify 'ratio' of kmem in memory.limit.

More consideration will be interesting.

 - We can show the amount of reclaimable kmem by some means ?
 - What happens when a new cgroup created ?
 - Should we have 'ratio' interface in kernel level ?
 - What happens at task moving ?
 - Should we allow per-slab accounting knob in /sys/kernel/slab/xxx ?
   or somewhere ?
 - Should we show per-memcg usage in /sys/kernel/slab/xxx ?
 - Should we have force_empty for kmem (as last resort) ?

With any implementation, my concern is
 - overhead/performance.
 - unreclaimable kmem
 - shared kmem between cgroups.


> - With this scheme, it may not be necessary to ever have a file
>    memory.kmem.soft_limit_in_bytes. Reclaim is always part of the normal
>    memcg reclaim.
> 
Good.

> The outlined above would work for us, and make the whole scheme simpler, 
> I believe.
> 
> What do you think ?

It sounds interesting to me.

Thanks,
-Kame












^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 07/13] memcg: Slab accounting.
  2012-03-13 22:50     ` Suleiman Souhlal
@ 2012-03-14 10:47       ` Glauber Costa
  2012-03-14 22:04         ` Suleiman Souhlal
  0 siblings, 1 reply; 45+ messages in thread
From: Glauber Costa @ 2012-03-14 10:47 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: Suleiman Souhlal, cgroups, kamezawa.hiroyu, cl, yinghan, hughd,
	gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel, Pekka Enberg

On 03/14/2012 02:50 AM, Suleiman Souhlal wrote:
> On Sun, Mar 11, 2012 at 3:25 AM, Glauber Costa<glommer@parallels.com>  wrote:
>> On 03/10/2012 12:39 AM, Suleiman Souhlal wrote:
>>> +static inline void
>>> +mem_cgroup_kmem_cache_prepare_sleep(struct kmem_cache *cachep)
>>> +{
>>> +       /*
>>> +        * Make sure the cache doesn't get freed while we have interrupts
>>> +        * enabled.
>>> +        */
>>> +       kmem_cache_get_ref(cachep);
>>> +       rcu_read_unlock();
>>> +}
>>
>>
>> Is this really needed ? After this function call in slab.c, the slab code
>> itself accesses cachep a thousand times. If it could be freed, it would
>> already explode today for other reasons?
>> Am I missing something here?
>
> We need this because once we drop the rcu_read_lock and go to sleep,
> the memcg could get deleted, which could lead to the cachep from
> getting deleted as well.
>
> So, we need to grab a reference to the cache, to make sure that the
> cache doesn't disappear from under us.

Don't we grab a memcg reference when we fire the cache creation?
(I did that for slub, can't really recall from the top of my head if
you are doing it as well)

That would prevent the memcg from going away, while relieving us from the
need to take a temporary reference for every page while sleeping.

>>> diff --git a/init/Kconfig b/init/Kconfig
>>> index 3f42cd6..e7eb652 100644
>>> --- a/init/Kconfig
>>> +++ b/init/Kconfig
>>> @@ -705,7 +705,7 @@ config CGROUP_MEM_RES_CTLR_SWAP_ENABLED
>>>           then swapaccount=0 does the trick).
>>>   config CGROUP_MEM_RES_CTLR_KMEM
>>>         bool "Memory Resource Controller Kernel Memory accounting
>>> (EXPERIMENTAL)"
>>> -       depends on CGROUP_MEM_RES_CTLR && EXPERIMENTAL
>>> +       depends on CGROUP_MEM_RES_CTLR && EXPERIMENTAL && !SLOB
>>
>> Orthogonal question: Will we ever want this (SLOB) ?
>
> I honestly don't know why someone would want to use this and slob at
> the same time.
> It really doesn't seem like a required feature, in my opinion.
> Especially at first.

Agree. It was more a question to see if anyone would speak up for it.
But certainly not me.

>>> +static struct kmem_cache *
>>> +memcg_create_kmem_cache(struct mem_cgroup *memcg, struct kmem_cache
>>> *cachep)
>>> +{
>>> +       struct kmem_cache *new_cachep;
>>> +       struct dentry *dentry;
>>> +       char *name;
>>> +       int idx;
>>> +
>>> +       idx = cachep->memcg_params.id;
>>> +
>>> +       dentry = memcg->css.cgroup->dentry;
>>> +       BUG_ON(dentry == NULL);
>>> +
>>> +       /* Preallocate the space for "dead" at the end */
>>> +       name = kasprintf(GFP_KERNEL, "%s(%d:%s)dead",
>>> +           cachep->name, css_id(&memcg->css), dentry->d_name.name);
>>> +       if (name == NULL)
>>> +               return cachep;
>>> +       /* Remove "dead" */
>>> +       name[strlen(name) - 4] = '\0';
>>> +
>>> +       new_cachep = kmem_cache_create_memcg(cachep, name);
>>> +
>>> +       /*
>>> +        * Another CPU is creating the same cache?
>>> +        * We'll use it next time.
>>> +        */
>>
>> This comment is a bit misleading. Is it really the only reason
>> it can fail?
>>
>> The impression I got is that it can also fail under the normal conditions in
>> which kmem_cache_create() fails.
>
> kmem_cache_create() isn't expected to fail often.
> I wasn't making an exhaustive lists of why this condition can happen,
> just what I think is the most common one is.

Keep in mind that our notion of "fail often" may start to change when
we start limiting the amount of kernel memory =p.

Especially in nested cgroups limited by their parent.

So apart from the comment issue, the problem here to me seems to be that:

yes, kmem_cache_create failing is rare. But the circumstances in which 
it can happen all involve memory pressure. And in this case, we'll leave 
memcg->slabs[idx] as NULL, which means we'll keep trying to create the 
cache in further allocations.

This seems at best a tricky way to escape the memcg constraint...

I am not sure this is the behavior we want. Have to think a little bit.

>
>>> +/*
>>> + * Enqueue the creation of a per-memcg kmem_cache.
>>> + * Called with rcu_read_lock.
>>> + */
>>> +static void
>>> +memcg_create_cache_enqueue(struct mem_cgroup *memcg, struct kmem_cache
>>> *cachep)
>>> +{
>>> +       struct create_work *cw;
>>> +       unsigned long flags;
>>> +
>>> +       spin_lock_irqsave(&create_queue_lock, flags);
>>
>> If we can sleep, why not just create the cache now?
>>
>> Maybe it would be better to split this in two, and create the cache if
>> possible, and a worker if not possible. Then w
>
> That's how I had it in my initial patch, but I was under the
> impression that you preferred if we always kicked off the creation to
> the workqueue?
>
> Which way do you prefer?

Sorry if I misled you. What I remember mentioning was that it might be
better to create some of the caches right away, instead of putting
them into the workqueue at all. So earlier, rather than later.

That said, how I view this particular issue has changed quite a bit over the
past days, due to our discussions. In particular, see the last mail I wrote
to Kame in reply to your patchset. I think that queuing things up
in the workqueue may turn out to be quite handy in the end.

But in the interest of having fewer objects escape memcg, how about we
call cond_resched() when we can sleep, after we kick off the worker?

This way we don't need to deal with conditionals for "can sleep" vs "can't
sleep" (simpler code), while still having the cache created right away when
it can be.
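
A minimal sketch of what I mean (illustrative; memcg_create_cache_enqueue()
is from your patch, while the wrapper name and the slabs[] lookup are made up
here, and the RCU locking is elided):

static struct kmem_cache *
memcg_get_or_queue_cache(struct mem_cgroup *memcg, struct kmem_cache *cachep,
    gfp_t gfp)
{
	/* always defer the actual creation to the workqueue */
	memcg_create_cache_enqueue(memcg, cachep);

	/* if we may sleep, give the worker a chance to run right away */
	if (gfp & __GFP_WAIT)
		cond_resched();

	/* use the per-memcg cache if the worker already created it */
	if (memcg->slabs[cachep->memcg_params.id] != NULL)
		return memcg->slabs[cachep->memcg_params.id];
	return cachep;
}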


>>> @@ -1756,17 +1765,23 @@ static void *kmem_getpages(struct kmem_cache
>>> *cachep, gfp_t flags, int nodeid)
>>>         if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
>>>                 flags |= __GFP_RECLAIMABLE;
>>>
>>> +       nr_pages = (1 << cachep->gfporder);
>>> +       if (!mem_cgroup_charge_slab(cachep, flags, nr_pages * PAGE_SIZE))
>>> +               return NULL;
>>> +
>>>         page = alloc_pages_exact_node(nodeid, flags | __GFP_NOTRACK,
>>> cachep->gfporder);
>>> -       if (!page)
>>> +       if (!page) {
>>> +               mem_cgroup_uncharge_slab(cachep, nr_pages * PAGE_SIZE);
>>>                 return NULL;
>>> +       }
>>
>>
>>
>> Can't the following happen:
>>
>>   *) mem_cgroup_charge_slab() is the first one to touch the slab.
>>     Therefore, this first one is billed to root.
>>   *) A slab is queued for creation.
>>   *) alloc_pages sleep.
>>   *) our workers run, and create the cache, therefore filling
>>     cachep->memcg_param.memcg
>>   *) alloc_pages still can't allocate.
>>   *) uncharge tries to uncharge from cachep->memcg_param.memcg,
>>     which doesn't have any charges...
>>
>> Unless you have a strong oposition to this, to avoid this kind of
>> corner cases, we could do what I was doing in the slub:
>> Allocate the page first, and then account it.
>> (freeing the page if it fails).
>>
>> I know it is not the way it is done for the user pages, but I believe it to
>> be better suited for the slab.
>
> I don't think the situation you're describing can happen, because the
> memcg caches get created and selected at the beginning of the slab
> allocation, in mem_cgroup_get_kmem_cache() and not in
> mem_cgroup_charge_slab(), which is much later.
>
> Once we are in mem_cgroup_charge_slab() we know that the allocation
> will be charged to the cgroup.

That's not how I read it. Since there are no completion guarantees coming
from the workqueue, I really don't see how we can be sure that the data
in cachep->memcg_param.memcg won't change.

You are right that touching the slab actually happens in
mem_cgroup_get_kmem_cache(). That is called in kmem_cache_alloc(). And
the first object is likely to be billed to the parent cgroup (or root).

Now imagine that cache being full, so we need a new page for it.
This will quickly lead us to cache_grow(), and all the other steps are 
therefore the same.

So how can we guarantee that the memcg pointer is stable between alloc 
and free?

>>> @@ -2269,10 +2288,12 @@ kmem_cache_create (const char *name, size_t size,
>>> size_t align,
>>>                 }
>>>
>>>                 if (!strcmp(pc->name, name)) {
>>> -                       printk(KERN_ERR
>>> -                              "kmem_cache_create: duplicate cache %s\n",
>>> name);
>>> -                       dump_stack();
>>> -                       goto oops;
>>> +                       if (!memcg) {
>>> +                               printk(KERN_ERR "kmem_cache_create:
>>> duplicate"
>>> +                                   " cache %s\n", name);
>>> +                               dump_stack();
>>> +                               goto oops;
>>> +                       }
>>
>> Why? Since we are apending the memcg name at the end anyway, duplicates
>> still aren't expected.
>
> Duplicates can happen if you have hierarchies, because we're only
> appending the basename of the cgroup.
No, we're appending the css id now too, precisely to cope with the 
duplicates problem.

>
>>> @@ -2703,12 +2787,74 @@ void kmem_cache_destroy(struct kmem_cache *cachep)
>>>         if (unlikely(cachep->flags & SLAB_DESTROY_BY_RCU))
>>>                 rcu_barrier();
>>>
>>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>>> +       /* Not a memcg cache */
>>> +       if (cachep->memcg_params.id != -1) {
>>> +               __clear_bit(cachep->memcg_params.id, cache_types);
>>> +               mem_cgroup_flush_cache_create_queue();
>>> +       }
>>> +#endif
>>
>>
>> This will clear the id when a leaf cache is destroyed. It seems it is not
>> what we want, right? We want this id to be cleared only when
>> the parent cache is gone.
>
> id != -1, for parent caches (that's what the comment is trying to point out).
> I will improve the comment.

/me goes to check all the code again...

Does that mean that when two memcgs are creating the same cache they
will end up with different ids??


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 06/13] slab: Add kmem_cache_gfp_flags() helper function.
  2012-03-13 23:21     ` Suleiman Souhlal
@ 2012-03-14 11:48       ` Glauber Costa
  2012-03-14 22:08         ` Suleiman Souhlal
  0 siblings, 1 reply; 45+ messages in thread
From: Glauber Costa @ 2012-03-14 11:48 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: Suleiman Souhlal, cgroups, kamezawa.hiroyu, penberg, cl, yinghan,
	hughd, gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel

On 03/14/2012 03:21 AM, Suleiman Souhlal wrote:
> On Sun, Mar 11, 2012 at 3:53 AM, Glauber Costa<glommer@parallels.com>  wrote:
>> On 03/10/2012 12:39 AM, Suleiman Souhlal wrote:
>>>
>>> This function returns the gfp flags that are always applied to
>>> allocations of a kmem_cache.
>>>
>>> Signed-off-by: Suleiman Souhlal<suleiman@google.com>
>>> ---
>>>   include/linux/slab_def.h |    6 ++++++
>>>   include/linux/slob_def.h |    6 ++++++
>>>   include/linux/slub_def.h |    6 ++++++
>>>   3 files changed, 18 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
>>> index fbd1117..25f9a6a 100644
>>> --- a/include/linux/slab_def.h
>>> +++ b/include/linux/slab_def.h
>>> @@ -159,6 +159,12 @@ found:
>>>         return __kmalloc(size, flags);
>>>   }
>>>
>>> +static inline gfp_t
>>> +kmem_cache_gfp_flags(struct kmem_cache *cachep)
>>> +{
>>> +       return cachep->gfpflags;
>>> +}
>>> +
>>>   #ifdef CONFIG_NUMA
>>>   extern void *__kmalloc_node(size_t size, gfp_t flags, int node);
>>>   extern void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int
>>> node);
>>> diff --git a/include/linux/slob_def.h b/include/linux/slob_def.h
>>> index 0ec00b3..3fa527d 100644
>>> --- a/include/linux/slob_def.h
>>> +++ b/include/linux/slob_def.h
>>> @@ -34,4 +34,10 @@ static __always_inline void *__kmalloc(size_t size,
>>> gfp_t flags)
>>>         return kmalloc(size, flags);
>>>   }
>>>
>>> +static inline gfp_t
>>> +kmem_cache_gfp_flags(struct kmem_cache *cachep)
>>> +{
>>> +       return 0;
>>> +}
>>> +
>>>   #endif /* __LINUX_SLOB_DEF_H */
>>> diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
>>> index a32bcfd..5911d81 100644
>>> --- a/include/linux/slub_def.h
>>> +++ b/include/linux/slub_def.h
>>> @@ -313,4 +313,10 @@ static __always_inline void *kmalloc_node(size_t
>>> size, gfp_t flags, int node)
>>>   }
>>>   #endif
>>>
>>> +static inline gfp_t
>>> +kmem_cache_gfp_flags(struct kmem_cache *cachep)
>>> +{
>>> +       return cachep->allocflags;
>>> +}
>>> +
>>
>>
>> Why is this needed? Can't the caller just call
>> mem_cgroup_get_kmem_cache(cachep, flags | cachep->allocflags) ?
>
> Because slub calls this cachep->allocflags, while slab calls it
> cachep->gfpflags.
>

So what?
That function is only called from slab.c anyway. Let slab call it
mem_cgroup_get_kmem_cache(cachep, flags | cachep->gfpflags);
and slub
mem_cgroup_get_kmem_cache(cachep, flags | cachep->allocflags);

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 03/13] memcg: Uncharge all kmem when deleting a cgroup.
  2012-03-13 23:16     ` Suleiman Souhlal
@ 2012-03-14 11:59       ` Glauber Costa
  0 siblings, 0 replies; 45+ messages in thread
From: Glauber Costa @ 2012-03-14 11:59 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: Suleiman Souhlal, cgroups, kamezawa.hiroyu, penberg, cl, yinghan,
	hughd, gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel


>>> @@ -3719,6 +3721,8 @@ move_account:
>>>                 /* This is for making all *used* pages to be on LRU. */
>>>                 lru_add_drain_all();
>>>                 drain_all_stock_sync(memcg);
>>> +               if (!free_all)
>>> +                       memcg_kmem_move(memcg);
>>
>> Any reason we're not moving kmem charges when free_all is set as well?
>
> Because the slab moving code expects to be synchronized with
> allocations (and itself). We can't call it when there are still tasks
> in the cgroup.

Ok.

Please add an explanation about that.
Oh boy, reading it all now, I started to believe that "free_all" is a 
really poor name =(



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 02/13] memcg: Kernel memory accounting infrastructure.
  2012-03-14  0:15         ` KAMEZAWA Hiroyuki
@ 2012-03-14 12:29           ` Glauber Costa
  2012-03-15  0:48             ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 45+ messages in thread
From: Glauber Costa @ 2012-03-14 12:29 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Suleiman Souhlal, cgroups, suleiman, penberg, cl, yinghan, hughd,
	gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel


>> This has been discussed before, I can probably find it in the archives
>> if you want to go back and see it.
>>
>
> Yes. IIUC, we agreed to have an independent kmem limit. I just want to think it
> again because there are too many proposals and it seems I'm in confusion.
>

Sure thing. The discussion turned out good, so I'm glad you asked =)

>
>> But in a nutshell:
>>
>> 1) Supposing independent knob disappear (I will explain in item 2 why I
>> don't want it to), I don't think a flag makes sense either. *If* we are
>> planning to enable/disable this, it might make more sense to put some
>> work on it, and allow particular slabs to be enabled/disabled by writing
>> to memory.kmem.slabinfo (-* would disable all, +* enable all, +kmalloc*
>> enable all kmalloc, etc).
>>
> seems interesting.
I'll try to cook a PoC.

>> All that said, while reading your message, thinking a bit, the following
>> crossed my mind:
>>
>> - We can account the slabs to memcg->res normally, and just store the
>>     information that this is kernel memory into a percpu counter, as
>>     I proposed recently.
>
> Ok, then user can see the amount of kernel memory.
>
>
>> - The knob goes away, and becomes implicit: if you ever write anything
>>     to memory.kmem.limit_in_bytes, we transfer that memory to a separate
>>     kmem res_counter, and proceed from there. We can keep accounting to
>>     memcg->res anyway, just that kernel memory will now have a separate
>>     limit.
>
> Okay, then,
>
> 	kmem_limit<  memory.limit<  memsw.limit
>
> ...seems reasonable to me.
> This means, user can specify 'ratio' of kmem in memory.limit.
Yes, I believe so. It is a big improvement over the current interface
we have today, IMHO.

>
> More consideration will be interesting.
>
>   - We can show the amount of reclaimable kmem by some means ?
That's hard to do. The users of the cache have this information; the
underlying slab/slub/slob code does not. We need to rely on the cache
owner to provide this, and to provide it correctly. So the chances that
we'll have incorrect information here grow by quite a bit.

>   - What happens when a new cgroup created ?

mem_cgroup_create() is called =)
Heh, joking aside, I don't really follow here. What exactly do you mean?
There shouldn't be anything extremely out of the ordinary.

>   - Should we have 'ratio' interface in kernel level ?
I personally don't like a ratio interface. I believe specifying "kmem 
should never be allowed to go over X bytes" is more than enough.

>   - What happens at task moving ?

From the kmem PoV, nothing. It is ultimately impossible to track a slab
page back to a task. The page contains objects that were allocated from
multiple tasks. Only when the whole cgroup is destroyed do we take any
action.

>   - Should we allow per-slab accounting knob in /sys/kernel/slab/xxx ?
>     or somewhere ?

I don't really follow.

>   - Should we show per-memcg usage in /sys/kernel/slab/xxx ?
I guess so.

>   - Should we have force_empty for kmem (as last resort) ?
We do that when the cgroup is going away. As a user-triggered action, I suspect
the best we can do is call the shrinkers and see if anything gets freed.

>
> With any implementation, my concern is
>   - overhead/performance.
Yes. For the next round, we need to add some more detailed benchmarks.
>   - unreclaimable kmem
That's actually the reason behind all that!

So if you have a 1 GB mem allowance, and you fill it with unreclaimable
kmem, you are in trouble, yes.

The point is, at least all your allocations will stop working, and something
will be done soon. Without kmem tracking, this can grow and grow,
outside the cgroup's borders.

>   - shared kmem between cgroups.

Right now both proposals ended up accounting to the first user. In
theory, that leaves a gap under which a smart cgroup can go pinning a lot
of kmem without owning it.

But I still believe this to be the best way forward.
It is hard to determine who the object's user is without cooperation from
the object caches. And even then, it is even harder to do so without
penalizing every single object allocation (right now we only penalize a
new page allocation, which is way better performance-wise).





>
>
>> - With this scheme, it may not be necessary to ever have a file
>>     memory.kmem.soft_limit_in_bytes. Reclaim is always part of the normal
>>     memcg reclaim.
>>
> Good.
>
>> The outlined above would work for us, and make the whole scheme simpler,
>> I believe.
>>
>> What do you think ?
>
> It sounds interesting to me.
>
> Thanks,
> -Kame


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 07/13] memcg: Slab accounting.
  2012-03-14 10:47       ` Glauber Costa
@ 2012-03-14 22:04         ` Suleiman Souhlal
  2012-03-15 11:40           ` Glauber Costa
  0 siblings, 1 reply; 45+ messages in thread
From: Suleiman Souhlal @ 2012-03-14 22:04 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Suleiman Souhlal, cgroups, kamezawa.hiroyu, cl, yinghan, hughd,
	gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel, Pekka Enberg

On Wed, Mar 14, 2012 at 3:47 AM, Glauber Costa <glommer@parallels.com> wrote:
> On 03/14/2012 02:50 AM, Suleiman Souhlal wrote:
>>
>> On Sun, Mar 11, 2012 at 3:25 AM, Glauber Costa<glommer@parallels.com>
>>  wrote:
>>>
>>> On 03/10/2012 12:39 AM, Suleiman Souhlal wrote:
>>>>
>>>> +static inline void
>>>> +mem_cgroup_kmem_cache_prepare_sleep(struct kmem_cache *cachep)
>>>> +{
>>>> +       /*
>>>> +        * Make sure the cache doesn't get freed while we have
>>>> interrupts
>>>> +        * enabled.
>>>> +        */
>>>> +       kmem_cache_get_ref(cachep);
>>>> +       rcu_read_unlock();
>>>> +}
>>>
>>>
>>>
>>> Is this really needed ? After this function call in slab.c, the slab code
>>> itself accesses cachep a thousand times. If it could be freed, it would
>>> already explode today for other reasons?
>>> Am I missing something here?
>>
>>
>> We need this because once we drop the rcu_read_lock and go to sleep,
>> the memcg could get deleted, which could lead to the cachep from
>> getting deleted as well.
>>
>> So, we need to grab a reference to the cache, to make sure that the
>> cache doesn't disappear from under us.
>
>
> Don't we grab a memcg reference when we fire the cache creation?
> (I did that for slub, can't really recall from the top of my head if
> you are doing it as well)
>
> That would prevent the memcg from going away, while relieving us of the
> need to take a temporary reference for every page while sleeping.

The problem isn't the memcg going away, but the cache going away.
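
To spell out what the reference protects, here is a usage sketch (only
prepare_sleep() is quoted above; the finish_sleep() name and the exact
call site in the page-allocation path are my assumptions):

	/*
	 * Pin the cache before dropping the RCU read lock for a sleeping
	 * allocation, so that a concurrent cgroup removal cannot free the
	 * cache underneath us.
	 */
	mem_cgroup_kmem_cache_prepare_sleep(cachep);	/* get ref + rcu_read_unlock() */
	page = alloc_pages_exact_node(nodeid, flags, cachep->gfporder);	/* may sleep */
	mem_cgroup_kmem_cache_finish_sleep(cachep);	/* assumed: rcu_read_lock() + drop ref */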

>>>> +static struct kmem_cache *
>>>> +memcg_create_kmem_cache(struct mem_cgroup *memcg, struct kmem_cache *cachep)
>>>> +{
>>>> +       struct kmem_cache *new_cachep;
>>>> +       struct dentry *dentry;
>>>> +       char *name;
>>>> +       int idx;
>>>> +
>>>> +       idx = cachep->memcg_params.id;
>>>> +
>>>> +       dentry = memcg->css.cgroup->dentry;
>>>> +       BUG_ON(dentry == NULL);
>>>> +
>>>> +       /* Preallocate the space for "dead" at the end */
>>>> +       name = kasprintf(GFP_KERNEL, "%s(%d:%s)dead",
>>>> +           cachep->name, css_id(&memcg->css), dentry->d_name.name);
>>>> +       if (name == NULL)
>>>> +               return cachep;
>>>> +       /* Remove "dead" */
>>>> +       name[strlen(name) - 4] = '\0';
>>>> +
>>>> +       new_cachep = kmem_cache_create_memcg(cachep, name);
>>>> +
>>>> +       /*
>>>> +        * Another CPU is creating the same cache?
>>>> +        * We'll use it next time.
>>>> +        */
>>>
>>>
>>> This comment is a bit misleading. Is it really the only reason
>>> it can fail?
>>>
>>> The impression I got is that it can also fail under the normal conditions
>>> in
>>> which kmem_cache_create() fails.
>>
>>
>> kmem_cache_create() isn't expected to fail often.
>> I wasn't making an exhaustive list of why this condition can happen,
>> just what I think the most common one is.
>
>
> Keep in mind that our notion of "fail often" may start to change when
> we start limiting the amount of kernel memory =p.
>
> Especially in nested cgroups limited by their parent.
>
> So apart from the comment issue, the problem here to me seems to be that:
>
> yes, kmem_cache_create failing is rare. But the circumstances in which it
> can happen all involve memory pressure. And in this case, we'll leave
> memcg->slabs[idx] as NULL, which means we'll keep trying to create the cache
> in further allocations.
>
> This seems at best a tricky way to escape the memcg constraint...
>
> I am not sure this is the behavior we want. Have to think a little bit.

Keep in mind that this function is only called in workqueue context.
(In the earlier revision of the patchset this function was called in
the process context, but kmem_cache_create() would ignore memory
limits, because of __GFP_NOACCOUNT.)
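
For context, this is roughly the shape of that deferral (a sketch; apart
from memcg_create_kmem_cache() itself, the struct and function names are
assumptions, not the exact code from the patch):

	struct create_work {
		struct mem_cgroup *memcg;
		struct kmem_cache *cachep;
		struct work_struct work;
	};

	static void memcg_create_cache_work_func(struct work_struct *w)
	{
		struct create_work *cw = container_of(w, struct create_work, work);

		/*
		 * Runs in worker context, so kmem_cache_create()'s own
		 * allocations are not charged to the cgroup that needed
		 * the new cache.
		 */
		memcg_create_kmem_cache(cw->memcg, cw->cachep);
		css_put(&cw->memcg->css);	/* assuming a css ref was taken at enqueue time */
		kfree(cw);
	}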

>>>> @@ -1756,17 +1765,23 @@ static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)
>>>>        if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
>>>>                flags |= __GFP_RECLAIMABLE;
>>>>
>>>> +       nr_pages = (1 << cachep->gfporder);
>>>> +       if (!mem_cgroup_charge_slab(cachep, flags, nr_pages * PAGE_SIZE))
>>>> +               return NULL;
>>>> +
>>>>        page = alloc_pages_exact_node(nodeid, flags | __GFP_NOTRACK, cachep->gfporder);
>>>> -       if (!page)
>>>> +       if (!page) {
>>>> +               mem_cgroup_uncharge_slab(cachep, nr_pages * PAGE_SIZE);
>>>>                return NULL;
>>>> +       }
>>>
>>>
>>>
>>>
>>> Can't the following happen:
>>>
>>>  *) mem_cgroup_charge_slab() is the first one to touch the slab.
>>>    Therefore, this first one is billed to root.
>>>  *) A slab is queued for creation.
>>>  *) alloc_pages sleep.
>>>  *) our workers run, and create the cache, therefore filling
>>>    cachep->memcg_param.memcg
>>>  *) alloc_pages still can't allocate.
>>>  *) uncharge tries to uncharge from cachep->memcg_param.memcg,
>>>    which doesn't have any charges...
>>>
>>> Unless you have a strong opposition to this, to avoid this kind of
>>> corner cases, we could do what I was doing in the slub:
>>> Allocate the page first, and then account it.
>>> (freeing the page if it fails).
>>>
>>> I know it is not the way it is done for the user pages, but I believe it
>>> to
>>> be better suited for the slab.
>>
>>
>> I don't think the situation you're describing can happen, because the
>> memcg caches get created and selected at the beginning of the slab
>> allocation, in mem_cgroup_get_kmem_cache() and not in
>> mem_cgroup_charge_slab(), which is much later.
>>
>> Once we are in mem_cgroup_charge_slab() we know that the allocation
>> will be charged to the cgroup.
>
>
> That's not how I read it. Since there are no completion guarantees coming
> from the workqueue, I really don't see how we can be sure that the data in
> cachep->memcg_param.memcg won't change.
>
> You are right that touching the slab actually happens in
> mem_cgroup_get_kmem_cache(). That is called in kmem_cache_alloc(). And the
> first object is likely to be billed to the parent cgroup (or root).
>
> Now imagine that cache being full, so we need a new page for it.
> This will quickly lead us to cache_grow(), and all the other steps are
> therefore the same.
>
> So how can we guarantee that the memcg pointer is stable between alloc and
> free?

When mem_cgroup_get_kmem_cache() returns a memcg cache, that cache has
already been created.

The memcg pointer is not stable between alloc and free: It can become
NULL when the cgroup gets deleted, at which point the accounting has
been "moved to root" (uncharged from the cgroup it was charged in).
When that has happened, we don't want to uncharge it again.
I think the current code already handles this situation.
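
A sketch of the kind of check I mean on the uncharge side (field and
function names follow the hunks quoted in this thread, but the body is
simplified and may differ from the actual patch):

	static void mem_cgroup_uncharge_slab(struct kmem_cache *cachep, size_t size)
	{
		struct mem_cgroup *memcg = cachep->memcg_params.memcg;

		/*
		 * NULL means the cgroup was deleted and its charges were
		 * already moved to root; there is nothing left to uncharge.
		 */
		if (memcg == NULL)
			return;

		res_counter_uncharge(&memcg->kmem, size);
	}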

>>>> @@ -2703,12 +2787,74 @@ void kmem_cache_destroy(struct kmem_cache *cachep)
>>>>        if (unlikely(cachep->flags & SLAB_DESTROY_BY_RCU))
>>>>                rcu_barrier();
>>>>
>>>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
>>>> +       /* Not a memcg cache */
>>>> +       if (cachep->memcg_params.id != -1) {
>>>> +               __clear_bit(cachep->memcg_params.id, cache_types);
>>>> +               mem_cgroup_flush_cache_create_queue();
>>>> +       }
>>>> +#endif
>>>
>>>
>>>
>>> This will clear the id when a leaf cache is destroyed. It seems it is not
>>> what we want, right? We want this id to be cleared only when
>>> the parent cache is gone.
>>
>>
>> The id is != -1 only for parent caches (that's what the comment is trying
>> to point out).
>> I will improve the comment.
>
>
> /me goes check all the code again...
>
> Does that mean that when two memcgs are creating the same cache they will
> end up with different ids??

No, only parent caches have an id that is not -1. memcg caches always
have an id of -1.
Sorry if that wasn't clear. I will try to document it better.
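
In code form, the invariant is roughly this (a sketch with a made-up
helper name, not a quote from the patch):

	/*
	 * Only caches registered through kmem_cache_create() get a slot in
	 * cache_types; their per-memcg clones keep id == -1.
	 */
	static inline bool memcg_is_parent_cache(struct kmem_cache *cachep)
	{
		return cachep->memcg_params.id != -1;
	}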

-- Suleiman

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 06/13] slab: Add kmem_cache_gfp_flags() helper function.
  2012-03-14 11:48       ` Glauber Costa
@ 2012-03-14 22:08         ` Suleiman Souhlal
  0 siblings, 0 replies; 45+ messages in thread
From: Suleiman Souhlal @ 2012-03-14 22:08 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Suleiman Souhlal, cgroups, kamezawa.hiroyu, penberg, cl, yinghan,
	hughd, gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel

On Wed, Mar 14, 2012 at 4:48 AM, Glauber Costa <glommer@parallels.com> wrote:
> So what?
> That function is only called from slab.c anyway. Let slab call it
> mem_cgroup_get_kmem_cache(cachep, flags | cachep->allocflags);
> and slub
> mem_cgroup_get_kmem_cache(cachep, flags | cachep->gfpflags);

OK, I will do that.
I felt it was better to do it in a single place instead of at every
caller.
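
For the record, the two call sites would then look like this (directly
following your suggestion above; the per-allocator flag fields are as in
your message):

	/* mm/slab.c */
	cachep = mem_cgroup_get_kmem_cache(cachep, flags | cachep->allocflags);

	/* mm/slub.c */
	cachep = mem_cgroup_get_kmem_cache(cachep, flags | cachep->gfpflags);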

-- Suleiman

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 02/13] memcg: Kernel memory accounting infrastructure.
  2012-03-14 12:29           ` Glauber Costa
@ 2012-03-15  0:48             ` KAMEZAWA Hiroyuki
  2012-03-15 11:07               ` Glauber Costa
  0 siblings, 1 reply; 45+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-03-15  0:48 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Suleiman Souhlal, cgroups, suleiman, penberg, cl, yinghan, hughd,
	gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel

(2012/03/14 21:29), Glauber Costa wrote:


>   - What happens when a new cgroup is created?
> 
> mem_cgroup_create() is called =)
> Heh, jokes aside, I don't really follow here. What exactly do you mean?
> There shouldn't be anything extremely out of the ordinary.
> 


Sorry, I was too terse.

Assume a cgroup with
	cgroup.memory.limit_in_bytes=1G
	cgroup.memory.kmem.limit_in_bytes=400M

When a child cgroup is created, what should the default values be?
'Unlimited', as in the current implementation?
Hmm.. maybe yes.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 02/13] memcg: Kernel memory accounting infrastructure.
  2012-03-15  0:48             ` KAMEZAWA Hiroyuki
@ 2012-03-15 11:07               ` Glauber Costa
  2012-03-15 11:13                 ` Peter Zijlstra
  0 siblings, 1 reply; 45+ messages in thread
From: Glauber Costa @ 2012-03-15 11:07 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Suleiman Souhlal, cgroups, suleiman, penberg, cl, yinghan, hughd,
	gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel

On 03/15/2012 04:48 AM, KAMEZAWA Hiroyuki wrote:
>>>    - What happens when a new cgroup is created?
>> >
>> >  mem_cgroup_create() is called =)
>> >  Heh, jokes aside, I don't really follow here. What exactly do you mean?
>> >  There shouldn't be anything extremely out of the ordinary.
>> >
>
> Sorry, I was too terse.
>
> Assume a cgroup with
> 	cgroup.memory.limit_in_bytes=1G
> 	cgroup.memory.kmem.limit_in_bytes=400M
>
> When a child cgroup is created, what should the default values be?
> 'Unlimited', as in the current implementation?
> Hmm.. maybe yes.

I think so, yes. I see no reason to come up with any default values
in memcg. Yes, your allocations can fail due to your parent limits.
But since I never heard of any machine with
9223372036854775807 bytes of memory, that is true even for the root memcg =)
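
For what it's worth, that is also what falls out of the counter setup at
cgroup creation (a sketch assuming the res_counter infrastructure and a
kmem counter as in this series; the parent linkage is the usual memcg
pattern):

	/* In mem_cgroup_create(): both counters start at RESOURCE_MAX,
	 * i.e. "unlimited"; parent limits still apply hierarchically. */
	res_counter_init(&memcg->res, parent ? &parent->res : NULL);
	res_counter_init(&memcg->kmem, parent ? &parent->kmem : NULL);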


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 02/13] memcg: Kernel memory accounting infrastructure.
  2012-03-15 11:07               ` Glauber Costa
@ 2012-03-15 11:13                 ` Peter Zijlstra
  2012-03-15 11:21                   ` Glauber Costa
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2012-03-15 11:13 UTC (permalink / raw)
  To: Glauber Costa
  Cc: KAMEZAWA Hiroyuki, Suleiman Souhlal, cgroups, suleiman, penberg,
	cl, yinghan, hughd, gthelen, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel

On Thu, 2012-03-15 at 15:07 +0400, Glauber Costa wrote:
> But since I never heard of any machine with
> 9223372036854775807 bytes of memory, that is true even for the root memcg 

What, you don't have more than 8 exabytes of memory in your laptop!?
Surely you're due for an upgrade then.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 02/13] memcg: Kernel memory accounting infrastructure.
  2012-03-15 11:13                 ` Peter Zijlstra
@ 2012-03-15 11:21                   ` Glauber Costa
  0 siblings, 0 replies; 45+ messages in thread
From: Glauber Costa @ 2012-03-15 11:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: KAMEZAWA Hiroyuki, Suleiman Souhlal, cgroups, suleiman, penberg,
	cl, yinghan, hughd, gthelen, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel

On 03/15/2012 03:13 PM, Peter Zijlstra wrote:
> On Thu, 2012-03-15 at 15:07 +0400, Glauber Costa wrote:
>> But since I never heard of any machine with
>> 9223372036854775807 bytes of memory, that is true even for the root memcg
>
> What, you don't have more than 8 exabyte of memory in your laptop !?
> Surely you're due for an upgrade then.

Yeah, I requested it already, but I was told it could take a while.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 07/13] memcg: Slab accounting.
  2012-03-14 22:04         ` Suleiman Souhlal
@ 2012-03-15 11:40           ` Glauber Costa
  0 siblings, 0 replies; 45+ messages in thread
From: Glauber Costa @ 2012-03-15 11:40 UTC (permalink / raw)
  To: Suleiman Souhlal
  Cc: Suleiman Souhlal, cgroups, kamezawa.hiroyu, cl, yinghan, hughd,
	gthelen, peterz, dan.magenheimer, hannes, mgorman,
	James.Bottomley, linux-mm, devel, linux-kernel, Pekka Enberg

On 03/15/2012 02:04 AM, Suleiman Souhlal wrote:
> On Wed, Mar 14, 2012 at 3:47 AM, Glauber Costa <glommer@parallels.com> wrote:
>> On 03/14/2012 02:50 AM, Suleiman Souhlal wrote:
>>>
>>> On Sun, Mar 11, 2012 at 3:25 AM, Glauber Costa <glommer@parallels.com>
>>>   wrote:
>>>>
>>>> On 03/10/2012 12:39 AM, Suleiman Souhlal wrote:
>>>>>
>>>>> +static inline void
>>>>> +mem_cgroup_kmem_cache_prepare_sleep(struct kmem_cache *cachep)
>>>>> +{
>>>>> +       /*
>>>>> +        * Make sure the cache doesn't get freed while we have
>>>>> interrupts
>>>>> +        * enabled.
>>>>> +        */
>>>>> +       kmem_cache_get_ref(cachep);
>>>>> +       rcu_read_unlock();
>>>>> +}
>>>>
>>>>
>>>>
>>>> Is this really needed ? After this function call in slab.c, the slab code
>>>> itself accesses cachep a thousand times. If it could be freed, it would
>>>> already explode today for other reasons?
>>>> Am I missing something here?
>>>
>>>
>>> We need this because once we drop the rcu_read_lock and go to sleep,
>>> the memcg could get deleted, which could lead to the cachep getting
>>> deleted as well.
>>>
>>> So, we need to grab a reference to the cache, to make sure that the
>>> cache doesn't disappear from under us.
>>
>>
>> Don't we grab a memcg reference when we fire the cache creation?
>> (I did that for slub, can't really recall from the top of my head if
>> you are doing it as well)
>>
>> That would prevent the memcg from going away, while relieving us of the
>> need to take a temporary reference for every page while sleeping.
>
> The problem isn't the memcg going away, but the cache going away.
>
I see the problem.

I still think there are ways to avoid taking a reference for every page,
but it might not be worth the complication...

> Keep in mind that this function is only called in workqueue context.
> (In the earlier revision of the patchset this function was called in
> the process context, but kmem_cache_create() would ignore memory
> limits, because of __GFP_NOACCOUNT.)

ok, fair.

>
> When mem_cgroup_get_kmem_cache() returns a memcg cache, that cache has
> already been created.
 >
> The memcg pointer is not stable between alloc and free: It can become
> NULL when the cgroup gets deleted, at which point the accounting has
> been "moved to root" (uncharged from the cgroup it was charged in).
> When that has happened, we don't want to uncharge it again.
> I think the current code already handles this situation.
>

Okay, convinced.

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2012-03-15 11:41 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-03-09 20:39 [PATCH v2 00/13] Memcg Kernel Memory Tracking Suleiman Souhlal
2012-03-09 20:39 ` [PATCH v2 01/13] memcg: Consolidate various flags into a single flags field Suleiman Souhlal
2012-03-11  7:50   ` Glauber Costa
2012-03-09 20:39 ` [PATCH v2 02/13] memcg: Kernel memory accounting infrastructure Suleiman Souhlal
2012-03-11  8:12   ` Glauber Costa
2012-03-13  6:24     ` KAMEZAWA Hiroyuki
2012-03-13 10:37       ` Glauber Costa
2012-03-13 17:00         ` Greg Thelen
2012-03-13 17:31           ` Glauber Costa
2012-03-14  0:15         ` KAMEZAWA Hiroyuki
2012-03-14 12:29           ` Glauber Costa
2012-03-15  0:48             ` KAMEZAWA Hiroyuki
2012-03-15 11:07               ` Glauber Costa
2012-03-15 11:13                 ` Peter Zijlstra
2012-03-15 11:21                   ` Glauber Costa
2012-03-12 12:38   ` Glauber Costa
2012-03-09 20:39 ` [PATCH v2 03/13] memcg: Uncharge all kmem when deleting a cgroup Suleiman Souhlal
2012-03-11  8:19   ` Glauber Costa
2012-03-13 23:16     ` Suleiman Souhlal
2012-03-14 11:59       ` Glauber Costa
2012-03-13  6:27   ` KAMEZAWA Hiroyuki
2012-03-09 20:39 ` [PATCH v2 04/13] memcg: Make it possible to use the stock for more than one page Suleiman Souhlal
2012-03-11 10:49   ` Glauber Costa
2012-03-09 20:39 ` [PATCH v2 05/13] memcg: Reclaim when more than one page needed Suleiman Souhlal
2012-03-09 20:39 ` [PATCH v2 06/13] slab: Add kmem_cache_gfp_flags() helper function Suleiman Souhlal
2012-03-11 10:53   ` Glauber Costa
2012-03-13 23:21     ` Suleiman Souhlal
2012-03-14 11:48       ` Glauber Costa
2012-03-14 22:08         ` Suleiman Souhlal
2012-03-09 20:39 ` [PATCH v2 07/13] memcg: Slab accounting Suleiman Souhlal
2012-03-11 10:25   ` Glauber Costa
2012-03-13 22:50     ` Suleiman Souhlal
2012-03-14 10:47       ` Glauber Costa
2012-03-14 22:04         ` Suleiman Souhlal
2012-03-15 11:40           ` Glauber Costa
2012-03-09 20:39 ` [PATCH v2 08/13] memcg: Make dentry slab memory accounted in kernel memory accounting Suleiman Souhlal
2012-03-09 20:39 ` [PATCH v2 09/13] memcg: Account for kmalloc " Suleiman Souhlal
2012-03-11 12:21   ` Glauber Costa
2012-03-09 20:39 ` [PATCH v2 10/13] memcg: Track all the memcg children of a kmem_cache Suleiman Souhlal
2012-03-09 20:39 ` [PATCH v2 11/13] memcg: Handle bypassed kernel memory charges Suleiman Souhlal
2012-03-09 20:39 ` [PATCH v2 12/13] memcg: Per-memcg memory.kmem.slabinfo file Suleiman Souhlal
2012-03-11 10:35   ` Glauber Costa
2012-03-09 20:39 ` [PATCH v2 13/13] memcg: Document kernel memory accounting Suleiman Souhlal
2012-03-11 10:42   ` Glauber Costa
2012-03-10  6:25 ` [PATCH v2 00/13] Memcg Kernel Memory Tracking Suleiman Souhlal
