linux-kernel.vger.kernel.org archive mirror
* [PATCH] mm: memcontrol: fix cgroup creation failure after many small jobs
@ 2016-06-16  3:42 Johannes Weiner
  2016-06-16 20:06 ` Tejun Heo
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Johannes Weiner @ 2016-06-16  3:42 UTC (permalink / raw)
  To: Andrew Morton, Tejun Heo
  Cc: Vladimir Davydov, Michal Hocko, Li Zefan, linux-mm, cgroups,
	linux-kernel, kernel-team

The memory controller has quite a bit of state that usually outlives
the cgroup and pins its CSS until said state disappears. At the same
time it imposes a 16-bit limit on the CSS ID space to economically
store IDs in the wild. Consequently, when we use cgroups to contain
frequent but small and short-lived jobs that leave behind some page
cache, we quickly run into the 64k limitations of outstanding CSSs.
Creating a new cgroup fails with -ENOSPC while there are only a few,
or even no user-visible cgroups in existence.

Although pinning CSSs past cgroup removal is common, there are only
two instances that actually need a CSS ID after a cgroup is deleted:
cache shadow entries and swapout records.

Cache shadow entries reference the ID weakly and can deal with the CSS
having disappeared when it's looked up later. They pose no hurdle.

Swap-out records do need to pin the CSS to hierarchically attribute
swapins after the cgroup has been deleted; though the only pages that
remain swapped out after a process exits are tmpfs/shmem pages. Those
references are under the user's control and thus manageable.

This patch introduces a private 16-bit memcg ID and switches swap and
cache shadow entries over to using that. It then decouples the CSS
lifetime from the CSS ID lifetime, such that a CSS ID can be recycled
when the CSS is only pinned by common objects that don't need an ID.

This script demonstrates the problem by faulting one cache page in a
new cgroup and deleting it again:

set -e
mkdir -p pages
for x in `seq 128000`; do
  [ $((x % 1000)) -eq 0 ] && echo $x
  mkdir /cgroup/foo
  echo $$ >/cgroup/foo/cgroup.procs
  echo trex >pages/$x
  echo $$ >/cgroup/cgroup.procs
  rmdir /cgroup/foo
done

When run on an unpatched kernel, we eventually run out of possible CSS
IDs even though no visible cgroups exist anymore:

[root@ham ~]# ./cssidstress.sh
[...]
65000
mkdir: cannot create directory '/cgroup/foo': No space left on device

After this patch, the CSS IDs get released upon cgroup destruction and
the cache and css objects get released once memory reclaim kicks in.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/cgroup.h     |  3 ++-
 include/linux/memcontrol.h | 25 +++++++++------------
 kernel/cgroup.c            | 22 ++++++++++++++++--
 mm/memcontrol.c            | 56 ++++++++++++++++++++++++++++++++++++++++------
 mm/slab_common.c           |  4 ++--
 5 files changed, 83 insertions(+), 27 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index a20320c666fd..6510bf291d36 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -85,9 +85,10 @@ struct cgroup_subsys_state *cgroup_get_e_css(struct cgroup *cgroup,
 					     struct cgroup_subsys *ss);
 struct cgroup_subsys_state *css_tryget_online_from_dir(struct dentry *dentry,
 						       struct cgroup_subsys *ss);
-
 struct cgroup *cgroup_get_from_path(const char *path);
 
+void css_id_free(struct cgroup_subsys_state *css);
+
 int cgroup_attach_task_all(struct task_struct *from, struct task_struct *);
 int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from);
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a805474df4ab..56e6069d2452 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -97,6 +97,11 @@ enum mem_cgroup_events_target {
 #define MEM_CGROUP_ID_SHIFT	16
 #define MEM_CGROUP_ID_MAX	USHRT_MAX
 
+struct mem_cgroup_id {
+	int id;
+	atomic_t ref;
+};
+
 struct mem_cgroup_stat_cpu {
 	long count[MEMCG_NR_STAT];
 	unsigned long events[MEMCG_NR_EVENTS];
@@ -172,6 +177,9 @@ enum memcg_kmem_state {
 struct mem_cgroup {
 	struct cgroup_subsys_state css;
 
+	/* Private memcg ID. Used to ID objects that outlive the cgroup */
+	struct mem_cgroup_id id;
+
 	/* Accounted resources */
 	struct page_counter memory;
 	struct page_counter swap;
@@ -330,22 +338,9 @@ static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
 	if (mem_cgroup_disabled())
 		return 0;
 
-	return memcg->css.id;
-}
-
-/**
- * mem_cgroup_from_id - look up a memcg from an id
- * @id: the id to look up
- *
- * Caller must hold rcu_read_lock() and use css_tryget() as necessary.
- */
-static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
-{
-	struct cgroup_subsys_state *css;
-
-	css = css_from_id(id, &memory_cgrp_subsys);
-	return mem_cgroup_from_css(css);
+	return memcg->id.id;
 }
+struct mem_cgroup *mem_cgroup_from_id(unsigned short id);
 
 /**
  * parent_mem_cgroup - find the accounting parent of a memcg
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 86cb5c6e8932..2e4aff6fd6ec 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4961,10 +4961,10 @@ static void css_free_work_fn(struct work_struct *work)
 	if (ss) {
 		/* css free path */
 		struct cgroup_subsys_state *parent = css->parent;
-		int id = css->id;
 
 		ss->css_free(css);
-		cgroup_idr_remove(&ss->css_idr, id);
+		if (css->id)
+			cgroup_idr_remove(&ss->css_idr, css->id);
 		cgroup_put(cgrp);
 
 		if (parent)
@@ -6205,6 +6205,24 @@ struct cgroup *cgroup_get_from_path(const char *path)
 }
 EXPORT_SYMBOL_GPL(cgroup_get_from_path);
 
+/**
+ * css_id_free - relinquish an existing CSS's ID
+ * @css: the CSS
+ *
+ * This releases the @css's ID and allows it to be recycled while the
+ * CSS continues to exist. This is useful for controllers with state
+ * that extends past a cgroup's lifetime but doesn't need precious ID
+ * address space.
+ *
+ * This invalidates @css->id, and css_from_id() might return NULL or a
+ * new css if the ID has been recycled in the meantime.
+ */
+void css_id_free(struct cgroup_subsys_state *css)
+{
+	cgroup_idr_remove(&css->ss->css_idr, css->id);
+	css->id = 0;
+}
+
 /*
  * sock->sk_cgrp_data handling.  For more info, see sock_cgroup_data
  * definition in cgroup-defs.h.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 75e74408cc8f..1d8a6dffdc25 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4057,6 +4057,34 @@ static struct cftype mem_cgroup_legacy_files[] = {
 	{ },	/* terminate */
 };
 
+static struct idr mem_cgroup_idr;
+
+static void mem_cgroup_id_get(struct mem_cgroup *memcg)
+{
+	atomic_inc(&memcg->id.ref);
+}
+
+static void mem_cgroup_id_put(struct mem_cgroup *memcg)
+{
+	if (atomic_dec_and_test(&memcg->id.ref)) {
+		idr_remove(&mem_cgroup_idr, memcg->id.id);
+		css_id_free(&memcg->css);
+		css_put(&memcg->css);
+	}
+}
+
+/**
+ * mem_cgroup_from_id - look up a memcg from a memcg id
+ * @id: the memcg id to look up
+ *
+ * Caller must hold rcu_read_lock().
+ */
+struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
+{
+	WARN_ON_ONCE(!rcu_read_lock_held());
+	return id > 0 ? idr_find(&mem_cgroup_idr, id) : NULL;
+}
+
 static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
 {
 	struct mem_cgroup_per_node *pn;
@@ -4116,6 +4144,12 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	if (!memcg)
 		return NULL;
 
+	memcg->id.id = idr_alloc(&mem_cgroup_idr, NULL,
+				 1, MEM_CGROUP_ID_MAX,
+				 GFP_KERNEL);
+	if (memcg->id.id < 0)
+		goto fail;
+
 	memcg->stat = alloc_percpu(struct mem_cgroup_stat_cpu);
 	if (!memcg->stat)
 		goto fail;
@@ -4142,8 +4176,11 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 #ifdef CONFIG_CGROUP_WRITEBACK
 	INIT_LIST_HEAD(&memcg->cgwb_list);
 #endif
+	idr_replace(&mem_cgroup_idr, memcg, memcg->id.id);
 	return memcg;
 fail:
+	if (memcg->id.id > 0)
+		idr_remove(&mem_cgroup_idr, memcg->id.id);
 	mem_cgroup_free(memcg);
 	return NULL;
 }
@@ -4206,12 +4243,11 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	return NULL;
 }
 
-static int
-mem_cgroup_css_online(struct cgroup_subsys_state *css)
+static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
 {
-	if (css->id > MEM_CGROUP_ID_MAX)
-		return -ENOSPC;
-
+	/* Online state pins memcg ID, memcg ID pins CSS and CSS ID */
+	mem_cgroup_id_get(mem_cgroup_from_css(css));
+	css_get(css);
 	return 0;
 }
 
@@ -4234,6 +4270,8 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 
 	memcg_offline_kmem(memcg);
 	wb_memcg_offline(memcg);
+
+	mem_cgroup_id_put(memcg);
 }
 
 static void mem_cgroup_css_released(struct cgroup_subsys_state *css)
@@ -5755,6 +5793,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	if (!memcg)
 		return;
 
+	mem_cgroup_id_get(memcg);
 	oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg));
 	VM_BUG_ON_PAGE(oldid, page);
 	mem_cgroup_swap_statistics(memcg, true);
@@ -5773,6 +5812,9 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	VM_BUG_ON(!irqs_disabled());
 	mem_cgroup_charge_statistics(memcg, page, false, -1);
 	memcg_check_events(memcg, page);
+
+	if (!mem_cgroup_is_root(memcg))
+		css_put(&memcg->css);
 }
 
 /*
@@ -5803,11 +5845,11 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
 	    !page_counter_try_charge(&memcg->swap, 1, &counter))
 		return -ENOMEM;
 
+	mem_cgroup_id_get(memcg);
 	oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg));
 	VM_BUG_ON_PAGE(oldid, page);
 	mem_cgroup_swap_statistics(memcg, true);
 
-	css_get(&memcg->css);
 	return 0;
 }
 
@@ -5836,7 +5878,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry)
 				page_counter_uncharge(&memcg->memsw, 1);
 		}
 		mem_cgroup_swap_statistics(memcg, false);
-		css_put(&memcg->css);
+		mem_cgroup_id_put(memcg);
 	}
 	rcu_read_unlock();
 }
diff --git a/mm/slab_common.c b/mm/slab_common.c
index a65dad7fdcd1..82317abb03ed 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -526,8 +526,8 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
 		goto out_unlock;
 
 	cgroup_name(css->cgroup, memcg_name_buf, sizeof(memcg_name_buf));
-	cache_name = kasprintf(GFP_KERNEL, "%s(%d:%s)", root_cache->name,
-			       css->id, memcg_name_buf);
+	cache_name = kasprintf(GFP_KERNEL, "%s(%llu:%s)", root_cache->name,
+			       css->serial_nr, memcg_name_buf);
 	if (!cache_name)
 		goto out_unlock;
 
-- 
2.8.3

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] mm: memcontrol: fix cgroup creation failure after many small jobs
  2016-06-16  3:42 [PATCH] mm: memcontrol: fix cgroup creation failure after many small jobs Johannes Weiner
@ 2016-06-16 20:06 ` Tejun Heo
  2016-06-17 16:23   ` Johannes Weiner
  2016-06-17  9:06 ` [PATCH] " Vladimir Davydov
  2016-07-14 15:37 ` Johannes Weiner
  2 siblings, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2016-06-16 20:06 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Vladimir Davydov, Michal Hocko, Li Zefan,
	linux-mm, cgroups, linux-kernel, kernel-team

Hello,

Looks generally good to me.  Some comments below.

On Wed, Jun 15, 2016 at 11:42:44PM -0400, Johannes Weiner wrote:
> @@ -6205,6 +6205,24 @@ struct cgroup *cgroup_get_from_path(const char *path)
>  }
>  EXPORT_SYMBOL_GPL(cgroup_get_from_path);
>  
> +/**
> + * css_id_free - relinquish an existing CSS's ID
> + * @css: the CSS
> + *
> + * This releases the @css's ID and allows it to be recycled while the
> + * CSS continues to exist. This is useful for controllers with state
> + * that extends past a cgroup's lifetime but doesn't need precious ID
> + * address space.
> + *
> + * This invalidates @css->id, and css_from_id() might return NULL or a
> + * new css if the ID has been recycled in the meantime.
> + */
> +void css_id_free(struct cgroup_subsys_state *css)
> +{
> +	cgroup_idr_remove(&css->ss->css_idr, css->id);
> +	css->id = 0;
> +}

I don't quite get why we're trying to free css->id earlier when memcg
is gonna be using its private id anyway.  From cgroup core side, the
id space isn't restricted.

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 75e74408cc8f..1d8a6dffdc25 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
...
> +static void mem_cgroup_id_put(struct mem_cgroup *memcg)
> +{
> +	if (atomic_dec_and_test(&memcg->id.ref)) {
> +		idr_remove(&mem_cgroup_idr, memcg->id.id);

Maybe this should do "memcg->id.id = 0"?

> +		css_id_free(&memcg->css);
> +		css_put(&memcg->css);
> +	}
> +}
> +
> +/**
> + * mem_cgroup_from_id - look up a memcg from a memcg id
> + * @id: the memcg id to look up
> + *
> + * Caller must hold rcu_read_lock().
> + */
> +struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
> +{
> +	WARN_ON_ONCE(!rcu_read_lock_held());
> +	return id > 0 ? idr_find(&mem_cgroup_idr, id) : NULL;
> +}

css_from_id() has it too but I don't think id > 0 test is necessary.
We prolly should take it out of css_from_id() too.

It might be useful to add comment explaining why memcg needs private
ids.

Thanks.

-- 
tejun

* Re: [PATCH] mm: memcontrol: fix cgroup creation failure after many small jobs
  2016-06-16  3:42 [PATCH] mm: memcontrol: fix cgroup creation failure after many small jobs Johannes Weiner
  2016-06-16 20:06 ` Tejun Heo
@ 2016-06-17  9:06 ` Vladimir Davydov
  2016-06-17 16:40   ` Johannes Weiner
  2016-07-14 15:37 ` Johannes Weiner
  2 siblings, 1 reply; 14+ messages in thread
From: Vladimir Davydov @ 2016-06-17  9:06 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Tejun Heo, Michal Hocko, Li Zefan, linux-mm,
	cgroups, linux-kernel, kernel-team

On Wed, Jun 15, 2016 at 11:42:44PM -0400, Johannes Weiner wrote:
> The memory controller has quite a bit of state that usually outlives
> the cgroup and pins its CSS until said state disappears. At the same
> time it imposes a 16-bit limit on the CSS ID space to economically
> store IDs in the wild. Consequently, when we use cgroups to contain
> frequent but small and short-lived jobs that leave behind some page
> cache, we quickly run into the 64k limitations of outstanding CSSs.
> Creating a new cgroup fails with -ENOSPC while there are only a few,
> or even no user-visible cgroups in existence.
> 
> Although pinning CSSs past cgroup removal is common, there are only
> two instances that actually need a CSS ID after a cgroup is deleted:
> cache shadow entries and swapout records.
> 
> Cache shadow entries reference the ID weakly and can deal with the CSS
> having disappeared when it's looked up later. They pose no hurdle.
> 
> Swap-out records do need to pin the css to hierarchically attribute
> swapins after the cgroup has been deleted; though the only pages that
> remain swapped out after a process exits are tmpfs/shmem pages. Those
> references are under the user's control and thus manageable.
> 
> This patch introduces a private 16bit memcg ID and switches swap and
> cache shadow entries over to using that. It then decouples the CSS
> lifetime from the CSS ID lifetime, such that a CSS ID can be recycled
> when the CSS is only pinned by common objects that don't need an ID.

There's already an id which is only used for online memory cgroups -
it's kmemcg_id. Maybe, instead of introducing one more idr, we could
name it generically and reuse it for shadow entries?

Regarding swap entries, would it really make much difference if we used
4 bytes per swap page instead of 2? For a 100 GB swap it'd increase the
overhead from 50 MB to 100 MB, which still doesn't seem too much IMO,
so maybe just use the plain unrestricted css->id for swap entries?

* Re: [PATCH] mm: memcontrol: fix cgroup creation failure after many small jobs
  2016-06-16 20:06 ` Tejun Heo
@ 2016-06-17 16:23   ` Johannes Weiner
  2016-06-17 16:23     ` [PATCH 1/3] cgroup: fix idr leak for the first cgroup root Johannes Weiner
                       ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Johannes Weiner @ 2016-06-17 16:23 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, Vladimir Davydov, Michal Hocko, Li Zefan,
	linux-mm, cgroups, linux-kernel, kernel-team

On Thu, Jun 16, 2016 at 04:06:17PM -0400, Tejun Heo wrote:
> On Wed, Jun 15, 2016 at 11:42:44PM -0400, Johannes Weiner wrote:
> > @@ -6205,6 +6205,24 @@ struct cgroup *cgroup_get_from_path(const char *path)
> >  }
> >  EXPORT_SYMBOL_GPL(cgroup_get_from_path);
> >  
> > +/**
> > + * css_id_free - relinquish an existing CSS's ID
> > + * @css: the CSS
> > + *
> > + * This releases the @css's ID and allows it to be recycled while the
> > + * CSS continues to exist. This is useful for controllers with state
> > + * that extends past a cgroup's lifetime but doesn't need precious ID
> > + * address space.
> > + *
> > + * This invalidates @css->id, and css_from_id() might return NULL or a
> > + * new css if the ID has been recycled in the meantime.
> > + */
> > +void css_id_free(struct cgroup_subsys_state *css)
> > +{
> > +	cgroup_idr_remove(&css->ss->css_idr, css->id);
> > +	css->id = 0;
> > +}
> 
> I don't quite get why we're trying to free css->id earlier when memcg
> is gonna be using its private id anyway.  From cgroup core side, the
> id space isn't restricted.

For some reason I was thinking of CSS ID being restricted as well, but
of course the only restriction is what's enforced in memcg onlining. I
deleted it.

> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 75e74408cc8f..1d8a6dffdc25 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> ...
> > +static void mem_cgroup_id_put(struct mem_cgroup *memcg)
> > +{
> > +	if (atomic_dec_and_test(&memcg->id.ref)) {
> > +		idr_remove(&mem_cgroup_idr, memcg->id.id);
> 
> Maybe this should do "memcg->id.id = 0"?

Added.

> > +		css_id_free(&memcg->css);
> > +		css_put(&memcg->css);
> > +	}
> > +}
> > +
> > +/**
> > + * mem_cgroup_from_id - look up a memcg from a memcg id
> > + * @id: the memcg id to look up
> > + *
> > + * Caller must hold rcu_read_lock().
> > + */
> > +struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
> > +{
> > +	WARN_ON_ONCE(!rcu_read_lock_held());
> > +	return id > 0 ? idr_find(&mem_cgroup_idr, id) : NULL;
> > +}
> 
> css_from_id() has it too but I don't think id > 0 test is necessary.
> We prolly should take it out of css_from_id() too.

Yeah, idr_find() just returns NULL for index 0 - no warning. I removed
it from my patch and added a patch to remove it in css_from_id().

> It might be useful to add comment explaining why memcg needs private
> ids.

Good point. I put an intro comment above the mem_cgroup_idr definition
that explains why we need a private space.

Thanks

* [PATCH 1/3] cgroup: fix idr leak for the first cgroup root
  2016-06-17 16:23   ` Johannes Weiner
@ 2016-06-17 16:23     ` Johannes Weiner
  2016-06-17 16:24     ` [PATCH 2/3] cgroup: remove unnecessary 0 check from css_from_id() Johannes Weiner
  2016-06-17 16:25     ` [PATCH 3/3] mm: memcontrol: fix cgroup creation failure after many small jobs Johannes Weiner
  2 siblings, 0 replies; 14+ messages in thread
From: Johannes Weiner @ 2016-06-17 16:23 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, Vladimir Davydov, Michal Hocko, Li Zefan,
	linux-mm, cgroups, linux-kernel, kernel-team

The valid cgroup hierarchy ID range includes 0, so we can't filter for
positive numbers when freeing it, or it'll leak the first ID. No big
deal, just disruptive when reading the code.

The ID is freed during error handling and when the reference count
hits zero, so the double-free test is not necessary; remove it.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 kernel/cgroup.c | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 86cb5c6e8932..36fc0ff506c3 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1158,18 +1158,12 @@ static void cgroup_exit_root_id(struct cgroup_root *root)
 {
 	lockdep_assert_held(&cgroup_mutex);
 
-	if (root->hierarchy_id) {
-		idr_remove(&cgroup_hierarchy_idr, root->hierarchy_id);
-		root->hierarchy_id = 0;
-	}
+	idr_remove(&cgroup_hierarchy_idr, root->hierarchy_id);
 }
 
 static void cgroup_free_root(struct cgroup_root *root)
 {
 	if (root) {
-		/* hierarchy ID should already have been released */
-		WARN_ON_ONCE(root->hierarchy_id);
-
 		idr_destroy(&root->cgroup_idr);
 		kfree(root);
 	}
-- 
2.8.3

* [PATCH 2/3] cgroup: remove unnecessary 0 check from css_from_id()
  2016-06-17 16:23   ` Johannes Weiner
  2016-06-17 16:23     ` [PATCH 1/3] cgroup: fix idr leak for the first cgroup root Johannes Weiner
@ 2016-06-17 16:24     ` Johannes Weiner
  2016-06-17 18:17       ` Tejun Heo
  2016-06-17 16:25     ` [PATCH 3/3] mm: memcontrol: fix cgroup creation failure after many small jobs Johannes Weiner
  2 siblings, 1 reply; 14+ messages in thread
From: Johannes Weiner @ 2016-06-17 16:24 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, Vladimir Davydov, Michal Hocko, Li Zefan,
	linux-mm, cgroups, linux-kernel, kernel-team

css_idr allocation starts at 1, so index 0 will never point to an
item. css_from_id() currently filters that before asking idr_find(),
but idr_find() would also just return NULL, so this is not needed.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 kernel/cgroup.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 36fc0ff506c3..78f6d18ff0af 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -6162,7 +6162,7 @@ struct cgroup_subsys_state *css_tryget_online_from_dir(struct dentry *dentry,
 struct cgroup_subsys_state *css_from_id(int id, struct cgroup_subsys *ss)
 {
 	WARN_ON_ONCE(!rcu_read_lock_held());
-	return id > 0 ? idr_find(&ss->css_idr, id) : NULL;
+	return idr_find(&ss->css_idr, id);
 }
 
 /**
-- 
2.8.3

* [PATCH 3/3] mm: memcontrol: fix cgroup creation failure after many small jobs
  2016-06-17 16:23   ` Johannes Weiner
  2016-06-17 16:23     ` [PATCH 1/3] cgroup: fix idr leak for the first cgroup root Johannes Weiner
  2016-06-17 16:24     ` [PATCH 2/3] cgroup: remove unnecessary 0 check from css_from_id() Johannes Weiner
@ 2016-06-17 16:25     ` Johannes Weiner
  2016-06-17 18:18       ` Tejun Heo
                         ` (2 more replies)
  2 siblings, 3 replies; 14+ messages in thread
From: Johannes Weiner @ 2016-06-17 16:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, Vladimir Davydov, Michal Hocko, Li Zefan,
	linux-mm, cgroups, linux-kernel, kernel-team

The memory controller has quite a bit of state that usually outlives
the cgroup and pins its CSS until said state disappears. At the same
time it imposes a 16-bit limit on the CSS ID space to economically
store IDs in the wild. Consequently, when we use cgroups to contain
frequent but small and short-lived jobs that leave behind some page
cache, we quickly run into the 64k limitations of outstanding CSSs.
Creating a new cgroup fails with -ENOSPC while there are only a few,
or even no user-visible cgroups in existence.

Although pinning CSSs past cgroup removal is common, there are only
two instances that actually need an ID after a cgroup is deleted:
cache shadow entries and swapout records.

Cache shadow entries reference the ID weakly and can deal with the CSS
having disappeared when it's looked up later. They pose no hurdle.

Swap-out records do need to pin the css to hierarchically attribute
swapins after the cgroup has been deleted; though the only pages that
remain swapped out after offlining are tmpfs/shmem pages. And those
references are under the user's control, so they are manageable.

This patch introduces a private 16-bit memcg ID and switches swap and
cache shadow entries over to using that. This ID can then be recycled
after offlining when the CSS remains pinned only by objects that don't
specifically need it.

This script demonstrates the problem by faulting one cache page in a
new cgroup and deleting it again:

set -e
mkdir -p pages
for x in `seq 128000`; do
  [ $((x % 1000)) -eq 0 ] && echo $x
  mkdir /cgroup/foo
  echo $$ >/cgroup/foo/cgroup.procs
  echo trex >pages/$x
  echo $$ >/cgroup/cgroup.procs
  rmdir /cgroup/foo
done

When run on an unpatched kernel, we eventually run out of possible IDs
even though there are no visible cgroups:

[root@ham ~]# ./cssidstress.sh
[...]
65000
mkdir: cannot create directory '/cgroup/foo': No space left on device

After this patch, the IDs get released upon cgroup destruction and the
cache and css objects get released once memory reclaim kicks in.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/memcontrol.h | 25 ++++++--------
 mm/memcontrol.c            | 82 ++++++++++++++++++++++++++++++++++++++++++----
 mm/slab_common.c           |  4 +--
 3 files changed, 87 insertions(+), 24 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a805474df4ab..56e6069d2452 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -97,6 +97,11 @@ enum mem_cgroup_events_target {
 #define MEM_CGROUP_ID_SHIFT	16
 #define MEM_CGROUP_ID_MAX	USHRT_MAX
 
+struct mem_cgroup_id {
+	int id;
+	atomic_t ref;
+};
+
 struct mem_cgroup_stat_cpu {
 	long count[MEMCG_NR_STAT];
 	unsigned long events[MEMCG_NR_EVENTS];
@@ -172,6 +177,9 @@ enum memcg_kmem_state {
 struct mem_cgroup {
 	struct cgroup_subsys_state css;
 
+	/* Private memcg ID. Used to ID objects that outlive the cgroup */
+	struct mem_cgroup_id id;
+
 	/* Accounted resources */
 	struct page_counter memory;
 	struct page_counter swap;
@@ -330,22 +338,9 @@ static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
 	if (mem_cgroup_disabled())
 		return 0;
 
-	return memcg->css.id;
-}
-
-/**
- * mem_cgroup_from_id - look up a memcg from an id
- * @id: the id to look up
- *
- * Caller must hold rcu_read_lock() and use css_tryget() as necessary.
- */
-static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
-{
-	struct cgroup_subsys_state *css;
-
-	css = css_from_id(id, &memory_cgrp_subsys);
-	return mem_cgroup_from_css(css);
+	return memcg->id.id;
 }
+struct mem_cgroup *mem_cgroup_from_id(unsigned short id);
 
 /**
  * parent_mem_cgroup - find the accounting parent of a memcg
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 75e74408cc8f..dc92b2df2585 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4057,6 +4057,60 @@ static struct cftype mem_cgroup_legacy_files[] = {
 	{ },	/* terminate */
 };
 
+/*
+ * Private memory cgroup IDR
+ *
+ * Swap-out records and page cache shadow entries need to store memcg
+ * references in constrained space, so we maintain an ID space that is
+ * limited to 16 bit (MEM_CGROUP_ID_MAX), limiting the total number of
+ * memory-controlled cgroups to 64k.
+ *
+ * However, there usually are many references to the offline CSS after
+ * the cgroup has been destroyed, such as page cache or reclaimable
+ * slab objects, that don't need to hang on to the ID. We want to keep
+ * those dead CSSs from occupying IDs, or we might quickly exhaust the
+ * relatively small ID space and prevent the creation of new cgroups
+ * even when there are much fewer than 64k cgroups - possibly none.
+ *
+ * Maintain a private 16-bit ID space for memcg, and allow the ID to
+ * be freed and recycled when it's no longer needed, which is usually
+ * when the CSS is offlined.
+ *
+ * The only exception to that are records of swapped out tmpfs/shmem
+ * pages that need to be attributed to live ancestors on swapin. But
+ * those references are manageable from userspace.
+ */
+
+static struct idr mem_cgroup_idr;
+
+static void mem_cgroup_id_get(struct mem_cgroup *memcg)
+{
+	atomic_inc(&memcg->id.ref);
+}
+
+static void mem_cgroup_id_put(struct mem_cgroup *memcg)
+{
+	if (atomic_dec_and_test(&memcg->id.ref)) {
+		idr_remove(&mem_cgroup_idr, memcg->id.id);
+		memcg->id.id = 0;
+
+		/* Memcg ID pins CSS */
+		css_put(&memcg->css);
+	}
+}
+
+/**
+ * mem_cgroup_from_id - look up a memcg from a memcg id
+ * @id: the memcg id to look up
+ *
+ * Caller must hold rcu_read_lock().
+ */
+struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
+{
+	WARN_ON_ONCE(!rcu_read_lock_held());
+	return idr_find(&mem_cgroup_idr, id);
+}
+
 static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
 {
 	struct mem_cgroup_per_node *pn;
@@ -4116,6 +4170,12 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	if (!memcg)
 		return NULL;
 
+	memcg->id.id = idr_alloc(&mem_cgroup_idr, NULL,
+				 1, MEM_CGROUP_ID_MAX,
+				 GFP_KERNEL);
+	if (memcg->id.id < 0)
+		goto fail;
+
 	memcg->stat = alloc_percpu(struct mem_cgroup_stat_cpu);
 	if (!memcg->stat)
 		goto fail;
@@ -4142,8 +4202,11 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 #ifdef CONFIG_CGROUP_WRITEBACK
 	INIT_LIST_HEAD(&memcg->cgwb_list);
 #endif
+	idr_replace(&mem_cgroup_idr, memcg, memcg->id.id);
 	return memcg;
 fail:
+	if (memcg->id.id > 0)
+		idr_remove(&mem_cgroup_idr, memcg->id.id);
 	mem_cgroup_free(memcg);
 	return NULL;
 }
@@ -4206,12 +4269,11 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	return NULL;
 }
 
-static int
-mem_cgroup_css_online(struct cgroup_subsys_state *css)
+static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
 {
-	if (css->id > MEM_CGROUP_ID_MAX)
-		return -ENOSPC;
-
+	/* Online state pins memcg ID, memcg ID pins CSS */
+	mem_cgroup_id_get(mem_cgroup_from_css(css));
+	css_get(css);
 	return 0;
 }
 
@@ -4234,6 +4296,8 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 
 	memcg_offline_kmem(memcg);
 	wb_memcg_offline(memcg);
+
+	mem_cgroup_id_put(memcg);
 }
 
 static void mem_cgroup_css_released(struct cgroup_subsys_state *css)
@@ -5755,6 +5819,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	if (!memcg)
 		return;
 
+	mem_cgroup_id_get(memcg);
 	oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg));
 	VM_BUG_ON_PAGE(oldid, page);
 	mem_cgroup_swap_statistics(memcg, true);
@@ -5773,6 +5838,9 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	VM_BUG_ON(!irqs_disabled());
 	mem_cgroup_charge_statistics(memcg, page, false, -1);
 	memcg_check_events(memcg, page);
+
+	if (!mem_cgroup_is_root(memcg))
+		css_put(&memcg->css);
 }
 
 /*
@@ -5803,11 +5871,11 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
 	    !page_counter_try_charge(&memcg->swap, 1, &counter))
 		return -ENOMEM;
 
+	mem_cgroup_id_get(memcg);
 	oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg));
 	VM_BUG_ON_PAGE(oldid, page);
 	mem_cgroup_swap_statistics(memcg, true);
 
-	css_get(&memcg->css);
 	return 0;
 }
 
@@ -5836,7 +5904,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry)
 				page_counter_uncharge(&memcg->memsw, 1);
 		}
 		mem_cgroup_swap_statistics(memcg, false);
-		css_put(&memcg->css);
+		mem_cgroup_id_put(memcg);
 	}
 	rcu_read_unlock();
 }
diff --git a/mm/slab_common.c b/mm/slab_common.c
index a65dad7fdcd1..82317abb03ed 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -526,8 +526,8 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
 		goto out_unlock;
 
 	cgroup_name(css->cgroup, memcg_name_buf, sizeof(memcg_name_buf));
-	cache_name = kasprintf(GFP_KERNEL, "%s(%d:%s)", root_cache->name,
-			       css->id, memcg_name_buf);
+	cache_name = kasprintf(GFP_KERNEL, "%s(%llu:%s)", root_cache->name,
+			       css->serial_nr, memcg_name_buf);
 	if (!cache_name)
 		goto out_unlock;
 
-- 
2.8.3

* Re: [PATCH] mm: memcontrol: fix cgroup creation failure after many small jobs
  2016-06-17  9:06 ` [PATCH] " Vladimir Davydov
@ 2016-06-17 16:40   ` Johannes Weiner
  0 siblings, 0 replies; 14+ messages in thread
From: Johannes Weiner @ 2016-06-17 16:40 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Tejun Heo, Michal Hocko, Li Zefan, linux-mm,
	cgroups, linux-kernel, kernel-team

On Fri, Jun 17, 2016 at 12:06:55PM +0300, Vladimir Davydov wrote:
> On Wed, Jun 15, 2016 at 11:42:44PM -0400, Johannes Weiner wrote:
> > The memory controller has quite a bit of state that usually outlives
> > the cgroup and pins its CSS until said state disappears. At the same
> > time it imposes a 16-bit limit on the CSS ID space to economically
> > store IDs in the wild. Consequently, when we use cgroups to contain
> > frequent but small and short-lived jobs that leave behind some page
> > cache, we quickly run into the 64k limitations of outstanding CSSs.
> > Creating a new cgroup fails with -ENOSPC while there are only a few,
> > or even no user-visible cgroups in existence.
> > 
> > Although pinning CSSs past cgroup removal is common, there are only
> > two instances that actually need a CSS ID after a cgroup is deleted:
> > cache shadow entries and swapout records.
> > 
> > Cache shadow entries reference the ID weakly and can deal with the CSS
> > having disappeared when it's looked up later. They pose no hurdle.
> > 
> > Swap-out records do need to pin the css to hierarchically attribute
> > swapins after the cgroup has been deleted; though the only pages that
> > remain swapped out after a process exits are tmpfs/shmem pages. Those
> > references are under the user's control and thus manageable.
> > 
> > This patch introduces a private 16bit memcg ID and switches swap and
> > cache shadow entries over to using that. It then decouples the CSS
> > lifetime from the CSS ID lifetime, such that a CSS ID can be recycled
> > when the CSS is only pinned by common objects that don't need an ID.
> 
> There's already an id which is used only for online memory cgroups -
> kmemcg_id. Maybe, instead of introducing one more idr, we could name it
> generically and reuse it for shadow entries?

Good point. But mem_cgroup_idr seems more generic; it makes sense to
switch slab accounting over to that. I'll look into it, but as a
refactoring patch on top of this fix.

> Regarding swap entries, would it really make much difference if we used
> 4 bytes per swap page instead of 2? For a 100 GB swap it'd increase
> overhead from 50 MB up to 100 MB, which still doesn't seem too much IMO,
> so maybe just use the plain unrestricted css->id for swap entries?

Yes and no. I agree that the increased consumption wouldn't be too
crazy, but if we have to maintain a 16-bit ID anyway, we might as well
use it for swap too to save that space. I don't think tmpfs and shmem
pins past offlining will be common enough to significantly eat into
the ID space of online cgroups.
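Vladimir's sizing above checks out with 4 KiB pages; a quick back-of-envelope helper (purely illustrative, not a kernel function):

```c
#include <assert.h>

/* One swap_cgroup entry per 4 KiB swap page: 2 bytes with a 16-bit
 * memcg ID versus 4 bytes with a full css->id. */
static unsigned long long swap_map_mib(unsigned long long swap_gib,
				       unsigned int bytes_per_entry)
{
	unsigned long long pages = (swap_gib << 30) / 4096;

	return pages * bytes_per_entry >> 20;	/* bytes -> MiB */
}
```

For 100 GiB of swap that is 50 MiB versus 100 MiB, matching the figures in the thread.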


* Re: [PATCH 2/3] cgroup: remove unnecessary 0 check from css_from_id()
  2016-06-17 16:24     ` [PATCH 2/3] cgroup: remove unnecessary 0 check from css_from_id() Johannes Weiner
@ 2016-06-17 18:17       ` Tejun Heo
  0 siblings, 0 replies; 14+ messages in thread
From: Tejun Heo @ 2016-06-17 18:17 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Vladimir Davydov, Michal Hocko, Li Zefan,
	linux-mm, cgroups, linux-kernel, kernel-team

On Fri, Jun 17, 2016 at 12:24:27PM -0400, Johannes Weiner wrote:
> css_idr allocation starts at 1, so index 0 will never point to an
> item. css_from_id() currently filters that before asking idr_find(),
> but idr_find() would also just return NULL, so this is not needed.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Applied 1-2 to cgroup/for-4.8.

Thanks.

-- 
tejun
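The redundancy removed by patch 2/3 can be seen in a toy model: because IDs are allocated starting at 1, slot 0 is never populated and a plain lookup already yields NULL (illustrative stand-in, not the kernel's css_from_id()):

```c
#include <assert.h>
#include <stddef.h>

struct fake_css {
	int id;
};

static struct fake_css *css_table[8];	/* slot 0 is never allocated */

static struct fake_css *fake_css_from_id(int id)
{
	if (id < 0 || id >= 8)
		return NULL;
	return css_table[id];		/* no special case needed for id 0 */
}
```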


* Re: [PATCH 3/3] mm: memcontrol: fix cgroup creation failure after many small jobs
  2016-06-17 16:25     ` [PATCH 3/3] mm: memcontrol: fix cgroup creation failure after many small jobs Johannes Weiner
@ 2016-06-17 18:18       ` Tejun Heo
  2016-06-20  6:14       ` Nikolay Borisov
  2016-06-21 10:16       ` Vladimir Davydov
  2 siblings, 0 replies; 14+ messages in thread
From: Tejun Heo @ 2016-06-17 18:18 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Vladimir Davydov, Michal Hocko, Li Zefan,
	linux-mm, cgroups, linux-kernel, kernel-team

On Fri, Jun 17, 2016 at 12:25:16PM -0400, Johannes Weiner wrote:
> The memory controller has quite a bit of state that usually outlives
> the cgroup and pins its CSS until said state disappears. At the same
> time it imposes a 16-bit limit on the CSS ID space to economically
> store IDs in the wild. Consequently, when we use cgroups to contain
> frequent but small and short-lived jobs that leave behind some page
> cache, we quickly run into the 64k limitations of outstanding CSSs.
> Creating a new cgroup fails with -ENOSPC while there are only a few,
> or even no user-visible cgroups in existence.

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun


* Re: [PATCH 3/3] mm: memcontrol: fix cgroup creation failure after many small jobs
  2016-06-17 16:25     ` [PATCH 3/3] mm: memcontrol: fix cgroup creation failure after many small jobs Johannes Weiner
  2016-06-17 18:18       ` Tejun Heo
@ 2016-06-20  6:14       ` Nikolay Borisov
  2016-06-21 10:16       ` Vladimir Davydov
  2 siblings, 0 replies; 14+ messages in thread
From: Nikolay Borisov @ 2016-06-20  6:14 UTC (permalink / raw)
  To: Johannes Weiner, Tejun Heo
  Cc: Andrew Morton, Vladimir Davydov, Michal Hocko, Li Zefan,
	linux-mm, cgroups, linux-kernel, kernel-team



On 06/17/2016 07:25 PM, Johannes Weiner wrote:
> The memory controller has quite a bit of state that usually outlives
> the cgroup and pins its CSS until said state disappears. At the same
> time it imposes a 16-bit limit on the CSS ID space to economically
> store IDs in the wild. Consequently, when we use cgroups to contain
> frequent but small and short-lived jobs that leave behind some page
> cache, we quickly run into the 64k limitations of outstanding CSSs.
> Creating a new cgroup fails with -ENOSPC while there are only a few,
> or even no user-visible cgroups in existence.
> 
> Although pinning CSSs past cgroup removal is common, there are only
> two instances that actually need an ID after a cgroup is deleted:
> cache shadow entries and swapout records.
> 
> Cache shadow entries reference the ID weakly and can deal with the CSS
> having disappeared when it's looked up later. They pose no hurdle.
> 
> Swap-out records do need to pin the css to hierarchically attribute
> swapins after the cgroup has been deleted; though the only pages that
> remain swapped out after offlining are tmpfs/shmem pages. And those
> references are under the user's control, so they are manageable.
> 
> This patch introduces a private 16-bit memcg ID and switches swap and
> cache shadow entries over to using that. This ID can then be recycled
> after offlining when the CSS remains pinned only by objects that don't
> specifically need it.
> 
> This script demonstrates the problem by faulting one cache page in a
> new cgroup and deleting it again:
> 
> set -e
> mkdir -p pages
> for x in `seq 128000`; do
>   [ $((x % 1000)) -eq 0 ] && echo $x
>   mkdir /cgroup/foo
>   echo $$ >/cgroup/foo/cgroup.procs
>   echo trex >pages/$x
>   echo $$ >/cgroup/cgroup.procs
>   rmdir /cgroup/foo
> done

Perhaps you could send this script to the LTP project to have this as a
regression test?

> 
> When run on an unpatched kernel, we eventually run out of possible IDs
> even though there are no visible cgroups:
> 
> [root@ham ~]# ./cssidstress.sh
> [...]
> 65000
> mkdir: cannot create directory '/cgroup/foo': No space left on device
> 
> After this patch, the IDs get released upon cgroup destruction and the
> cache and css objects get released once memory reclaim kicks in.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>


* Re: [PATCH 3/3] mm: memcontrol: fix cgroup creation failure after many small jobs
  2016-06-17 16:25     ` [PATCH 3/3] mm: memcontrol: fix cgroup creation failure after many small jobs Johannes Weiner
  2016-06-17 18:18       ` Tejun Heo
  2016-06-20  6:14       ` Nikolay Borisov
@ 2016-06-21 10:16       ` Vladimir Davydov
  2016-06-21 15:46         ` Johannes Weiner
  2 siblings, 1 reply; 14+ messages in thread
From: Vladimir Davydov @ 2016-06-21 10:16 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tejun Heo, Andrew Morton, Michal Hocko, Li Zefan, linux-mm,
	cgroups, linux-kernel, kernel-team

On Fri, Jun 17, 2016 at 12:25:16PM -0400, Johannes Weiner wrote:
> The memory controller has quite a bit of state that usually outlives
> the cgroup and pins its CSS until said state disappears. At the same
> time it imposes a 16-bit limit on the CSS ID space to economically
> store IDs in the wild. Consequently, when we use cgroups to contain
> frequent but small and short-lived jobs that leave behind some page
> cache, we quickly run into the 64k limitations of outstanding CSSs.
> Creating a new cgroup fails with -ENOSPC while there are only a few,
> or even no user-visible cgroups in existence.
> 
> Although pinning CSSs past cgroup removal is common, there are only
> two instances that actually need an ID after a cgroup is deleted:
> cache shadow entries and swapout records.
> 
> Cache shadow entries reference the ID weakly and can deal with the CSS
> having disappeared when it's looked up later. They pose no hurdle.
> 
> Swap-out records do need to pin the css to hierarchically attribute
> swapins after the cgroup has been deleted; though the only pages that
> remain swapped out after offlining are tmpfs/shmem pages. And those
> references are under the user's control, so they are manageable.
> 
> This patch introduces a private 16-bit memcg ID and switches swap and
> cache shadow entries over to using that. This ID can then be recycled
> after offlining when the CSS remains pinned only by objects that don't
> specifically need it.
> 
> This script demonstrates the problem by faulting one cache page in a
> new cgroup and deleting it again:
> 
> set -e
> mkdir -p pages
> for x in `seq 128000`; do
>   [ $((x % 1000)) -eq 0 ] && echo $x
>   mkdir /cgroup/foo
>   echo $$ >/cgroup/foo/cgroup.procs
>   echo trex >pages/$x
>   echo $$ >/cgroup/cgroup.procs
>   rmdir /cgroup/foo
> done
> 
> When run on an unpatched kernel, we eventually run out of possible IDs
> even though there are no visible cgroups:
> 
> [root@ham ~]# ./cssidstress.sh
> [...]
> 65000
> mkdir: cannot create directory '/cgroup/foo': No space left on device
> 
> After this patch, the IDs get released upon cgroup destruction and the
> cache and css objects get released once memory reclaim kicks in.

With 65K cgroups it will take the reclaimer a substantial amount of time
to iterate over all of them, which might result in latency spikes.
Probably, to avoid that, we could move pages from a dead cgroup's lru to
its parent's one on offline while still leaving dead cgroups pinned,
like we do in case of list_lru entries.

> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>

One nit below.

...
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 75e74408cc8f..dc92b2df2585 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4057,6 +4057,60 @@ static struct cftype mem_cgroup_legacy_files[] = {
>  	{ },	/* terminate */
>  };
>  
> +/*
> + * Private memory cgroup IDR
> + *
> + * Swap-out records and page cache shadow entries need to store memcg
> + * references in constrained space, so we maintain an ID space that is
> + * limited to 16 bit (MEM_CGROUP_ID_MAX), limiting the total number of
> + * memory-controlled cgroups to 64k.
> + *
> + * However, there usually are many references to the oflline CSS after
> + * the cgroup has been destroyed, such as page cache or reclaimable
> + * slab objects, that don't need to hang on to the ID. We want to keep
> + * those dead CSS from occupying IDs, or we might quickly exhaust the
> + * relatively small ID space and prevent the creation of new cgroups
> + * even when there are much fewer than 64k cgroups - possibly none.
> + *
> + * Maintain a private 16-bit ID space for memcg, and allow the ID to
> + * be freed and recycled when it's no longer needed, which is usually
> + * when the CSS is offlined.
> + *
> + * The only exception to that are records of swapped out tmpfs/shmem
> + * pages that need to be attributed to live ancestors on swapin. But
> + * those references are manageable from userspace.
> + */
> +
> +static struct idr mem_cgroup_idr;

static DEFINE_IDR(mem_cgroup_idr);


* Re: [PATCH 3/3] mm: memcontrol: fix cgroup creation failure after many small jobs
  2016-06-21 10:16       ` Vladimir Davydov
@ 2016-06-21 15:46         ` Johannes Weiner
  0 siblings, 0 replies; 14+ messages in thread
From: Johannes Weiner @ 2016-06-21 15:46 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Tejun Heo, Andrew Morton, Michal Hocko, Li Zefan, linux-mm,
	cgroups, linux-kernel, kernel-team

On Tue, Jun 21, 2016 at 01:16:51PM +0300, Vladimir Davydov wrote:
> On Fri, Jun 17, 2016 at 12:25:16PM -0400, Johannes Weiner wrote:
> > After this patch, the IDs get released upon cgroup destruction and the
> > cache and css objects get released once memory reclaim kicks in.
> 
> With 65K cgroups it will take the reclaimer a substantial amount of time
> to iterate over all of them, which might result in latency spikes.
> Probably, to avoid that, we could move pages from a dead cgroup's lru to
> its parent's one on offline while still leaving dead cgroups pinned,
> like we do in case of list_lru entries.

Yep, that is true. list_lru is a bit easier because the walker stays
in the context of the original LRU list, whereas with the cache/anon
LRUs it does not. We'd have to have mem_cgroup_page_lruvec() etc. do a
parent walk to find the closest live ancestor. Maybe you have a better
idea?

But it's definitely worth considering. I'll think more about it.
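The parent walk described above amounts to something like the following (userspace sketch only; the real change would live in mem_cgroup_page_lruvec() and friends):

```c
#include <assert.h>
#include <stddef.h>

/* Given a possibly offlined group, climb to the closest ancestor
 * that is still online. Illustrative stand-in types. */
struct fake_memcg {
	struct fake_memcg *parent;
	int online;
};

static struct fake_memcg *nearest_online(struct fake_memcg *memcg)
{
	while (memcg && !memcg->online)
		memcg = memcg->parent;
	return memcg;	/* NULL only if the whole chain is dead */
}
```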

> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>

Thanks!

> > +static struct idr mem_cgroup_idr;
> 
> static DEFINE_IDR(mem_cgroup_idr);

Oops, good catch. Andrew, could you kindly fold this?

From d1261ede8f975a5fccb2e9125562260e4b4b4f3d Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Tue, 21 Jun 2016 11:36:13 -0400
Subject: [PATCH] mm: memcontrol: fix cgroup creation failure after many small
 jobs fix

init the IDR

Reported-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index dc92b2df2585..b0de1342eab2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4081,7 +4081,7 @@ static struct cftype mem_cgroup_legacy_files[] = {
  * those references are manageable from userspace.
  */
 
-static struct idr mem_cgroup_idr;
+static DEFINE_IDR(mem_cgroup_idr);
 
 static void mem_cgroup_id_get(struct mem_cgroup *memcg)
 {
-- 
2.8.3
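For readers wondering why the fixup matters: a bare `static struct idr` is zero-initialized but skips the setup that DEFINE_IDR()/idr_init() perform. A userspace analogue of the static-initializer pattern (all names hypothetical):

```c
#include <assert.h>

/* A zeroed static struct is not the same as a properly initialized
 * one, so the API provides a static-initializer macro. */
struct tiny_idr {
	int initialized;	/* stands in for the idr's internal setup */
	int next;		/* next ID to hand out; IDs start at 1 */
};

#define DEFINE_TINY_IDR(name) \
	struct tiny_idr name = { .initialized = 1, .next = 1 }

DEFINE_TINY_IDR(demo_idr);

static int tiny_idr_alloc(struct tiny_idr *idr)
{
	assert(idr->initialized);	/* a bare `struct tiny_idr x;` trips */
	return idr->next++;
}
```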


* Re: [PATCH] mm: memcontrol: fix cgroup creation failure after many small jobs
  2016-06-16  3:42 [PATCH] mm: memcontrol: fix cgroup creation failure after many small jobs Johannes Weiner
  2016-06-16 20:06 ` Tejun Heo
  2016-06-17  9:06 ` [PATCH] " Vladimir Davydov
@ 2016-07-14 15:37 ` Johannes Weiner
  2 siblings, 0 replies; 14+ messages in thread
From: Johannes Weiner @ 2016-07-14 15:37 UTC (permalink / raw)
  To: Andrew Morton, Tejun Heo
  Cc: Vladimir Davydov, Michal Hocko, Li Zefan, linux-mm, cgroups,
	linux-kernel, kernel-team

Hi Andrew,

this issue dates back quite a bit and wasn't reported until now, so I
didn't tag it for stable. However, larger-scale setups now seem to be
running into it as they upgrade their kernels, and several people have
hit it independently. Could you please add:

Reported-by: John Garcia <john.garcia@mesosphere.io>
Fixes: b2052564e66d ("mm: memcontrol: continue cache reclaim from offlined groups")
CC: stable@kernel.org # 3.19+

and send it linusward?

Thank you

On Wed, Jun 15, 2016 at 11:42:44PM -0400, Johannes Weiner wrote:
> The memory controller has quite a bit of state that usually outlives
> the cgroup and pins its CSS until said state disappears. At the same
> time it imposes a 16-bit limit on the CSS ID space to economically
> store IDs in the wild. Consequently, when we use cgroups to contain
> frequent but small and short-lived jobs that leave behind some page
> cache, we quickly run into the 64k limitations of outstanding CSSs.
> Creating a new cgroup fails with -ENOSPC while there are only a few,
> or even no user-visible cgroups in existence.
> 
> Although pinning CSSs past cgroup removal is common, there are only
> two instances that actually need a CSS ID after a cgroup is deleted:
> cache shadow entries and swapout records.
> 
> Cache shadow entries reference the ID weakly and can deal with the CSS
> having disappeared when it's looked up later. They pose no hurdle.
> 
> Swap-out records do need to pin the css to hierarchically attribute
> swapins after the cgroup has been deleted; though the only pages that
> remain swapped out after a process exits are tmpfs/shmem pages. Those
> references are under the user's control and thus manageable.
> 
> This patch introduces a private 16bit memcg ID and switches swap and
> cache shadow entries over to using that. It then decouples the CSS
> lifetime from the CSS ID lifetime, such that a CSS ID can be recycled
> when the CSS is only pinned by common objects that don't need an ID.
> 
> This script demonstrates the problem by faulting one cache page in a
> new cgroup and deleting it again:
> 
> set -e
> mkdir -p pages
> for x in `seq 128000`; do
>   [ $((x % 1000)) -eq 0 ] && echo $x
>   mkdir /cgroup/foo
>   echo $$ >/cgroup/foo/cgroup.procs
>   echo trex >pages/$x
>   echo $$ >/cgroup/cgroup.procs
>   rmdir /cgroup/foo
> done
> 
> When run on an unpatched kernel, we eventually run out of possible CSS
> IDs even though no visible cgroups exist anymore:
> 
> [root@ham ~]# ./cssidstress.sh
> [...]
> 65000
> mkdir: cannot create directory '/cgroup/foo': No space left on device
> 
> After this patch, the CSS IDs get released upon cgroup destruction and
> the cache and css objects get released once memory reclaim kicks in.

