* [PATCH -mm v15 00/13] kmemcg shrinkers
@ 2014-02-05 18:39 ` Vladimir Davydov
  0 siblings, 0 replies; 44+ messages in thread
From: Vladimir Davydov @ 2014-02-05 18:39 UTC (permalink / raw)
  To: akpm
  Cc: dchinner, mhocko, hannes, glommer, rientjes, linux-kernel,
	linux-mm, devel

Hi,

This is the 15th iteration of Glauber Costa's patch-set implementing slab
shrinking on memcg pressure. The main idea is to make the list_lru structure
used by most FS shrinkers per-memcg. When adding or removing an element from a
list_lru, we use the page information to figure out which memcg it belongs to
and relay it to the appropriate list. This allows scanning kmem objects
accounted to different memcgs independently.
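
To give a very rough idea of the approach (a simplified sketch only; the
helpers memcg_from_page() and lru_list_of() are hypothetical, not code from
this series):

	/*
	 * Conceptually, a per-memcg list_lru keeps a separate list for each
	 * memcg next to the global per-node lists.  On add/del we look at the
	 * page backing the object to find the memcg it is accounted to and
	 * operate on that memcg's list, so each memcg can be scanned on its
	 * own.
	 */
	bool list_lru_add(struct list_lru *lru, struct list_head *item)
	{
		struct page *page = virt_to_page(item);
		struct mem_cgroup *memcg = memcg_from_page(page);	/* hypothetical */
		struct list_head *list = lru_list_of(lru, page_to_nid(page), memcg);

		/* ... take the lru lock, list_add_tail(item, list), bump the count ... */
	}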

Please note that this patch-set implements slab shrinking only when we hit the
user memory limit, so kmem allocations will still fail if we are below the user
memory limit but close to the kmem limit. I am going to fix this in a separate
patch-set; for now it is only worthwhile setting the kmem limit greater than
the user memory limit, just to enable per-memcg slab accounting and reclaim.

The patch-set is based on top of v3.14-rc1-mmots-2014-02-04-16-48 (there are
some vmscan cleanups that I need committed there) and organized as follows:
 - patches 1-4 introduce some minor changes to memcg needed for this set;
 - patches 5-7 prepare fs for per-memcg list_lru;
 - patch 8 implements the kmemcg reclaim core;
 - patch 9 makes list_lru per-memcg and patch 10 marks the sb shrinker memcg-aware;
 - patch 11 is trivial - it issues shrinkers on memcg destruction;
 - patches 12 and 13 introduce shrinking of dead kmem caches to facilitate
   memcg destruction.

Changes in v15:
 - remove patches that have been merged to -mm;
 - fix memory barrier usage in per-memcg list_lru implementation;
 - fix list_lru_destroy(), which might sleep for per-memcg lrus, being called
   from atomic context (__put_super()).

Previous iterations of this patch-set can be found here:
 - https://lkml.org/lkml/2013/12/16/206 (v14)
 - https://lkml.org/lkml/2013/12/9/103 (v13)
 - https://lkml.org/lkml/2013/12/2/141 (v12)
 - https://lkml.org/lkml/2013/11/25/214 (v11)

Comments are highly appreciated.

Thanks.

Glauber Costa (6):
  memcg: make cache index determination more robust
  memcg: consolidate callers of memcg_cache_id
  memcg: move initialization to memcg creation
  memcg: flush memcg items upon memcg destruction
  vmpressure: in-kernel notifications
  memcg: reap dead memcgs upon global memory pressure

Vladimir Davydov (7):
  memcg: make for_each_mem_cgroup macros public
  list_lru, shrinkers: introduce list_lru_shrink_{count,walk}
  fs: consolidate {nr,free}_cached_objects args in shrink_control
  fs: do not call destroy_super() in atomic context
  vmscan: shrink slab on memcg pressure
  list_lru: add per-memcg lists
  fs: make shrinker memcg aware

 fs/dcache.c                |   14 +-
 fs/gfs2/quota.c            |    6 +-
 fs/inode.c                 |    7 +-
 fs/internal.h              |    7 +-
 fs/super.c                 |   44 ++---
 fs/xfs/xfs_buf.c           |    7 +-
 fs/xfs/xfs_qm.c            |    7 +-
 fs/xfs/xfs_super.c         |    7 +-
 include/linux/fs.h         |    8 +-
 include/linux/list_lru.h   |  112 ++++++++++---
 include/linux/memcontrol.h |   50 ++++++
 include/linux/shrinker.h   |   10 +-
 include/linux/vmpressure.h |    5 +
 mm/list_lru.c              |  271 +++++++++++++++++++++++++++---
 mm/memcontrol.c            |  399 ++++++++++++++++++++++++++++++++++++++------
 mm/vmpressure.c            |   53 +++++-
 mm/vmscan.c                |   94 ++++++++---
 17 files changed, 926 insertions(+), 175 deletions(-)

-- 
1.7.10.4


* [PATCH -mm v15 01/13] memcg: make cache index determination more robust
  2014-02-05 18:39 ` Vladimir Davydov
@ 2014-02-05 18:39   ` Vladimir Davydov
  -1 siblings, 0 replies; 44+ messages in thread
From: Vladimir Davydov @ 2014-02-05 18:39 UTC (permalink / raw)
  To: akpm
  Cc: dchinner, mhocko, hannes, glommer, rientjes, linux-kernel,
	linux-mm, devel, Glauber Costa, Balbir Singh, KAMEZAWA Hiroyuki

From: Glauber Costa <glommer@openvz.org>

I caught myself doing something like the following outside memcg core:

	memcg_id = -1;
	if (memcg && memcg_kmem_is_active(memcg))
		memcg_id = memcg_cache_id(memcg);

to be able to handle all possible memcgs in a sane manner. In particular, the
root cache will have kmemcg_id = -1 (simply because we don't call
memcg_kmem_init for the root cache, since it is not limitable). We have always
coped with that by making sure we sanitize which cache is passed to
memcg_cache_id. Although this example is given for root, what we really need
to know is whether or not the memcg is kmem-active.

But outside the memcg core, testing for root, for instance, is not trivial
since we don't export mem_cgroup_is_root. I ended up realizing that these tests
really belong inside memcg_cache_id. This patch moves a similar but stronger
test inside memcg_cache_id and makes sure it always returns a meaningful value.
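
With this in place, a caller outside the memcg core can rely on the return
value alone, e.g. (illustrative snippet, not part of the patch itself):

	int memcg_id = memcg_cache_id(memcg);

	if (memcg_id < 0) {
		/* root memcg, or kmem accounting is not active for it */
		return;
	}
	/* memcg_id is a valid index/name suffix here */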

Signed-off-by: Glauber Costa <glommer@openvz.org>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 53385cd4e6f0..75758fc5c50c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3110,7 +3110,9 @@ static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size)
  */
 int memcg_cache_id(struct mem_cgroup *memcg)
 {
-	return memcg ? memcg->kmemcg_id : -1;
+	if (!memcg || !memcg_can_account_kmem(memcg))
+		return -1;
+	return memcg->kmemcg_id;
 }
 
 static size_t memcg_caches_array_size(int num_groups)
-- 
1.7.10.4


* [PATCH -mm v15 02/13] memcg: consolidate callers of memcg_cache_id
  2014-02-05 18:39 ` Vladimir Davydov
@ 2014-02-05 18:39   ` Vladimir Davydov
  -1 siblings, 0 replies; 44+ messages in thread
From: Vladimir Davydov @ 2014-02-05 18:39 UTC (permalink / raw)
  To: akpm
  Cc: dchinner, mhocko, hannes, glommer, rientjes, linux-kernel,
	linux-mm, devel, Glauber Costa, Balbir Singh, KAMEZAWA Hiroyuki

From: Glauber Costa <glommer@openvz.org>

Each caller of memcg_cache_id ends up sanitizing its parameters in its own way.
Now that memcg_cache_id itself is more robust, we can consolidate this.

Also, as suggested by Michal, a special helper, memcg_cache_idx, is used when
the result is expected to be used directly as an array index, to make sure we
never use a negative index.
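
The intended split is roughly the following (illustrative; both patterns can
be seen in the hunks below):

	/* outside the core, where "not kmem-active" is a valid answer */
	id = memcg_cache_id(memcg);
	if (id < 0)
		goto out;

	/* inside the core, where the id must be valid and is used as an index */
	cachep = cache_from_memcg_idx(root_cache, memcg_cache_idx(memcg));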

Signed-off-by: Glauber Costa <glommer@openvz.org>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   48 +++++++++++++++++++++++++++++++-----------------
 1 file changed, 31 insertions(+), 17 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 75758fc5c50c..9d1245dc993a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3008,6 +3008,30 @@ static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
 }
 
 /*
+ * helper for acessing a memcg's index. It will be used as an index in the
+ * child cache array in kmem_cache, and also to derive its name. This function
+ * will return -1 when this is not a kmem-limited memcg.
+ */
+int memcg_cache_id(struct mem_cgroup *memcg)
+{
+	if (!memcg || !memcg_can_account_kmem(memcg))
+		return -1;
+	return memcg->kmemcg_id;
+}
+
+/*
+ * This helper around memcg_cache_id is not intented for use outside memcg
+ * core. It is meant for places where the cache id is used directly as an array
+ * index
+ */
+static int memcg_cache_idx(struct mem_cgroup *memcg)
+{
+	int ret = memcg_cache_id(memcg);
+	BUG_ON(ret < 0);
+	return ret;
+}
+
+/*
  * This is a bit cumbersome, but it is rarely used and avoids a backpointer
  * in the memcg_cache_params struct.
  */
@@ -3017,7 +3041,7 @@ static struct kmem_cache *memcg_params_to_cache(struct memcg_cache_params *p)
 
 	VM_BUG_ON(p->is_root_cache);
 	cachep = p->root_cache;
-	return cache_from_memcg_idx(cachep, memcg_cache_id(p->memcg));
+	return cache_from_memcg_idx(cachep, memcg_cache_idx(p->memcg));
 }
 
 #ifdef CONFIG_SLABINFO
@@ -3103,18 +3127,6 @@ static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size)
 		css_put(&memcg->css);
 }
 
-/*
- * helper for acessing a memcg's index. It will be used as an index in the
- * child cache array in kmem_cache, and also to derive its name. This function
- * will return -1 when this is not a kmem-limited memcg.
- */
-int memcg_cache_id(struct mem_cgroup *memcg)
-{
-	if (!memcg || !memcg_can_account_kmem(memcg))
-		return -1;
-	return memcg->kmemcg_id;
-}
-
 static size_t memcg_caches_array_size(int num_groups)
 {
 	ssize_t size;
@@ -3246,7 +3258,7 @@ void memcg_register_cache(struct kmem_cache *s)
 
 	root = s->memcg_params->root_cache;
 	memcg = s->memcg_params->memcg;
-	id = memcg_cache_id(memcg);
+	id = memcg_cache_idx(memcg);
 
 	css_get(&memcg->css);
 
@@ -3288,7 +3300,7 @@ void memcg_unregister_cache(struct kmem_cache *s)
 
 	root = s->memcg_params->root_cache;
 	memcg = s->memcg_params->memcg;
-	id = memcg_cache_id(memcg);
+	id = memcg_cache_idx(memcg);
 
 	mutex_lock(&memcg->slab_caches_mutex);
 	list_del(&s->memcg_params->list);
@@ -3573,6 +3585,7 @@ struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep,
 {
 	struct mem_cgroup *memcg;
 	struct kmem_cache *memcg_cachep;
+	int id;
 
 	VM_BUG_ON(!cachep->memcg_params);
 	VM_BUG_ON(!cachep->memcg_params->is_root_cache);
@@ -3583,10 +3596,11 @@ struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep,
 	rcu_read_lock();
 	memcg = mem_cgroup_from_task(rcu_dereference(current->mm->owner));
 
-	if (!memcg_can_account_kmem(memcg))
+	id = memcg_cache_id(memcg);
+	if (id < 0)
 		goto out;
 
-	memcg_cachep = cache_from_memcg_idx(cachep, memcg_cache_id(memcg));
+	memcg_cachep = cache_from_memcg_idx(cachep, id);
 	if (likely(memcg_cachep)) {
 		cachep = memcg_cachep;
 		goto out;
-- 
1.7.10.4


* [PATCH -mm v15 03/13] memcg: move initialization to memcg creation
  2014-02-05 18:39 ` Vladimir Davydov
@ 2014-02-05 18:39   ` Vladimir Davydov
  -1 siblings, 0 replies; 44+ messages in thread
From: Vladimir Davydov @ 2014-02-05 18:39 UTC (permalink / raw)
  To: akpm
  Cc: dchinner, mhocko, hannes, glommer, rientjes, linux-kernel,
	linux-mm, devel, Glauber Costa, Balbir Singh, KAMEZAWA Hiroyuki

From: Glauber Costa <glommer@openvz.org>

Those structures are only used for memcgs that are effectively using
kmemcg. However, in a later patch I intend to scan that list unconditionally
(an empty list simply meaning no kmem caches are present), which simplifies
the code a lot.

So move the initialization to memcg creation.
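
The effect is that later code can walk the list for any memcg without first
checking whether kmem accounting was ever activated for it (illustrative
sketch using the fields touched by this patch):

	struct memcg_cache_params *params;

	mutex_lock(&memcg->slab_caches_mutex);
	list_for_each_entry(params, &memcg->memcg_slab_caches, list) {
		/* an empty list simply means this memcg has no kmem caches */
	}
	mutex_unlock(&memcg->slab_caches_mutex);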

Signed-off-by: Glauber Costa <glommer@openvz.org>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9d1245dc993a..deb5b9bb6188 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5210,8 +5210,6 @@ static int __memcg_activate_kmem(struct mem_cgroup *memcg,
 		goto out_rmid;
 
 	memcg->kmemcg_id = memcg_id;
-	INIT_LIST_HEAD(&memcg->memcg_slab_caches);
-	mutex_init(&memcg->slab_caches_mutex);
 
 	/*
 	 * We couldn't have accounted to this cgroup, because it hasn't got the
@@ -5958,6 +5956,9 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 	int ret;
 
 	memcg->kmemcg_id = -1;
+	INIT_LIST_HEAD(&memcg->memcg_slab_caches);
+	mutex_init(&memcg->slab_caches_mutex);
+
 	ret = memcg_propagate_kmem(memcg);
 	if (ret)
 		return ret;
-- 
1.7.10.4


* [PATCH -mm v15 04/13] memcg: make for_each_mem_cgroup macros public
  2014-02-05 18:39 ` Vladimir Davydov
@ 2014-02-05 18:39   ` Vladimir Davydov
  -1 siblings, 0 replies; 44+ messages in thread
From: Vladimir Davydov @ 2014-02-05 18:39 UTC (permalink / raw)
  To: akpm
  Cc: dchinner, mhocko, hannes, glommer, rientjes, linux-kernel,
	linux-mm, devel, Glauber Costa, Balbir Singh, KAMEZAWA Hiroyuki

I am going to use these macros in the next patches, so let's move them to
the header. These macros are very handy and they depend only on
mem_cgroup_iter(), which is already public, so I guess it's worth it.
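
For reference, a typical user outside the memcg core would look like this
(illustrative; want_to_stop() is a made-up predicate):

	struct mem_cgroup *iter;

	for_each_mem_cgroup(iter) {
		if (want_to_stop(iter)) {
			/* exiting early, so drop the reference explicitly */
			mem_cgroup_iter_break(NULL, iter);
			break;
		}
	}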

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h |   15 +++++++++++++++
 mm/memcontrol.c            |   15 ---------------
 2 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index abd0113b6620..0503b59c3fad 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -53,6 +53,21 @@ struct mem_cgroup_reclaim_cookie {
 	unsigned int generation;
 };
 
+/*
+ * Iteration constructs for visiting all cgroups (under a tree).  If
+ * loops are exited prematurely (break), mem_cgroup_iter_break() must
+ * be used for reference counting.
+ */
+#define for_each_mem_cgroup_tree(iter, root)		\
+	for (iter = mem_cgroup_iter(root, NULL, NULL);	\
+	     iter != NULL;				\
+	     iter = mem_cgroup_iter(root, iter, NULL))
+
+#define for_each_mem_cgroup(iter)			\
+	for (iter = mem_cgroup_iter(NULL, NULL, NULL);	\
+	     iter != NULL;				\
+	     iter = mem_cgroup_iter(NULL, iter, NULL))
+
 #ifdef CONFIG_MEMCG
 /*
  * All "charge" functions with gfp_mask should use GFP_KERNEL or
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index deb5b9bb6188..854d0b8e3c45 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1297,21 +1297,6 @@ void mem_cgroup_iter_break(struct mem_cgroup *root,
 		css_put(&prev->css);
 }
 
-/*
- * Iteration constructs for visiting all cgroups (under a tree).  If
- * loops are exited prematurely (break), mem_cgroup_iter_break() must
- * be used for reference counting.
- */
-#define for_each_mem_cgroup_tree(iter, root)		\
-	for (iter = mem_cgroup_iter(root, NULL, NULL);	\
-	     iter != NULL;				\
-	     iter = mem_cgroup_iter(root, iter, NULL))
-
-#define for_each_mem_cgroup(iter)			\
-	for (iter = mem_cgroup_iter(NULL, NULL, NULL);	\
-	     iter != NULL;				\
-	     iter = mem_cgroup_iter(NULL, iter, NULL))
-
 void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
 {
 	struct mem_cgroup *memcg;
-- 
1.7.10.4


* [PATCH -mm v15 05/13] list_lru, shrinkers: introduce list_lru_shrink_{count,walk}
  2014-02-05 18:39 ` Vladimir Davydov
@ 2014-02-05 18:39   ` Vladimir Davydov
  -1 siblings, 0 replies; 44+ messages in thread
From: Vladimir Davydov @ 2014-02-05 18:39 UTC (permalink / raw)
  To: akpm
  Cc: dchinner, mhocko, hannes, glommer, rientjes, linux-kernel,
	linux-mm, devel, Glauber Costa, Al Viro

NUMA aware slab shrinkers use the list_lru structure to distribute
objects coming from different NUMA nodes to different lists. Whenever
such a shrinker needs to count or scan objects from a particular node,
it issues commands like this:

        count = list_lru_count_node(lru, sc->nid);
        freed = list_lru_walk_node(lru, sc->nid, isolate_func,
                                   isolate_arg, &sc->nr_to_scan);

where sc is an instance of the shrink_control structure passed to it
from vmscan.

To simplify this, let's add the special list_lru functions to be used by
shrinkers, list_lru_shrink_count() and list_lru_shrink_walk(), which
consolidate the nid and nr_to_scan arguments in the shrink_control
structure.
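
With the new helpers, the same calls become:

        count = list_lru_shrink_count(lru, sc);
        freed = list_lru_shrink_walk(lru, sc, isolate_func, isolate_arg);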

This will also allow us to avoid patching shrinkers that use list_lru
when we make shrink_slab() per-memcg - all we will have to do is extend
the shrink_control structure to include the memcg to scan from and make
list_lru_shrink_{count,walk} handle this appropriately.

Thanks to David Chinner for the tip.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/dcache.c              |   14 ++++++--------
 fs/gfs2/quota.c          |    6 +++---
 fs/inode.c               |    7 +++----
 fs/internal.h            |    7 +++----
 fs/super.c               |   22 ++++++++++------------
 fs/xfs/xfs_buf.c         |    7 +++----
 fs/xfs/xfs_qm.c          |    7 +++----
 include/linux/list_lru.h |   16 ++++++++++++++++
 8 files changed, 47 insertions(+), 39 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 265e0ce9769c..4bc85f96a87d 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -972,24 +972,22 @@ dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg)
 /**
  * prune_dcache_sb - shrink the dcache
  * @sb: superblock
- * @nr_to_scan : number of entries to try to free
- * @nid: which node to scan for freeable entities
+ * @sc: shrink control, passed to list_lru_shrink_walk()
  *
- * Attempt to shrink the superblock dcache LRU by @nr_to_scan entries. This is
- * done when we need more memory an called from the superblock shrinker
+ * Attempt to shrink the superblock dcache LRU by @sc->nr_to_scan entries. This
+ * is done when we need more memory and called from the superblock shrinker
  * function.
  *
  * This function may fail to free any resources if all the dentries are in
  * use.
  */
-long prune_dcache_sb(struct super_block *sb, unsigned long nr_to_scan,
-		     int nid)
+long prune_dcache_sb(struct super_block *sb, struct shrink_control *sc)
 {
 	LIST_HEAD(dispose);
 	long freed;
 
-	freed = list_lru_walk_node(&sb->s_dentry_lru, nid, dentry_lru_isolate,
-				       &dispose, &nr_to_scan);
+	freed = list_lru_shrink_walk(&sb->s_dentry_lru, sc,
+				     dentry_lru_isolate, &dispose);
 	shrink_dentry_list(&dispose);
 	return freed;
 }
diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index 8bec0e3192dd..8746393aed88 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -169,8 +169,8 @@ static unsigned long gfs2_qd_shrink_scan(struct shrinker *shrink,
 	if (!(sc->gfp_mask & __GFP_FS))
 		return SHRINK_STOP;
 
-	freed = list_lru_walk_node(&gfs2_qd_lru, sc->nid, gfs2_qd_isolate,
-				   &dispose, &sc->nr_to_scan);
+	freed = list_lru_shrink_walk(&gfs2_qd_lru, sc,
+				     gfs2_qd_isolate, &dispose);
 
 	gfs2_qd_dispose(&dispose);
 
@@ -180,7 +180,7 @@ static unsigned long gfs2_qd_shrink_scan(struct shrinker *shrink,
 static unsigned long gfs2_qd_shrink_count(struct shrinker *shrink,
 					  struct shrink_control *sc)
 {
-	return vfs_pressure_ratio(list_lru_count_node(&gfs2_qd_lru, sc->nid));
+	return vfs_pressure_ratio(list_lru_shrink_count(&gfs2_qd_lru, sc));
 }
 
 struct shrinker gfs2_qd_shrinker = {
diff --git a/fs/inode.c b/fs/inode.c
index e6905152c39f..890e4f9b1590 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -748,14 +748,13 @@ inode_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg)
  * to trim from the LRU. Inodes to be freed are moved to a temporary list and
  * then are freed outside inode_lock by dispose_list().
  */
-long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan,
-		     int nid)
+long prune_icache_sb(struct super_block *sb, struct shrink_control *sc)
 {
 	LIST_HEAD(freeable);
 	long freed;
 
-	freed = list_lru_walk_node(&sb->s_inode_lru, nid, inode_lru_isolate,
-				       &freeable, &nr_to_scan);
+	freed = list_lru_shrink_walk(&sb->s_inode_lru, sc,
+				     inode_lru_isolate, &freeable);
 	dispose_list(&freeable);
 	return freed;
 }
diff --git a/fs/internal.h b/fs/internal.h
index 465742407466..3db5f6e41cd7 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -14,6 +14,7 @@ struct file_system_type;
 struct linux_binprm;
 struct path;
 struct mount;
+struct shrink_control;
 
 /*
  * block_dev.c
@@ -107,8 +108,7 @@ extern int open_check_o_direct(struct file *f);
  * inode.c
  */
 extern spinlock_t inode_sb_list_lock;
-extern long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan,
-			    int nid);
+extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc);
 extern void inode_add_lru(struct inode *inode);
 
 /*
@@ -125,8 +125,7 @@ extern int invalidate_inodes(struct super_block *, bool);
  */
 extern struct dentry *__d_alloc(struct super_block *, const struct qstr *);
 extern int d_set_mounted(struct dentry *dentry);
-extern long prune_dcache_sb(struct super_block *sb, unsigned long nr_to_scan,
-			    int nid);
+extern long prune_dcache_sb(struct super_block *sb, struct shrink_control *sc);
 
 /*
  * read_write.c
diff --git a/fs/super.c b/fs/super.c
index 80d5cf2ca765..0688f3eaf012 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -78,27 +78,27 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 	if (sb->s_op->nr_cached_objects)
 		fs_objects = sb->s_op->nr_cached_objects(sb, sc->nid);
 
-	inodes = list_lru_count_node(&sb->s_inode_lru, sc->nid);
-	dentries = list_lru_count_node(&sb->s_dentry_lru, sc->nid);
+	inodes = list_lru_shrink_count(&sb->s_inode_lru, sc);
+	dentries = list_lru_shrink_count(&sb->s_dentry_lru, sc);
 	total_objects = dentries + inodes + fs_objects + 1;
 
 	/* proportion the scan between the caches */
 	dentries = mult_frac(sc->nr_to_scan, dentries, total_objects);
 	inodes = mult_frac(sc->nr_to_scan, inodes, total_objects);
+	fs_objects = mult_frac(sc->nr_to_scan, fs_objects, total_objects);
 
 	/*
 	 * prune the dcache first as the icache is pinned by it, then
 	 * prune the icache, followed by the filesystem specific caches
 	 */
-	freed = prune_dcache_sb(sb, dentries, sc->nid);
-	freed += prune_icache_sb(sb, inodes, sc->nid);
+	sc->nr_to_scan = dentries;
+	freed = prune_dcache_sb(sb, sc);
+	sc->nr_to_scan = inodes;
+	freed += prune_icache_sb(sb, sc);
 
-	if (fs_objects) {
-		fs_objects = mult_frac(sc->nr_to_scan, fs_objects,
-								total_objects);
+	if (fs_objects)
 		freed += sb->s_op->free_cached_objects(sb, fs_objects,
 						       sc->nid);
-	}
 
 	drop_super(sb);
 	return freed;
@@ -119,10 +119,8 @@ static unsigned long super_cache_count(struct shrinker *shrink,
 		total_objects = sb->s_op->nr_cached_objects(sb,
 						 sc->nid);
 
-	total_objects += list_lru_count_node(&sb->s_dentry_lru,
-						 sc->nid);
-	total_objects += list_lru_count_node(&sb->s_inode_lru,
-						 sc->nid);
+	total_objects += list_lru_shrink_count(&sb->s_dentry_lru, sc);
+	total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc);
 
 	total_objects = vfs_pressure_ratio(total_objects);
 	drop_super(sb);
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 9c061ef2b0d9..b52ea989a2a4 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1554,10 +1554,9 @@ xfs_buftarg_shrink_scan(
 					struct xfs_buftarg, bt_shrinker);
 	LIST_HEAD(dispose);
 	unsigned long		freed;
-	unsigned long		nr_to_scan = sc->nr_to_scan;
 
-	freed = list_lru_walk_node(&btp->bt_lru, sc->nid, xfs_buftarg_isolate,
-				       &dispose, &nr_to_scan);
+	freed = list_lru_shrink_walk(&btp->bt_lru, sc,
+				     xfs_buftarg_isolate, &dispose);
 
 	while (!list_empty(&dispose)) {
 		struct xfs_buf *bp;
@@ -1576,7 +1575,7 @@ xfs_buftarg_shrink_count(
 {
 	struct xfs_buftarg	*btp = container_of(shrink,
 					struct xfs_buftarg, bt_shrinker);
-	return list_lru_count_node(&btp->bt_lru, sc->nid);
+	return list_lru_shrink_count(&btp->bt_lru, sc);
 }
 
 void
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index 348e4d2ed6e6..b4a33b7ab597 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -790,7 +790,6 @@ xfs_qm_shrink_scan(
 	struct xfs_qm_isolate	isol;
 	unsigned long		freed;
 	int			error;
-	unsigned long		nr_to_scan = sc->nr_to_scan;
 
 	if ((sc->gfp_mask & (__GFP_FS|__GFP_WAIT)) != (__GFP_FS|__GFP_WAIT))
 		return 0;
@@ -798,8 +797,8 @@ xfs_qm_shrink_scan(
 	INIT_LIST_HEAD(&isol.buffers);
 	INIT_LIST_HEAD(&isol.dispose);
 
-	freed = list_lru_walk_node(&qi->qi_lru, sc->nid, xfs_qm_dquot_isolate, &isol,
-					&nr_to_scan);
+	freed = list_lru_shrink_walk(&qi->qi_lru, sc,
+				     xfs_qm_dquot_isolate, &isol);
 
 	error = xfs_buf_delwri_submit(&isol.buffers);
 	if (error)
@@ -824,7 +823,7 @@ xfs_qm_shrink_count(
 	struct xfs_quotainfo	*qi = container_of(shrink,
 					struct xfs_quotainfo, qi_shrinker);
 
-	return list_lru_count_node(&qi->qi_lru, sc->nid);
+	return list_lru_shrink_count(&qi->qi_lru, sc);
 }
 
 /*
diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index b02fc233eadd..6ca43b2486fc 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -9,6 +9,7 @@
 
 #include <linux/list.h>
 #include <linux/nodemask.h>
+#include <linux/shrinker.h>
 
 /* list_lru_walk_cb has to always return one of those */
 enum lru_status {
@@ -77,6 +78,13 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item);
  * Callers that want such a guarantee need to provide an outer lock.
  */
 unsigned long list_lru_count_node(struct list_lru *lru, int nid);
+
+static inline unsigned long list_lru_shrink_count(struct list_lru *lru,
+						  struct shrink_control *sc)
+{
+	return list_lru_count_node(lru, sc->nid);
+}
+
 static inline unsigned long list_lru_count(struct list_lru *lru)
 {
 	long count = 0;
@@ -116,6 +124,14 @@ unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
 				 unsigned long *nr_to_walk);
 
 static inline unsigned long
+list_lru_shrink_walk(struct list_lru *lru, struct shrink_control *sc,
+		     list_lru_walk_cb isolate, void *cb_arg)
+{
+	return list_lru_walk_node(lru, sc->nid, isolate, cb_arg,
+				  &sc->nr_to_scan);
+}
+
+static inline unsigned long
 list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
 	      void *cb_arg, unsigned long nr_to_walk)
 {
-- 
1.7.10.4


* [PATCH -mm v15 06/13] fs: consolidate {nr,free}_cached_objects args in shrink_control
  2014-02-05 18:39 ` Vladimir Davydov
@ 2014-02-05 18:39   ` Vladimir Davydov
  -1 siblings, 0 replies; 44+ messages in thread
From: Vladimir Davydov @ 2014-02-05 18:39 UTC (permalink / raw)
  To: akpm
  Cc: dchinner, mhocko, hannes, glommer, rientjes, linux-kernel,
	linux-mm, devel, Glauber Costa, Al Viro

We are going to make the FS shrinker memcg-aware. To achieve that, we
will have to pass the memcg to scan to the nr_cached_objects and
free_cached_objects VFS methods, which currently take only the NUMA node
to scan. Since the shrink_control structure already holds the node, and
the memcg to scan will be added to it as we introduce memcg-aware
vmscan, let us consolidate the methods' arguments in this structure to
keep things clean.
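
For a filesystem implementing these hooks, the conversion boils down to taking
the values from the shrink_control (a sketch modelled on the xfs hunk below;
example_count_reclaimable() and example_reclaim() stand in for the
filesystem's own helpers):

	static long example_nr_cached_objects(struct super_block *sb,
					      struct shrink_control *sc)
	{
		return example_count_reclaimable(sb);
	}

	static long example_free_cached_objects(struct super_block *sb,
						struct shrink_control *sc)
	{
		return example_reclaim(sb, sc->nr_to_scan);
	}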

Thanks to David Chinner for the tip.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/super.c         |   12 ++++++------
 fs/xfs/xfs_super.c |    7 +++----
 include/linux/fs.h |    6 ++++--
 3 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 0688f3eaf012..ff9ff5fad70c 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -76,7 +76,7 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 		return SHRINK_STOP;
 
 	if (sb->s_op->nr_cached_objects)
-		fs_objects = sb->s_op->nr_cached_objects(sb, sc->nid);
+		fs_objects = sb->s_op->nr_cached_objects(sb, sc);
 
 	inodes = list_lru_shrink_count(&sb->s_inode_lru, sc);
 	dentries = list_lru_shrink_count(&sb->s_dentry_lru, sc);
@@ -96,9 +96,10 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 	sc->nr_to_scan = inodes;
 	freed += prune_icache_sb(sb, sc);
 
-	if (fs_objects)
-		freed += sb->s_op->free_cached_objects(sb, fs_objects,
-						       sc->nid);
+	if (fs_objects) {
+		sc->nr_to_scan = fs_objects;
+		freed += sb->s_op->free_cached_objects(sb, sc);
+	}
 
 	drop_super(sb);
 	return freed;
@@ -116,8 +117,7 @@ static unsigned long super_cache_count(struct shrinker *shrink,
 		return 0;
 
 	if (sb->s_op && sb->s_op->nr_cached_objects)
-		total_objects = sb->s_op->nr_cached_objects(sb,
-						 sc->nid);
+		total_objects = sb->s_op->nr_cached_objects(sb, sc);
 
 	total_objects += list_lru_shrink_count(&sb->s_dentry_lru, sc);
 	total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 01ee44444885..4bc182b29c8f 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1524,7 +1524,7 @@ xfs_fs_mount(
 static long
 xfs_fs_nr_cached_objects(
 	struct super_block	*sb,
-	int			nid)
+	struct shrink_control	*sc)
 {
 	return xfs_reclaim_inodes_count(XFS_M(sb));
 }
@@ -1532,10 +1532,9 @@ xfs_fs_nr_cached_objects(
 static long
 xfs_fs_free_cached_objects(
 	struct super_block	*sb,
-	long			nr_to_scan,
-	int			nid)
+	struct shrink_control	*sc)
 {
-	return xfs_reclaim_inodes_nr(XFS_M(sb), nr_to_scan);
+	return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
 }
 
 static const struct super_operations xfs_super_operations = {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9b613a828b53..3a87b0254408 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1624,8 +1624,10 @@ struct super_operations {
 	ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
 #endif
 	int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
-	long (*nr_cached_objects)(struct super_block *, int);
-	long (*free_cached_objects)(struct super_block *, long, int);
+	long (*nr_cached_objects)(struct super_block *,
+				  struct shrink_control *);
+	long (*free_cached_objects)(struct super_block *,
+				    struct shrink_control *);
 };
 
 /*
-- 
1.7.10.4



* [PATCH -mm v15 07/13] fs: do not call destroy_super() in atomic context
  2014-02-05 18:39 ` Vladimir Davydov
@ 2014-02-05 18:39   ` Vladimir Davydov
  -1 siblings, 0 replies; 44+ messages in thread
From: Vladimir Davydov @ 2014-02-05 18:39 UTC (permalink / raw)
  To: akpm
  Cc: dchinner, mhocko, hannes, glommer, rientjes, linux-kernel,
	linux-mm, devel, Glauber Costa, Al Viro

To make list_lru per-memcg, I'll have to add code that might sleep to
list_lru_destroy(), but that is currently impossible, because
list_lru_destroy() is called by destroy_super(), which can be invoked
in atomic context from __put_super().

To overcome this, in this patch I make __put_super() schedule the super
block destruction as an asynchronous work item instead of calling
destroy_super() directly.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/super.c         |   10 +++++++++-
 include/linux/fs.h |    2 ++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/super.c b/fs/super.c
index ff9ff5fad70c..33cbff3769e7 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -147,6 +147,13 @@ static void destroy_super(struct super_block *s)
 	kfree_rcu(s, rcu);
 }
 
+static void destroy_super_work_func(struct work_struct *w)
+{
+	struct super_block *s = container_of(w, struct super_block, destroy);
+
+	destroy_super(s);
+}
+
 /**
  *	alloc_super	-	create new superblock
  *	@type:	filesystem type superblock should belong to
@@ -182,6 +189,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 	INIT_HLIST_NODE(&s->s_instances);
 	INIT_HLIST_BL_HEAD(&s->s_anon);
 	INIT_LIST_HEAD(&s->s_inodes);
+	INIT_WORK(&s->destroy, destroy_super_work_func);
 
 	if (list_lru_init(&s->s_dentry_lru))
 		goto fail;
@@ -239,7 +247,7 @@ static void __put_super(struct super_block *sb)
 {
 	if (!--sb->s_count) {
 		list_del_init(&sb->s_list);
-		destroy_super(sb);
+		schedule_work(&sb->destroy);
 	}
 }
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3a87b0254408..200cbf804335 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1328,7 +1328,9 @@ struct super_block {
 	 */
 	struct list_lru		s_dentry_lru ____cacheline_aligned_in_smp;
 	struct list_lru		s_inode_lru ____cacheline_aligned_in_smp;
+
 	struct rcu_head		rcu;
+	struct work_struct	destroy;
 };
 
 extern struct timespec current_fs_time(struct super_block *sb);
-- 
1.7.10.4



* [PATCH -mm v15 08/13] vmscan: shrink slab on memcg pressure
  2014-02-05 18:39 ` Vladimir Davydov
@ 2014-02-05 18:39   ` Vladimir Davydov
  -1 siblings, 0 replies; 44+ messages in thread
From: Vladimir Davydov @ 2014-02-05 18:39 UTC (permalink / raw)
  To: akpm
  Cc: dchinner, mhocko, hannes, glommer, rientjes, linux-kernel,
	linux-mm, devel, Glauber Costa, Mel Gorman, Rik van Riel,
	Al Viro, Balbir Singh, KAMEZAWA Hiroyuki

This patch makes the direct reclaim path shrink slab not only on global
memory pressure, but also when we reach the user memory limit of a
memcg. To achieve that, it makes shrink_slab() walk over the memcg
hierarchy and run shrinkers marked as memcg-aware on the target memcg
and all its descendants. The memcg to scan is passed in a shrink_control
structure; memcg-unaware shrinkers are still called only on global
memory pressure with memcg=NULL. It is up to the shrinker how to
organize the objects it is responsible for to achieve per-memcg reclaim.

Note that we do not intend to implement true per-memcg per-node reclaim.
Most memcgs are small and typically confined to one or two NUMA nodes by
external means, so they do not need the scalability that NUMA-aware
shrinkers provide. We therefore do per-node shrinking only for the
global list (memcg=NULL), while each per-memcg list is scanned exactly
once, with nid=0, irrespective of the nodemask.

The idea behind the patch, as well as the initial implementation,
belongs to Glauber Costa.
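
Seen from a shrinker author's point of view, the resulting contract can
be sketched as follows (foo_count_objects() and foo_reclaim_objects()
are hypothetical helpers standing in for whatever per-node/per-memcg
bookkeeping the shrinker does; real users are converted on top of
list_lru in the following patches):

static unsigned long foo_count(struct shrinker *shrink,
			       struct shrink_control *sc)
{
	/*
	 * sc->memcg is NULL on global pressure and points to the memcg
	 * currently being scanned otherwise.  Per-memcg invocations
	 * always come with sc->nid == 0, as explained above.
	 */
	return foo_count_objects(sc->nid, sc->memcg);
}

static unsigned long foo_scan(struct shrinker *shrink,
			      struct shrink_control *sc)
{
	return foo_reclaim_objects(sc->nid, sc->memcg, sc->nr_to_scan);
}

static struct shrinker foo_shrinker = {
	.count_objects	= foo_count,
	.scan_objects	= foo_scan,
	.seeks		= DEFAULT_SEEKS,
	/* without SHRINKER_MEMCG_AWARE only global pressure is seen */
	.flags		= SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE,
};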

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h |   22 +++++++++++
 include/linux/shrinker.h   |   10 ++++-
 mm/memcontrol.c            |   37 ++++++++++++++++-
 mm/vmscan.c                |   94 ++++++++++++++++++++++++++++++++++----------
 4 files changed, 141 insertions(+), 22 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0503b59c3fad..fc4a24d31e99 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -95,6 +95,9 @@ extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
 
+unsigned long mem_cgroup_zone_reclaimable_pages(struct zone *,
+						struct mem_cgroup *);
+
 /* For coalescing uncharge for reducing memcg' overhead*/
 extern void mem_cgroup_uncharge_start(void);
 extern void mem_cgroup_uncharge_end(void);
@@ -304,6 +307,12 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
 	return &zone->lruvec;
 }
 
+static inline unsigned long mem_cgroup_zone_reclaimable_pages(struct zone *zone,
+							struct mem_cgroup *)
+{
+	return 0;
+}
+
 static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 {
 	return NULL;
@@ -494,6 +503,9 @@ static inline bool memcg_kmem_enabled(void)
 	return static_key_false(&memcg_kmem_enabled_key);
 }
 
+bool memcg_kmem_is_active(struct mem_cgroup *memcg);
+bool memcg_kmem_should_reclaim(struct mem_cgroup *memcg);
+
 /*
  * In general, we'll do everything in our power to not incur in any overhead
  * for non-memcg users for the kmem functions. Not even a function call, if we
@@ -636,6 +648,16 @@ static inline bool memcg_kmem_enabled(void)
 	return false;
 }
 
+static inline bool memcg_kmem_is_active(struct mem_cgroup *memcg)
+{
+	return false;
+}
+
+static inline bool memcg_kmem_should_reclaim(struct mem_cgroup *memcg)
+{
+	return false;
+}
+
 static inline bool
 memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
 {
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 68c097077ef0..ab79b174bfbe 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -20,8 +20,15 @@ struct shrink_control {
 
 	/* shrink from these nodes */
 	nodemask_t nodes_to_scan;
+
+	/* shrink from this memory cgroup hierarchy (if not NULL) */
+	struct mem_cgroup *target_mem_cgroup;
+
 	/* current node being shrunk (for NUMA aware shrinkers) */
 	int nid;
+
+	/* current memcg being shrunk (for memcg aware shrinkers) */
+	struct mem_cgroup *memcg;
 };
 
 #define SHRINK_STOP (~0UL)
@@ -63,7 +70,8 @@ struct shrinker {
 #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
 
 /* Flags */
-#define SHRINKER_NUMA_AWARE (1 << 0)
+#define SHRINKER_NUMA_AWARE	(1 << 0)
+#define SHRINKER_MEMCG_AWARE	(1 << 1)
 
 extern int register_shrinker(struct shrinker *);
 extern void unregister_shrinker(struct shrinker *);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 854d0b8e3c45..24557d09213c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -392,7 +392,7 @@ static inline void memcg_kmem_set_active(struct mem_cgroup *memcg)
 	set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
 }
 
-static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
+bool memcg_kmem_is_active(struct mem_cgroup *memcg)
 {
 	return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
 }
@@ -1354,6 +1354,26 @@ out:
 	return lruvec;
 }
 
+unsigned long mem_cgroup_zone_reclaimable_pages(struct zone *zone,
+						struct mem_cgroup *memcg)
+{
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+	unsigned long nr = 0;
+	struct mem_cgroup *iter;
+
+	iter = memcg;
+	do {
+		nr += mem_cgroup_zone_nr_lru_pages(iter, nid, zid,
+						   LRU_ALL_FILE);
+		if (do_swap_account)
+			nr += mem_cgroup_zone_nr_lru_pages(iter, nid, zid,
+							   LRU_ALL_ANON);
+		iter = mem_cgroup_iter(memcg, iter, NULL);
+	} while (iter);
+	return nr;
+}
+
 /*
  * Following LRU functions are allowed to be used without PCG_LOCK.
  * Operations are called by routine of global LRU independently from memcg.
@@ -2992,6 +3012,21 @@ static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
 		memcg_kmem_is_active(memcg);
 }
 
+bool memcg_kmem_should_reclaim(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *iter;
+
+	iter = memcg;
+	do {
+		if (memcg_kmem_is_active(iter)) {
+			mem_cgroup_iter_break(memcg, iter);
+			return true;
+		}
+		iter = mem_cgroup_iter(memcg, iter, NULL);
+	} while (iter);
+	return false;
+}
+
 /*
  * helper for acessing a memcg's index. It will be used as an index in the
  * child cache array in kmem_cache, and also to derive its name. This function
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1f56a80a7c41..1b79d291287e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -328,6 +328,33 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
 	return freed;
 }
 
+static unsigned long
+run_shrinker(struct shrink_control *shrinkctl, struct shrinker *shrinker,
+	     unsigned long nr_pages_scanned, unsigned long lru_pages)
+{
+	unsigned long freed = 0;
+
+	/*
+	 * Since most memory cgroups are small and typically confined to a
+	 * single NUMA node or two by external means and therefore do not need
+	 * the scalability NUMA aware shrinkers provide, we implement per node
+	 * shrinking only for the global list.
+	 */
+	if (!(shrinker->flags & SHRINKER_NUMA_AWARE) ||
+	    shrinkctl->memcg) {
+		shrinkctl->nid = 0;
+		return shrink_slab_node(shrinkctl, shrinker,
+					nr_pages_scanned, lru_pages);
+	}
+
+	for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
+		if (node_online(shrinkctl->nid))
+			freed += shrink_slab_node(shrinkctl, shrinker,
+						  nr_pages_scanned, lru_pages);
+	}
+	return freed;
+}
+
 /*
  * Call the shrink functions to age shrinkable caches
  *
@@ -369,20 +396,34 @@ unsigned long shrink_slab(struct shrink_control *shrinkctl,
 	}
 
 	list_for_each_entry(shrinker, &shrinker_list, list) {
-		if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) {
-			shrinkctl->nid = 0;
-			freed += shrink_slab_node(shrinkctl, shrinker,
-					nr_pages_scanned, lru_pages);
+		/*
+		 * Call memcg-unaware shrinkers only on global pressure.
+		 */
+		if (!(shrinker->flags & SHRINKER_MEMCG_AWARE)) {
+			if (!shrinkctl->target_mem_cgroup) {
+				shrinkctl->memcg = NULL;
+				freed += run_shrinker(shrinkctl, shrinker,
+						nr_pages_scanned, lru_pages);
+			}
 			continue;
 		}
 
-		for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
-			if (node_online(shrinkctl->nid))
-				freed += shrink_slab_node(shrinkctl, shrinker,
+		/*
+		 * For memcg-aware shrinkers iterate over the target memcg
+		 * hierarchy and run the shrinker on each kmem-active memcg
+		 * found in the hierarchy.
+		 */
+		shrinkctl->memcg = shrinkctl->target_mem_cgroup;
+		do {
+			if (!shrinkctl->memcg ||
+			    memcg_kmem_is_active(shrinkctl->memcg))
+				freed += run_shrinker(shrinkctl, shrinker,
 						nr_pages_scanned, lru_pages);
-
-		}
+		} while ((shrinkctl->memcg =
+			  mem_cgroup_iter(shrinkctl->target_mem_cgroup,
+					  shrinkctl->memcg, NULL)) != NULL);
 	}
+
 	up_read(&shrinker_rwsem);
 out:
 	cond_resched();
@@ -2316,6 +2357,7 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	struct shrink_control shrink = {
 		.gfp_mask = sc->gfp_mask,
+		.target_mem_cgroup = sc->target_mem_cgroup,
 	};
 
 	/*
@@ -2332,17 +2374,22 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 					gfp_zone(sc->gfp_mask), sc->nodemask) {
 		if (!populated_zone(zone))
 			continue;
+
+		if (global_reclaim(sc) &&
+		    !cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
+			continue;
+
+		lru_pages += global_reclaim(sc) ?
+				zone_reclaimable_pages(zone) :
+				mem_cgroup_zone_reclaimable_pages(zone,
+						sc->target_mem_cgroup);
+		node_set(zone_to_nid(zone), shrink.nodes_to_scan);
+
 		/*
 		 * Take care memory controller reclaiming has small influence
 		 * to global LRU.
 		 */
 		if (global_reclaim(sc)) {
-			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
-				continue;
-
-			lru_pages += zone_reclaimable_pages(zone);
-			node_set(zone_to_nid(zone), shrink.nodes_to_scan);
-
 			if (sc->priority != DEF_PRIORITY &&
 			    !zone_reclaimable(zone))
 				continue;	/* Let kswapd poll it */
@@ -2380,12 +2427,11 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	}
 
 	/*
-	 * Don't shrink slabs when reclaiming memory from over limit cgroups
-	 * but do shrink slab at least once when aborting reclaim for
-	 * compaction to avoid unevenly scanning file/anon LRU pages over slab
-	 * pages.
+	 * Shrink slabs at least once when aborting reclaim for compaction
+	 * to avoid unevenly scanning file/anon LRU pages over slab pages.
 	 */
-	if (global_reclaim(sc)) {
+	if (global_reclaim(sc) ||
+	    memcg_kmem_should_reclaim(sc->target_mem_cgroup)) {
 		shrink_slab(&shrink, sc->nr_scanned, lru_pages);
 		if (reclaim_state) {
 			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
@@ -2679,6 +2725,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 	struct zonelist *zonelist;
 	unsigned long nr_reclaimed;
 	int nid;
+	struct reclaim_state reclaim_state;
 	struct scan_control sc = {
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
@@ -2701,6 +2748,10 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 
 	zonelist = NODE_DATA(nid)->node_zonelists;
 
+	lockdep_set_current_reclaim_state(sc.gfp_mask);
+	reclaim_state.reclaimed_slab = 0;
+	current->reclaim_state = &reclaim_state;
+
 	trace_mm_vmscan_memcg_reclaim_begin(0,
 					    sc.may_writepage,
 					    sc.gfp_mask);
@@ -2709,6 +2760,9 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 
 	trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
 
+	current->reclaim_state = NULL;
+	lockdep_clear_current_reclaim_state();
+
 	return nr_reclaimed;
 }
 #endif
-- 
1.7.10.4



* [PATCH -mm v15 09/13] list_lru: add per-memcg lists
  2014-02-05 18:39 ` Vladimir Davydov
@ 2014-02-05 18:39   ` Vladimir Davydov
  -1 siblings, 0 replies; 44+ messages in thread
From: Vladimir Davydov @ 2014-02-05 18:39 UTC (permalink / raw)
  To: akpm
  Cc: dchinner, mhocko, hannes, glommer, rientjes, linux-kernel,
	linux-mm, devel, Glauber Costa, Al Viro, Balbir Singh,
	KAMEZAWA Hiroyuki

There are several FS shrinkers, including super_block::s_shrink, that
keep reclaimable objects in the list_lru structure. Hence, to turn them
into memcg-aware shrinkers, it is enough to make list_lru per-memcg.

This patch does the trick. It adds an array of LRU lists to the list_lru
structure, one for each kmem-active memcg, and dispatches every item
addition or removal operation to the list corresponding to the memcg the
item is accounted to.

Since we already pass a shrink_control object to the list_lru count and
walk functions to specify the NUMA node to scan, and the target memcg is
held in this structure, there is no need to change the list_lru
interface.

To make sure each kmem-active memcg has its list initialized in each
memcg-enabled list_lru, we keep all memcg-enabled list_lrus on a linked
list, which we iterate over, allocating per-memcg LRUs, whenever a new
kmem-active memcg is added. To synchronize this with the creation of new
list_lrus, we have to take activate_kmem_mutex. Since using this mutex
as is would make all mounts proceed serially, we turn it into an rw
semaphore and take it for writing whenever a new kmem-active memcg is
created and for reading when we are about to create a list_lru. This
still does not allow mount_fs() to proceed concurrently with the
creation of a kmem-active memcg, but since memcg creation is a rather
rare event, this is not critical.

The idea behind the patch, as well as the initial implementation,
belongs to Glauber Costa.
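
To give an idea of what a conversion looks like with this patch
applied, here is a rough sketch of a memcg-aware, list_lru-backed
shrinker (foo_lru, foo_isolate() and the owning foo_shrinker, which
sets SHRINKER_MEMCG_AWARE as in the previous patch, are hypothetical;
the sb shrinker is converted along these lines later in the series):

static struct list_lru foo_lru;

static enum lru_status foo_isolate(struct list_head *item,
				   spinlock_t *lock, void *cb_arg);

static int foo_init(void)
{
	/* One internal LRU per kmem-active memcg, plus the global one. */
	return list_lru_init_memcg(&foo_lru);
}

/*
 * list_lru_add()/list_lru_del() are called exactly as before; the lru
 * itself dispatches each object to the list of the memcg the object is
 * accounted to.
 */

static unsigned long foo_count(struct shrinker *shrink,
			       struct shrink_control *sc)
{
	/* Counts only the node/memcg pair selected by sc. */
	return list_lru_shrink_count(&foo_lru, sc);
}

static unsigned long foo_scan(struct shrinker *shrink,
			      struct shrink_control *sc)
{
	/* Walks only the node/memcg pair selected by sc. */
	return list_lru_shrink_walk(&foo_lru, sc, foo_isolate, NULL);
}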

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/list_lru.h   |  112 ++++++++++++------
 include/linux/memcontrol.h |   13 +++
 mm/list_lru.c              |  271 +++++++++++++++++++++++++++++++++++++++-----
 mm/memcontrol.c            |  186 ++++++++++++++++++++++++++++--
 4 files changed, 511 insertions(+), 71 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 6ca43b2486fc..92d29cd790b2 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -11,6 +11,8 @@
 #include <linux/nodemask.h>
 #include <linux/shrinker.h>
 
+struct mem_cgroup;
+
 /* list_lru_walk_cb has to always return one of those */
 enum lru_status {
 	LRU_REMOVED,		/* item removed from list */
@@ -32,10 +34,52 @@ struct list_lru_node {
 struct list_lru {
 	struct list_lru_node	*node;
 	nodemask_t		active_nodes;
+#ifdef CONFIG_MEMCG_KMEM
+	/*
+	 * In order to provide ability of scanning objects from different
+	 * memory cgroups independently, we keep a separate LRU list for each
+	 * kmem-active memcg in this array. The array is RCU-protected and
+	 * indexed by memcg_cache_id().
+	 */
+	struct list_lru_node	**memcg;
+	/*
+	 * Every time a kmem-active memcg is created or destroyed, we have to
+	 * update the array of per-memcg LRUs in each memcg enabled list_lru
+	 * structure. To achieve that, we keep all memcg enabled list_lru
+	 * structures in the all_memcg_lrus list.
+	 */
+	struct list_head	memcg_lrus_list;
+	/*
+	 * Since the array of per-memcg LRUs is RCU-protected, we can only free
+	 * it after a call to synchronize_rcu(). To avoid multiple calls to
+	 * synchronize_rcu() when a lot of LRUs get updated at the same time,
+	 * which is a typical scenario, we will store the pointer to the
+	 * previous version of the array in the memcg_old field for each
+	 * list_lru structure, and then free them all at once after a single
+	 * call to synchronize_rcu().
+	 */
+	void			*memcg_old;
+#endif /* CONFIG_MEMCG_KMEM */
 };
 
+#ifdef CONFIG_MEMCG_KMEM
+int list_lru_memcg_alloc(struct list_lru *lru, int memcg_id);
+void list_lru_memcg_free(struct list_lru *lru, int memcg_id);
+int list_lru_grow_memcg(struct list_lru *lru, size_t new_array_size);
+#endif
+
 void list_lru_destroy(struct list_lru *lru);
-int list_lru_init(struct list_lru *lru);
+int __list_lru_init(struct list_lru *lru, bool memcg_enabled);
+
+static inline int list_lru_init(struct list_lru *lru)
+{
+	return __list_lru_init(lru, false);
+}
+
+static inline int list_lru_init_memcg(struct list_lru *lru)
+{
+	return __list_lru_init(lru, true);
+}
 
 /**
  * list_lru_add: add an element to the lru list's tail
@@ -69,39 +113,41 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item);
 bool list_lru_del(struct list_lru *lru, struct list_head *item);
 
 /**
- * list_lru_count_node: return the number of objects currently held by @lru
+ * list_lru_count_node_memcg: return the number of objects currently held by a
+ *  list_lru.
  * @lru: the lru pointer.
  * @nid: the node id to count from.
+ * @memcg: the memcg to count from.
  *
  * Always return a non-negative number, 0 for empty lists. There is no
  * guarantee that the list is not updated while the count is being computed.
  * Callers that want such a guarantee need to provide an outer lock.
  */
-unsigned long list_lru_count_node(struct list_lru *lru, int nid);
+unsigned long list_lru_count_node_memcg(struct list_lru *lru,
+					int nid, struct mem_cgroup *memcg);
 
-static inline unsigned long list_lru_shrink_count(struct list_lru *lru,
-						  struct shrink_control *sc)
+unsigned long list_lru_count(struct list_lru *lru);
+
+static inline unsigned long list_lru_count_node(struct list_lru *lru, int nid)
 {
-	return list_lru_count_node(lru, sc->nid);
+	return list_lru_count_node_memcg(lru, nid, NULL);
 }
 
-static inline unsigned long list_lru_count(struct list_lru *lru)
+static inline unsigned long list_lru_shrink_count(struct list_lru *lru,
+						  struct shrink_control *sc)
 {
-	long count = 0;
-	int nid;
-
-	for_each_node_mask(nid, lru->active_nodes)
-		count += list_lru_count_node(lru, nid);
-
-	return count;
+	return list_lru_count_node_memcg(lru, sc->nid, sc->memcg);
 }
 
 typedef enum lru_status
 (*list_lru_walk_cb)(struct list_head *item, spinlock_t *lock, void *cb_arg);
+
 /**
- * list_lru_walk_node: walk a list_lru, isolating and disposing freeable items.
+ * list_lru_walk_node_memcg: walk a list_lru, isolating and disposing freeable
+ *  items.
  * @lru: the lru pointer.
  * @nid: the node id to scan from.
+ * @memcg: the memcg to scan from.
  * @isolate: callback function that is resposible for deciding what to do with
  *  the item currently being scanned
  * @cb_arg: opaque type that will be passed to @isolate
@@ -119,31 +165,29 @@ typedef enum lru_status
  *
  * Return value: the number of objects effectively removed from the LRU.
  */
-unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
-				 list_lru_walk_cb isolate, void *cb_arg,
-				 unsigned long *nr_to_walk);
+unsigned long list_lru_walk_node_memcg(struct list_lru *lru,
+				       int nid, struct mem_cgroup *memcg,
+				       list_lru_walk_cb isolate, void *cb_arg,
+				       unsigned long *nr_to_walk);
+
+unsigned long list_lru_walk(struct list_lru *lru,
+			    list_lru_walk_cb isolate, void *cb_arg,
+			    unsigned long nr_to_walk);
 
 static inline unsigned long
-list_lru_shrink_walk(struct list_lru *lru, struct shrink_control *sc,
-		     list_lru_walk_cb isolate, void *cb_arg)
+list_lru_walk_node(struct list_lru *lru, int nid,
+		   list_lru_walk_cb isolate, void *cb_arg,
+		   unsigned long *nr_to_walk)
 {
-	return list_lru_walk_node(lru, sc->nid, isolate, cb_arg,
-				  &sc->nr_to_scan);
+	return list_lru_walk_node_memcg(lru, nid, NULL,
+					isolate, cb_arg, nr_to_walk);
 }
 
 static inline unsigned long
-list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
-	      void *cb_arg, unsigned long nr_to_walk)
+list_lru_shrink_walk(struct list_lru *lru, struct shrink_control *sc,
+		     list_lru_walk_cb isolate, void *cb_arg)
 {
-	long isolated = 0;
-	int nid;
-
-	for_each_node_mask(nid, lru->active_nodes) {
-		isolated += list_lru_walk_node(lru, nid, isolate,
-					       cb_arg, &nr_to_walk);
-		if (nr_to_walk <= 0)
-			break;
-	}
-	return isolated;
+	return list_lru_walk_node_memcg(lru, sc->nid, sc->memcg,
+					isolate, cb_arg, &sc->nr_to_scan);
 }
 #endif /* _LRU_LIST_H */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index fc4a24d31e99..3b310c58822a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -29,6 +29,7 @@ struct page_cgroup;
 struct page;
 struct mm_struct;
 struct kmem_cache;
+struct list_lru;
 
 /*
  * The corresponding mem_cgroup_stat_names is defined in mm/memcontrol.c,
@@ -539,6 +540,9 @@ __memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);
 void mem_cgroup_destroy_cache(struct kmem_cache *cachep);
 void kmem_cache_destroy_memcg_children(struct kmem_cache *s);
 
+int memcg_list_lru_init(struct list_lru *lru, bool memcg_enabled);
+void memcg_list_lru_destroy(struct list_lru *lru);
+
 /**
  * memcg_kmem_newpage_charge: verify if a new kmem allocation is allowed.
  * @gfp: the gfp allocation flags.
@@ -705,6 +709,15 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
 static inline void kmem_cache_destroy_memcg_children(struct kmem_cache *s)
 {
 }
+
+static inline int memcg_list_lru_init(struct list_lru *lru, bool memcg_enabled)
+{
+	return 0;
+}
+
+static inline void memcg_list_lru_destroy(struct list_lru *lru)
+{
+}
 #endif /* CONFIG_MEMCG_KMEM */
 #endif /* _LINUX_MEMCONTROL_H */
 
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 7f5b73e2513b..d9c4c48bb8d0 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -7,19 +7,94 @@
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/mm.h>
-#include <linux/list_lru.h>
 #include <linux/slab.h>
+#include <linux/memcontrol.h>
+#include <linux/page_cgroup.h>
+#include <linux/list_lru.h>
+
+#ifdef CONFIG_MEMCG_KMEM
+static inline bool lru_has_memcg(struct list_lru *lru)
+{
+	return !!lru->memcg;
+}
+
+static struct list_lru_node *lru_node_of_index(struct list_lru *lru,
+					int nid, int memcg_id, bool *is_global)
+{
+	struct list_lru_node **memcg_lrus;
+	struct list_lru_node *nlru = NULL;
+
+	if (memcg_id < 0 || !lru_has_memcg(lru)) {
+		*is_global = true;
+		return &lru->node[nid];
+	}
+
+	rcu_read_lock();
+	memcg_lrus = rcu_dereference(lru->memcg);
+	nlru = memcg_lrus[memcg_id];
+	rcu_read_unlock();
+
+	*is_global = false;
+
+	/*
+	 * Make sure we will access the up-to-date value. The code updating
+	 * memcg_lrus issues a write barrier to match this (see
+	 * list_lru_memcg_alloc()).
+	 */
+	smp_read_barrier_depends();
+	return nlru;
+}
+
+static struct list_lru_node *lru_node_of_page(struct list_lru *lru,
+					struct page *page, bool *is_global)
+{
+	struct page_cgroup *pc;
+	struct mem_cgroup *memcg;
+
+	/*
+	 * Since a kmem page cannot change its cgroup after its allocation is
+	 * committed, we do not need to lock_page_cgroup() here.
+	 */
+	pc = lookup_page_cgroup(compound_head(page));
+	memcg = PageCgroupUsed(pc) ? pc->mem_cgroup : NULL;
+
+	return lru_node_of_index(lru, page_to_nid(page),
+				 memcg_cache_id(memcg), is_global);
+}
+#else /* !CONFIG_MEMCG_KMEM */
+static inline bool lru_has_memcg(struct list_lru *lru)
+{
+	return false;
+}
+
+static inline struct list_lru_node *lru_node_of_index(struct list_lru *lru,
+					int nid, int memcg_id, bool *is_global)
+{
+	*is_global = true;
+	return &lru->node[nid];
+}
+
+static inline struct list_lru_node *lru_node_of_page(struct list_lru *lru,
+					struct page *page, bool *is_global)
+{
+	return lru_node_of_index(lru, page_to_nid(page), -1, is_global);
+}
+#endif /* CONFIG_MEMCG_KMEM */
 
 bool list_lru_add(struct list_lru *lru, struct list_head *item)
 {
-	int nid = page_to_nid(virt_to_page(item));
-	struct list_lru_node *nlru = &lru->node[nid];
+	struct page *page = virt_to_page(item);
+	int nid = page_to_nid(page);
+	bool is_global;
+	struct list_lru_node *nlru;
+
+	nlru = lru_node_of_page(lru, page, &is_global);
 
 	spin_lock(&nlru->lock);
 	WARN_ON_ONCE(nlru->nr_items < 0);
 	if (list_empty(item)) {
 		list_add_tail(item, &nlru->list);
-		if (nlru->nr_items++ == 0)
+		if (nlru->nr_items++ == 0 && is_global)
 			node_set(nid, lru->active_nodes);
 		spin_unlock(&nlru->lock);
 		return true;
@@ -31,13 +106,17 @@ EXPORT_SYMBOL_GPL(list_lru_add);
 
 bool list_lru_del(struct list_lru *lru, struct list_head *item)
 {
-	int nid = page_to_nid(virt_to_page(item));
-	struct list_lru_node *nlru = &lru->node[nid];
+	struct page *page = virt_to_page(item);
+	int nid = page_to_nid(page);
+	bool is_global;
+	struct list_lru_node *nlru;
+
+	nlru = lru_node_of_page(lru, page, &is_global);
 
 	spin_lock(&nlru->lock);
 	if (!list_empty(item)) {
 		list_del_init(item);
-		if (--nlru->nr_items == 0)
+		if (--nlru->nr_items == 0 && is_global)
 			node_clear(nid, lru->active_nodes);
 		WARN_ON_ONCE(nlru->nr_items < 0);
 		spin_unlock(&nlru->lock);
@@ -48,11 +127,14 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item)
 }
 EXPORT_SYMBOL_GPL(list_lru_del);
 
-unsigned long
-list_lru_count_node(struct list_lru *lru, int nid)
+unsigned long list_lru_count_node_memcg(struct list_lru *lru,
+					int nid, struct mem_cgroup *memcg)
 {
 	unsigned long count = 0;
-	struct list_lru_node *nlru = &lru->node[nid];
+	bool is_global;
+	struct list_lru_node *nlru;
+
+	nlru = lru_node_of_index(lru, nid, memcg_cache_id(memcg), &is_global);
 
 	spin_lock(&nlru->lock);
 	WARN_ON_ONCE(nlru->nr_items < 0);
@@ -61,16 +143,41 @@ list_lru_count_node(struct list_lru *lru, int nid)
 
 	return count;
 }
-EXPORT_SYMBOL_GPL(list_lru_count_node);
+EXPORT_SYMBOL_GPL(list_lru_count_node_memcg);
+
+unsigned long list_lru_count(struct list_lru *lru)
+{
+	long count = 0;
+	int nid;
+	struct mem_cgroup *memcg;
+
+	for_each_node_mask(nid, lru->active_nodes)
+		count += list_lru_count_node(lru, nid);
+
+	if (!lru_has_memcg(lru))
+		goto out;
+
+	for_each_mem_cgroup(memcg) {
+		if (memcg_kmem_is_active(memcg))
+			count += list_lru_count_node_memcg(lru, 0, memcg);
+	}
+out:
+	return count;
+}
+EXPORT_SYMBOL_GPL(list_lru_count);
 
-unsigned long
-list_lru_walk_node(struct list_lru *lru, int nid, list_lru_walk_cb isolate,
-		   void *cb_arg, unsigned long *nr_to_walk)
+unsigned long list_lru_walk_node_memcg(struct list_lru *lru,
+				       int nid, struct mem_cgroup *memcg,
+				       list_lru_walk_cb isolate, void *cb_arg,
+				       unsigned long *nr_to_walk)
 {
 
-	struct list_lru_node	*nlru = &lru->node[nid];
 	struct list_head *item, *n;
 	unsigned long isolated = 0;
+	bool is_global;
+	struct list_lru_node *nlru;
+
+	nlru = lru_node_of_index(lru, nid, memcg_cache_id(memcg), &is_global);
 
 	spin_lock(&nlru->lock);
 restart:
@@ -90,7 +197,7 @@ restart:
 		case LRU_REMOVED_RETRY:
 			assert_spin_locked(&nlru->lock);
 		case LRU_REMOVED:
-			if (--nlru->nr_items == 0)
+			if (--nlru->nr_items == 0 && is_global)
 				node_clear(nid, lru->active_nodes);
 			WARN_ON_ONCE(nlru->nr_items < 0);
 			isolated++;
@@ -122,29 +229,141 @@ restart:
 	spin_unlock(&nlru->lock);
 	return isolated;
 }
-EXPORT_SYMBOL_GPL(list_lru_walk_node);
+EXPORT_SYMBOL_GPL(list_lru_walk_node_memcg);
+
+unsigned long list_lru_walk(struct list_lru *lru,
+			    list_lru_walk_cb isolate, void *cb_arg,
+			    unsigned long nr_to_walk)
+{
+	long isolated = 0;
+	int nid;
+	struct mem_cgroup *memcg;
+
+	for_each_node_mask(nid, lru->active_nodes) {
+		isolated += list_lru_walk_node(lru, nid, isolate,
+					       cb_arg, &nr_to_walk);
+		if (nr_to_walk <= 0)
+			goto out;
+	}
+
+	if (!lru_has_memcg(lru))
+		goto out;
+
+	for_each_mem_cgroup(memcg) {
+		if (!memcg_kmem_is_active(memcg))
+			continue;
+		isolated += list_lru_walk_node_memcg(lru, 0, memcg, isolate,
+						     cb_arg, &nr_to_walk);
+		if (nr_to_walk <= 0) {
+			mem_cgroup_iter_break(NULL, memcg);
+			break;
+		}
+	}
+out:
+	return isolated;
+}
+EXPORT_SYMBOL_GPL(list_lru_walk);
 
-int list_lru_init(struct list_lru *lru)
+static void list_lru_node_init(struct list_lru_node *nlru)
+{
+	spin_lock_init(&nlru->lock);
+	INIT_LIST_HEAD(&nlru->list);
+	nlru->nr_items = 0;
+}
+
+int __list_lru_init(struct list_lru *lru, bool memcg_enabled)
 {
 	int i;
-	size_t size = sizeof(*lru->node) * nr_node_ids;
+	int err = 0;
 
-	lru->node = kzalloc(size, GFP_KERNEL);
+	lru->node = kcalloc(nr_node_ids, sizeof(*lru->node), GFP_KERNEL);
 	if (!lru->node)
 		return -ENOMEM;
 
 	nodes_clear(lru->active_nodes);
-	for (i = 0; i < nr_node_ids; i++) {
-		spin_lock_init(&lru->node[i].lock);
-		INIT_LIST_HEAD(&lru->node[i].list);
-		lru->node[i].nr_items = 0;
+	for (i = 0; i < nr_node_ids; i++)
+		list_lru_node_init(&lru->node[i]);
+
+	err = memcg_list_lru_init(lru, memcg_enabled);
+	if (err) {
+		kfree(lru->node);
+		lru->node = NULL; /* see list_lru_destroy() */
 	}
-	return 0;
+
+	return err;
 }
-EXPORT_SYMBOL_GPL(list_lru_init);
+EXPORT_SYMBOL_GPL(__list_lru_init);
 
 void list_lru_destroy(struct list_lru *lru)
 {
+	/*
+	 * We might be called after partial initialization (e.g. due to ENOMEM
+	 * error) so handle that appropriately.
+	 */
+	if (!lru->node)
+		return;
+
 	kfree(lru->node);
+	memcg_list_lru_destroy(lru);
 }
 EXPORT_SYMBOL_GPL(list_lru_destroy);
+
+#ifdef CONFIG_MEMCG_KMEM
+int list_lru_memcg_alloc(struct list_lru *lru, int memcg_id)
+{
+	struct list_lru_node *nlru;
+
+	nlru = kmalloc(sizeof(*nlru), GFP_KERNEL);
+	if (!nlru)
+		return -ENOMEM;
+
+	list_lru_node_init(nlru);
+
+	/*
+	 * Since readers won't lock (see lru_node_of_index()), we need a
+	 * barrier here to ensure nobody will see the list_lru_node partially
+	 * initialized.
+	 */
+	smp_wmb();
+
+	VM_BUG_ON(lru->memcg[memcg_id]);
+	lru->memcg[memcg_id] = nlru;
+	return 0;
+}
+
+void list_lru_memcg_free(struct list_lru *lru, int memcg_id)
+{
+	if (lru->memcg[memcg_id]) {
+		kfree(lru->memcg[memcg_id]);
+		lru->memcg[memcg_id] = NULL;
+	}
+}
+
+int list_lru_grow_memcg(struct list_lru *lru, size_t new_array_size)
+{
+	int i;
+	struct list_lru_node **memcg_lrus;
+
+	memcg_lrus = kcalloc(new_array_size, sizeof(*memcg_lrus), GFP_KERNEL);
+	if (!memcg_lrus)
+		return -ENOMEM;
+
+	if (lru->memcg) {
+		for_each_memcg_cache_index(i) {
+			if (lru->memcg[i])
+				memcg_lrus[i] = lru->memcg[i];
+		}
+	}
+
+	/*
+	 * Since we access the lru->memcg array lockless, inside an RCU
+	 * critical section (see lru_node_of_index()), we cannot free the old
+	 * version of the array right now. So we save it to lru->memcg_old to
+	 * be freed by the caller after a grace period.
+	 */
+	VM_BUG_ON(lru->memcg_old);
+	lru->memcg_old = lru->memcg;
+	rcu_assign_pointer(lru->memcg, memcg_lrus);
+	return 0;
+}
+#endif /* CONFIG_MEMCG_KMEM */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 24557d09213c..27f6d795090a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -56,6 +56,7 @@
 #include <linux/oom.h>
 #include <linux/lockdep.h>
 #include <linux/file.h>
+#include <linux/list_lru.h>
 #include "internal.h"
 #include <net/sock.h>
 #include <net/ip.h>
@@ -3004,7 +3005,11 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
 static DEFINE_MUTEX(set_limit_mutex);
 
 #ifdef CONFIG_MEMCG_KMEM
-static DEFINE_MUTEX(activate_kmem_mutex);
+/*
+ * This semaphore serializes activations of kmem accounting for memory cgroups.
+ * Holding it for reading guarantees no cgroups will become kmem active.
+ */
+static DECLARE_RWSEM(activate_kmem_sem);
 
 static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
 {
@@ -3338,6 +3343,154 @@ void memcg_unregister_cache(struct kmem_cache *s)
 }
 
 /*
+ * The list of all memcg-enabled list_lru structures. Needed for updating all
+ * per-memcg LRUs whenever a kmem-active memcg is created or destroyed. The
+ * list is updated under the activate_kmem_sem held for reading so to safely
+ * iterate over it, it is enough to take the activate_kmem_sem for writing.
+ */
+static LIST_HEAD(all_memcg_lrus);
+static DEFINE_SPINLOCK(all_memcg_lrus_lock);
+
+static void __memcg_destroy_all_lrus(int memcg_id)
+{
+	struct list_lru *lru;
+
+	list_for_each_entry(lru, &all_memcg_lrus, memcg_lrus_list)
+		list_lru_memcg_free(lru, memcg_id);
+}
+
+/*
+ * This function is called when a kmem-active memcg is destroyed in order to
+ * free LRUs corresponding to the memcg in all list_lru structures.
+ */
+static void memcg_destroy_all_lrus(struct mem_cgroup *memcg)
+{
+	int memcg_id;
+
+	memcg_id = memcg_cache_id(memcg);
+	if (memcg_id >= 0) {
+		down_write(&activate_kmem_sem);
+		__memcg_destroy_all_lrus(memcg_id);
+		up_write(&activate_kmem_sem);
+	}
+}
+
+/*
+ * This function allocates LRUs for a memcg in all list_lru structures. It is
+ * called with activate_kmem_sem held for writing when a new kmem-active memcg
+ * is added.
+ */
+static int memcg_init_all_lrus(int new_memcg_id)
+{
+	int err = 0;
+	int num_memcgs = new_memcg_id + 1;
+	int grow = (num_memcgs > memcg_limited_groups_array_size);
+	size_t new_array_size = memcg_caches_array_size(num_memcgs);
+	struct list_lru *lru;
+
+	if (grow) {
+		list_for_each_entry(lru, &all_memcg_lrus, memcg_lrus_list) {
+			err = list_lru_grow_memcg(lru, new_array_size);
+			if (err)
+				goto out;
+		}
+	}
+
+	list_for_each_entry(lru, &all_memcg_lrus, memcg_lrus_list) {
+		err = list_lru_memcg_alloc(lru, new_memcg_id);
+		if (err) {
+			__memcg_destroy_all_lrus(new_memcg_id);
+			break;
+		}
+	}
+out:
+	if (grow) {
+		synchronize_rcu();
+		list_for_each_entry(lru, &all_memcg_lrus, memcg_lrus_list) {
+			kfree(lru->memcg_old);
+			lru->memcg_old = NULL;
+		}
+	}
+	return err;
+}
+
+int memcg_list_lru_init(struct list_lru *lru, bool memcg_enabled)
+{
+	int err = 0;
+	int i;
+	struct mem_cgroup *memcg;
+
+	lru->memcg = NULL;
+	lru->memcg_old = NULL;
+	INIT_LIST_HEAD(&lru->memcg_lrus_list);
+
+	if (!memcg_enabled)
+		return 0;
+
+	down_read(&activate_kmem_sem);
+	if (!memcg_kmem_enabled())
+		goto out_list_add;
+
+	lru->memcg = kcalloc(memcg_limited_groups_array_size,
+			     sizeof(*lru->memcg), GFP_KERNEL);
+	if (!lru->memcg) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	for_each_mem_cgroup(memcg) {
+		int memcg_id;
+
+		memcg_id = memcg_cache_id(memcg);
+		if (memcg_id < 0)
+			continue;
+
+		err = list_lru_memcg_alloc(lru, memcg_id);
+		if (err) {
+			mem_cgroup_iter_break(NULL, memcg);
+			goto out_free_lru_memcg;
+		}
+	}
+out_list_add:
+	spin_lock(&all_memcg_lrus_lock);
+	list_add(&lru->memcg_lrus_list, &all_memcg_lrus);
+	spin_unlock(&all_memcg_lrus_lock);
+out:
+	up_read(&activate_kmem_sem);
+	return err;
+
+out_free_lru_memcg:
+	for (i = 0; i < memcg_limited_groups_array_size; i++)
+		list_lru_memcg_free(lru, i);
+	kfree(lru->memcg);
+	goto out;
+}
+
+void memcg_list_lru_destroy(struct list_lru *lru)
+{
+	int i, array_size;
+
+	if (list_empty(&lru->memcg_lrus_list))
+		return;
+
+	down_read(&activate_kmem_sem);
+
+	array_size = memcg_limited_groups_array_size;
+
+	spin_lock(&all_memcg_lrus_lock);
+	list_del(&lru->memcg_lrus_list);
+	spin_unlock(&all_memcg_lrus_lock);
+
+	up_read(&activate_kmem_sem);
+
+	if (lru->memcg) {
+		for (i = 0; i < array_size; i++)
+			list_lru_memcg_free(lru, i);
+		kfree(lru->memcg);
+	}
+}
+
+/*
  * During the creation of a new cache, we need to disable our accounting
  * mechanism altogether. This is true even if we are not creating, but rather
  * just enqueueing new caches to be created.
@@ -3486,10 +3639,9 @@ void kmem_cache_destroy_memcg_children(struct kmem_cache *s)
 	 *
 	 * Still, we don't want anyone else freeing memcg_caches under our
 	 * noses, which can happen if a new memcg comes to life. As usual,
-	 * we'll take the activate_kmem_mutex to protect ourselves against
-	 * this.
+	 * we'll take the activate_kmem_sem to protect ourselves against this.
 	 */
-	mutex_lock(&activate_kmem_mutex);
+	down_read(&activate_kmem_sem);
 	for_each_memcg_cache_index(i) {
 		c = cache_from_memcg_idx(s, i);
 		if (!c)
@@ -3512,7 +3664,7 @@ void kmem_cache_destroy_memcg_children(struct kmem_cache *s)
 		cancel_work_sync(&c->memcg_params->destroy);
 		kmem_cache_destroy(c);
 	}
-	mutex_unlock(&activate_kmem_mutex);
+	up_read(&activate_kmem_sem);
 }
 
 struct create_work {
@@ -5179,7 +5331,7 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
 }
 
 #ifdef CONFIG_MEMCG_KMEM
-/* should be called with activate_kmem_mutex held */
+/* should be called with activate_kmem_sem held for writing */
 static int __memcg_activate_kmem(struct mem_cgroup *memcg,
 				 unsigned long long limit)
 {
@@ -5222,12 +5374,21 @@ static int __memcg_activate_kmem(struct mem_cgroup *memcg,
 	}
 
 	/*
+	 * Initialize this cgroup's lists in each list_lru. This must be done
+	 * before calling memcg_update_all_caches(), where we update the
+	 * limited_groups_array_size.
+	 */
+	err = memcg_init_all_lrus(memcg_id);
+	if (err)
+		goto out_rmid;
+
+	/*
 	 * Make sure we have enough space for this cgroup in each root cache's
 	 * memcg_params.
 	 */
 	err = memcg_update_all_caches(memcg_id + 1);
 	if (err)
-		goto out_rmid;
+		goto out_destroy_all_lrus;
 
 	memcg->kmemcg_id = memcg_id;
 
@@ -5249,6 +5410,8 @@ out:
 	memcg_resume_kmem_account();
 	return err;
 
+out_destroy_all_lrus:
+	__memcg_destroy_all_lrus(memcg_id);
 out_rmid:
 	ida_simple_remove(&kmem_limited_groups, memcg_id);
 	goto out;
@@ -5259,9 +5422,9 @@ static int memcg_activate_kmem(struct mem_cgroup *memcg,
 {
 	int ret;
 
-	mutex_lock(&activate_kmem_mutex);
+	down_write(&activate_kmem_sem);
 	ret = __memcg_activate_kmem(memcg, limit);
-	mutex_unlock(&activate_kmem_mutex);
+	up_write(&activate_kmem_sem);
 	return ret;
 }
 
@@ -5285,14 +5448,14 @@ static int memcg_propagate_kmem(struct mem_cgroup *memcg)
 	if (!parent)
 		return 0;
 
-	mutex_lock(&activate_kmem_mutex);
+	down_write(&activate_kmem_sem);
 	/*
 	 * If the parent cgroup is not kmem-active now, it cannot be activated
 	 * after this point, because it has at least one child already.
 	 */
 	if (memcg_kmem_is_active(parent))
 		ret = __memcg_activate_kmem(memcg, RES_COUNTER_MAX);
-	mutex_unlock(&activate_kmem_mutex);
+	up_write(&activate_kmem_sem);
 	return ret;
 }
 #else
@@ -5989,6 +6152,7 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 static void memcg_destroy_kmem(struct mem_cgroup *memcg)
 {
 	mem_cgroup_sockets_destroy(memcg);
+	memcg_destroy_all_lrus(memcg);
 }
 
 static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH -mm v15 10/13] fs: make shrinker memcg aware
  2014-02-05 18:39 ` Vladimir Davydov
@ 2014-02-05 18:39   ` Vladimir Davydov
  -1 siblings, 0 replies; 44+ messages in thread
From: Vladimir Davydov @ 2014-02-05 18:39 UTC (permalink / raw)
  To: akpm
  Cc: dchinner, mhocko, hannes, glommer, rientjes, linux-kernel,
	linux-mm, devel, Glauber Costa, Al Viro

Now, to make any list_lru-based shrinker memcg aware, it is enough to
initialize its list_lru as memcg-enabled. Let's do it for the general FS
shrinker (super_block::s_shrink) and mark it as memcg aware.

There are other FS-specific shrinkers that use list_lru for storing
objects, such as XFS and GFS2 dquot cache shrinkers, but since they
reclaim objects that may be shared among different cgroups, there is no
point making them memcg aware.
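
For reference, converting any other list_lru-backed shrinker boils down to
the same two changes made below for the sb shrinker. A minimal sketch,
assuming a hypothetical foo cache (the foo_* names and the foo_isolate
callback are illustrative only; list_lru_init_memcg() and
SHRINKER_MEMCG_AWARE are what this series provides):

	static struct list_lru foo_lru;

	static unsigned long foo_count(struct shrinker *shrink,
				       struct shrink_control *sc)
	{
		return list_lru_shrink_count(&foo_lru, sc);
	}

	static unsigned long foo_scan(struct shrinker *shrink,
				      struct shrink_control *sc)
	{
		/* foo_isolate is a list_lru_walk_cb freeing foo objects */
		return list_lru_shrink_walk(&foo_lru, sc, foo_isolate, NULL);
	}

	static struct shrinker foo_shrinker = {
		.count_objects	= foo_count,
		.scan_objects	= foo_scan,
		.flags		= SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE,
	};

	static int __init foo_init(void)
	{
		int err;

		/* memcg-enabled list_lru instead of plain list_lru_init() */
		err = list_lru_init_memcg(&foo_lru);
		if (err)
			return err;
		return register_shrinker(&foo_shrinker);
	}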

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/super.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 33cbff3769e7..6a58a7196fb2 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -191,9 +191,9 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 	INIT_LIST_HEAD(&s->s_inodes);
 	INIT_WORK(&s->destroy, destroy_super_work_func);
 
-	if (list_lru_init(&s->s_dentry_lru))
+	if (list_lru_init_memcg(&s->s_dentry_lru))
 		goto fail;
-	if (list_lru_init(&s->s_inode_lru))
+	if (list_lru_init_memcg(&s->s_inode_lru))
 		goto fail;
 
 	init_rwsem(&s->s_umount);
@@ -230,7 +230,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 	s->s_shrink.scan_objects = super_cache_scan;
 	s->s_shrink.count_objects = super_cache_count;
 	s->s_shrink.batch = 1024;
-	s->s_shrink.flags = SHRINKER_NUMA_AWARE;
+	s->s_shrink.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE;
 	return s;
 
 fail:
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH -mm v15 11/13] memcg: flush memcg items upon memcg destruction
  2014-02-05 18:39 ` Vladimir Davydov
@ 2014-02-05 18:39   ` Vladimir Davydov
  -1 siblings, 0 replies; 44+ messages in thread
From: Vladimir Davydov @ 2014-02-05 18:39 UTC (permalink / raw)
  To: akpm
  Cc: dchinner, mhocko, hannes, glommer, rientjes, linux-kernel,
	linux-mm, devel, Glauber Costa, Balbir Singh, KAMEZAWA Hiroyuki

From: Glauber Costa <glommer@openvz.org>

When a memcg is destroyed, it is not released immediately; it lingers
until all of its objects are gone. This means that if a memcg is
restarted with the very same workload - a very common case - the objects
already cached won't be billed to the new memcg. This is mostly
undesirable, since a container can exploit this by restarting itself
every time it reaches its limit and then coming up again with a fresh
new limit.

Since we now have targeted reclaim, I maintain that a memcg that is
destroyed should be flushed away. This makes perfect sense if we assume
that a memcg that goes away most likely indicates an isolated workload
that has been terminated.

Signed-off-by: Glauber Costa <glommer@openvz.org>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 27f6d795090a..aed1456015cf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6155,12 +6155,40 @@ static void memcg_destroy_kmem(struct mem_cgroup *memcg)
 	memcg_destroy_all_lrus(memcg);
 }
 
+static void memcg_drop_slab(struct mem_cgroup *memcg)
+{
+	struct shrink_control shrink = {
+		.gfp_mask = GFP_KERNEL,
+		.target_mem_cgroup = memcg,
+	};
+	unsigned long nr_objects;
+
+	nodes_setall(shrink.nodes_to_scan);
+	do {
+		nr_objects = shrink_slab(&shrink, 1000, 1000);
+	} while (nr_objects > 10);
+}
+
 static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
 {
 	if (!memcg_kmem_is_active(memcg))
 		return;
 
 	/*
+	 * When a memcg is destroyed, it is not released immediately; it lingers
+	 * until all objects are gone. This means that if a memcg is restarted
+	 * with the very same workload - a very common case - the objects
+	 * already cached won't be billed to the new memcg. This is mostly
+	 * undesirable, since a container can exploit this by restarting itself
+	 * every time it reaches its limit and coming up with a fresh new limit.
+	 *
+	 * Therefore a memcg that is destroyed should be flushed away. It makes
+	 * perfect sense if we assume that a memcg that goes away indicates an
+	 * isolated workload that has been terminated.
+	 */
+	memcg_drop_slab(memcg);
+
+	/*
 	 * kmem charges can outlive the cgroup. In the case of slab
 	 * pages, for instance, a page contain objects from various
 	 * processes. As we prevent from taking a reference for every
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH -mm v15 12/13] vmpressure: in-kernel notifications
  2014-02-05 18:39 ` Vladimir Davydov
@ 2014-02-05 18:39   ` Vladimir Davydov
  -1 siblings, 0 replies; 44+ messages in thread
From: Vladimir Davydov @ 2014-02-05 18:39 UTC (permalink / raw)
  To: akpm
  Cc: dchinner, mhocko, hannes, glommer, rientjes, linux-kernel,
	linux-mm, devel, Glauber Costa, Anton Vorontsov, Pekka Enberg,
	Greg Thelen, John Stultz, Joonsoo Kim, Kamezawa Hiroyuki

From: Glauber Costa <glommer@openvz.org>

During the past weeks, it became clear to us that the shrinker interface
we have right now works very well for some particular types of users,
but not that well for others. The latter are usually people interested
in one-shot notifications, who were forced to adapt themselves to the
count+scan behavior of shrinkers. To do so, they had no choice but to
greatly abuse the shrinker interface, producing little monsters all over.

During LSF/MM, one of the proposals that popped out during our session
was to reuse Anton Vorontsov's vmpressure for this. It is designed for
userspace consumption, but also provides a well-established, cgroup-aware
entry point for notifications.

This patch extends that to also support in-kernel users. Events that
should be generated for in-kernel consumption will be marked as such,
and for those, we will call a registered function instead of triggering
an eventfd notification.
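
For illustration, an in-kernel user would hook in roughly like this (the
my_module_* names are hypothetical; vmpressure_register_kernel_event() is
the interface added by this patch):

	static void my_module_pressure(void)
	{
		/* one-shot reaction to pressure, e.g. trim module caches */
		pr_debug("vmpressure: trimming my_module caches\n");
	}

	static int my_module_watch(struct mem_cgroup *memcg)
	{
		/* fires together with userspace events for this memcg */
		return vmpressure_register_kernel_event(memcg,
							my_module_pressure);
	}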

Please note that due to my lack of understanding of each shrinker user,
I will stay away from converting the actual users; you are all welcome
to do so.

Signed-off-by: Glauber Costa <glommer@openvz.org>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Anton Vorontsov <anton@enomsg.org>
Acked-by: Pekka Enberg <penberg@kernel.org>
Reviewed-by: Greg Thelen <gthelen@google.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/vmpressure.h |    5 +++++
 mm/vmpressure.c            |   53 +++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 55 insertions(+), 3 deletions(-)

diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
index 3e4535876d37..67f0fbe52c3e 100644
--- a/include/linux/vmpressure.h
+++ b/include/linux/vmpressure.h
@@ -20,6 +20,9 @@ struct vmpressure {
 	/* Have to grab the lock on events traversal or modifications. */
 	struct mutex events_lock;
 
+	/* False if only kernel users want to be notified, true otherwise. */
+	bool notify_userspace;
+
 	struct work_struct work;
 };
 
@@ -37,6 +40,8 @@ extern struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr);
 extern int vmpressure_register_event(struct mem_cgroup *memcg,
 				     struct eventfd_ctx *eventfd,
 				     const char *args);
+extern int vmpressure_register_kernel_event(struct mem_cgroup *memcg,
+					    void (*fn)(void));
 extern void vmpressure_unregister_event(struct mem_cgroup *memcg,
 					struct eventfd_ctx *eventfd);
 #else
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index d4042e75f7c7..046029cbaa67 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -131,8 +131,12 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
 }
 
 struct vmpressure_event {
-	struct eventfd_ctx *efd;
+	union {
+		struct eventfd_ctx *efd;
+		void (*fn)(void);
+	};
 	enum vmpressure_levels level;
+	bool kernel_event;
 	struct list_head node;
 };
 
@@ -148,12 +152,15 @@ static bool vmpressure_event(struct vmpressure *vmpr,
 	mutex_lock(&vmpr->events_lock);
 
 	list_for_each_entry(ev, &vmpr->events, node) {
-		if (level >= ev->level) {
+		if (ev->kernel_event) {
+			ev->fn();
+		} else if (vmpr->notify_userspace && level >= ev->level) {
 			eventfd_signal(ev->efd, 1);
 			signalled = true;
 		}
 	}
 
+	vmpr->notify_userspace = false;
 	mutex_unlock(&vmpr->events_lock);
 
 	return signalled;
@@ -223,7 +230,7 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
 	 * we account it too.
 	 */
 	if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
-		return;
+		goto schedule;
 
 	/*
 	 * If we got here with no pages scanned, then that is an indicator
@@ -240,8 +247,15 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
 	vmpr->scanned += scanned;
 	vmpr->reclaimed += reclaimed;
 	scanned = vmpr->scanned;
+	/*
+	 * If this point is not reached, only kernel events will be triggered.
+	 * It is the job of the worker thread to clear this flag once the
+	 * notifications are all delivered.
+	 */
+	vmpr->notify_userspace = true;
 	spin_unlock(&vmpr->sr_lock);
 
+schedule:
 	if (scanned < vmpressure_win)
 		return;
 	schedule_work(&vmpr->work);
@@ -321,6 +335,39 @@ int vmpressure_register_event(struct mem_cgroup *memcg,
 }
 
 /**
+ * vmpressure_register_kernel_event() - Register kernel-side notification
+ * @memcg:	memcg that is interested in vmpressure notifications
+ * @fn:		function to be called when pressure happens
+ *
+ * This function registers in-kernel users interested in receiving
+ * notifications about pressure conditions. Pressure notifications will be
+ * triggered at the same time as userspace notifications (with no particular
+ * ordering relative to them).
+ *
+ * Pressure notifications are an alternative to shrinkers and will serve well
+ * those users that are interested in a one-shot notification with a
+ * well-defined, cgroup-aware interface.
+ */
+int vmpressure_register_kernel_event(struct mem_cgroup *memcg,
+				     void (*fn)(void))
+{
+	struct vmpressure *vmpr = memcg_to_vmpressure(memcg);
+	struct vmpressure_event *ev;
+
+	ev = kzalloc(sizeof(*ev), GFP_KERNEL);
+	if (!ev)
+		return -ENOMEM;
+
+	ev->kernel_event = true;
+	ev->fn = fn;
+
+	mutex_lock(&vmpr->events_lock);
+	list_add(&ev->node, &vmpr->events);
+	mutex_unlock(&vmpr->events_lock);
+	return 0;
+}
+
+/**
  * vmpressure_unregister_event() - Unbind eventfd from vmpressure
  * @memcg:	memcg handle
  * @eventfd:	eventfd context that was used to link vmpressure with the @cg
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH -mm v15 13/13] memcg: reap dead memcgs upon global memory pressure
  2014-02-05 18:39 ` Vladimir Davydov
@ 2014-02-05 18:39   ` Vladimir Davydov
  -1 siblings, 0 replies; 44+ messages in thread
From: Vladimir Davydov @ 2014-02-05 18:39 UTC (permalink / raw)
  To: akpm
  Cc: dchinner, mhocko, hannes, glommer, rientjes, linux-kernel,
	linux-mm, devel, Glauber Costa, Anton Vorontsov, John Stultz,
	Kamezawa Hiroyuki

From: Glauber Costa <glommer@openvz.org>

When we delete kmem-enabled memcgs, they can still linger around as
zombies for a while. The reason is that their objects may still be
alive, and we won't be able to delete them at destruction time.

The only entry point for reclaiming those objects, though, is the
shrinkers. The shrinker interface, however, is not exactly tailored to
our needs. It could be made a little better by using the API Dave
Chinner proposed, but it is still not ideal, since this is not really a
count-and-scan event but rather a one-off flush-all-you-can event that
would have to abuse that interface somehow.
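
Condensed, the flow added below is roughly the following sketch (locking,
the css_offline/css_free hooks and error handling elided; memcg_park_dead
and memcg_reap_dead are just shorthand for the hunks that follow):

	static LIST_HEAD(dangling_memcgs);

	/* css_offline: park the dead memcg instead of forgetting about it */
	static void memcg_park_dead(struct mem_cgroup *memcg)
	{
		list_add(&memcg->dead, &dangling_memcgs);
	}

	/* vmpressure in-kernel event (see the previous patch): reap them */
	static void memcg_reap_dead(void)
	{
		struct mem_cgroup *memcg;
		struct memcg_cache_params *params;

		list_for_each_entry(memcg, &dangling_memcgs, dead)
			list_for_each_entry(params, &memcg->memcg_slab_caches, list)
				kmem_cache_shrink(memcg_params_to_cache(params));
	}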

Signed-off-by: Glauber Costa <glommer@openvz.org>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c |   82 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 79 insertions(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index aed1456015cf..14b152f4c69b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -328,8 +328,16 @@ struct mem_cgroup {
 	/* thresholds for mem+swap usage. RCU-protected */
 	struct mem_cgroup_thresholds memsw_thresholds;
 
-	/* For oom notifier event fd */
-	struct list_head oom_notify;
+	union {
+		/* For oom notifier event fd */
+		struct list_head oom_notify;
+		/*
+		 * we can only trigger an oom event if the memcg is alive.
+		 * so we will reuse this field to hook the memcg in the list
+		 * of dead memcgs.
+		 */
+		struct list_head dead;
+	};
 
 	/*
 	 * Should we move charges of a task when a task is moved into this
@@ -6134,6 +6142,58 @@ static int mem_cgroup_oom_control_write(struct cgroup_subsys_state *css,
 }
 
 #ifdef CONFIG_MEMCG_KMEM
+static LIST_HEAD(dangling_memcgs);
+static DEFINE_MUTEX(dangling_memcgs_mutex);
+
+static void memcg_dangling_add(struct mem_cgroup *memcg)
+{
+	mutex_lock(&dangling_memcgs_mutex);
+	list_add(&memcg->dead, &dangling_memcgs);
+	mutex_unlock(&dangling_memcgs_mutex);
+}
+
+static void memcg_dangling_del(struct mem_cgroup *memcg)
+{
+	mutex_lock(&dangling_memcgs_mutex);
+	list_del(&memcg->dead);
+	mutex_unlock(&dangling_memcgs_mutex);
+}
+
+static void memcg_vmpressure_shrink_dead(void)
+{
+	struct memcg_cache_params *params, *tmp;
+	struct kmem_cache *cachep;
+	struct mem_cgroup *memcg;
+
+	mutex_lock(&dangling_memcgs_mutex);
+	list_for_each_entry(memcg, &dangling_memcgs, dead) {
+		mutex_lock(&memcg->slab_caches_mutex);
+		/* The element may go away as an indirect result of shrink */
+		list_for_each_entry_safe(params, tmp,
+					 &memcg->memcg_slab_caches, list) {
+			cachep = memcg_params_to_cache(params);
+			/*
+			 * the cpu_hotplug lock is taken in kmem_cache_create
+			 * outside the slab_caches_mutex manipulation. It will
+			 * be taken by kmem_cache_shrink to flush the cache.
+			 * So we need to drop the lock. It is all right because
+			 * the lock only protects elements moving in and out the
+			 * list.
+			 */
+			mutex_unlock(&memcg->slab_caches_mutex);
+			kmem_cache_shrink(cachep);
+			mutex_lock(&memcg->slab_caches_mutex);
+		}
+		mutex_unlock(&memcg->slab_caches_mutex);
+	}
+	mutex_unlock(&dangling_memcgs_mutex);
+}
+
+static void memcg_register_kmem_events(struct mem_cgroup *memcg)
+{
+	vmpressure_register_kernel_event(memcg, memcg_vmpressure_shrink_dead);
+}
+
 static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 {
 	int ret;
@@ -6217,6 +6277,18 @@ static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
 		css_put(&memcg->css);
 }
 #else
+static void memcg_dangling_add(struct mem_cgroup *memcg)
+{
+}
+
+static void memcg_dangling_del(struct mem_cgroup *memcg)
+{
+}
+
+static void memcg_register_kmem_events(struct mem_cgroup *memcg)
+{
+}
+
 static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 {
 	return 0;
@@ -6767,8 +6839,10 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
 	if (css->cgroup->id > MEM_CGROUP_ID_MAX)
 		return -ENOSPC;
 
-	if (!parent)
+	if (!parent) {
+		memcg_register_kmem_events(memcg);
 		return 0;
+	}
 
 	mutex_lock(&memcg_create_mutex);
 
@@ -6842,6 +6916,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	mem_cgroup_invalidate_reclaim_iterators(memcg);
 	mem_cgroup_reparent_charges(memcg);
 	mem_cgroup_destroy_all_caches(memcg);
+	memcg_dangling_add(memcg);
 	vmpressure_cleanup(&memcg->vmpressure);
 }
 
@@ -6886,6 +6961,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 	mem_cgroup_reparent_charges(memcg);
 
 	memcg_destroy_kmem(memcg);
+	memcg_dangling_del(memcg);
 	__mem_cgroup_free(memcg);
 }
 
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH -mm v15 00/13] kmemcg shrinkers
  2014-02-05 18:39 ` Vladimir Davydov
@ 2014-02-11 15:15   ` Vladimir Davydov
  -1 siblings, 0 replies; 44+ messages in thread
From: Vladimir Davydov @ 2014-02-11 15:15 UTC (permalink / raw)
  To: dchinner, mhocko, hannes
  Cc: akpm, glommer, rientjes, linux-kernel, linux-mm, devel

Hi Michal, Johannes, David,

Could you please take a look at this if you have time? Without your
review, it'll never get committed.

Thank you.

On 02/05/2014 10:39 PM, Vladimir Davydov wrote:
> Hi,
>
> This is the 15th iteration of Glauber Costa's patch-set implementing slab
> shrinking on memcg pressure. The main idea is to make the list_lru structure
> used by most FS shrinkers per-memcg. When adding or removing an element from a
> list_lru, we use the page information to figure out which memcg it belongs to
> and relay it to the appropriate list. This allows scanning kmem objects
> accounted to different memcgs independently.
>
> Please note that this patch-set implements slab shrinking only when we hit the
> user memory limit so that kmem allocations will still fail if we are below the
> user memory limit, but close to the kmem limit. I am going to fix this in a
> separate patch-set, but currently it is only worthwhile setting the kmem limit
> to be greater than the user mem limit just to enable per-memcg slab accounting
> and reclaim.
>
> The patch-set is based on top of v3.14-rc1-mmots-2014-02-04-16-48 (there are
> some vmscan cleanups that I need committed there) and organized as follows:
>  - patches 1-4 introduce some minor changes to memcg needed for this set;
>  - patches 5-7 prepare fs for per-memcg list_lru;
>  - patch 8 implement kmemcg reclaim core;
>  - patch 9 make list_lru per-memcg and patch 10 marks sb shrinker memcg-aware;
>  - patch 10 is trivial - it issues shrinkers on memcg destruction;
>  - patches 12 and 13 introduce shrinking of dead kmem caches to facilitate
>    memcg destruction.
>
> Changes in v15:
>  - remove patches that have been merged to -mm;
>  - fix memory barrier usage in per-memcg list_lru implementation;
>  - fix list_lru_destroy(), which might sleep for per-memcg lrus, called from
>    atomic context (__put_super()).
>
> Previous iterations of this patch-set can be found here:
>  - https://lkml.org/lkml/2013/12/16/206 (v14)
>  - https://lkml.org/lkml/2013/12/9/103 (v13)
>  - https://lkml.org/lkml/2013/12/2/141 (v12)
>  - https://lkml.org/lkml/2013/11/25/214 (v11)
>
> Comments are highly appreciated.
>
> Thanks.
>
> Glauber Costa (6):
>   memcg: make cache index determination more robust
>   memcg: consolidate callers of memcg_cache_id
>   memcg: move initialization to memcg creation
>   memcg: flush memcg items upon memcg destruction
>   vmpressure: in-kernel notifications
>   memcg: reap dead memcgs upon global memory pressure
>
> Vladimir Davydov (7):
>   memcg: make for_each_mem_cgroup macros public
>   list_lru, shrinkers: introduce list_lru_shrink_{count,walk}
>   fs: consolidate {nr,free}_cached_objects args in shrink_control
>   fs: do not call destroy_super() in atomic context
>   vmscan: shrink slab on memcg pressure
>   list_lru: add per-memcg lists
>   fs: make shrinker memcg aware
>
>  fs/dcache.c                |   14 +-
>  fs/gfs2/quota.c            |    6 +-
>  fs/inode.c                 |    7 +-
>  fs/internal.h              |    7 +-
>  fs/super.c                 |   44 ++---
>  fs/xfs/xfs_buf.c           |    7 +-
>  fs/xfs/xfs_qm.c            |    7 +-
>  fs/xfs/xfs_super.c         |    7 +-
>  include/linux/fs.h         |    8 +-
>  include/linux/list_lru.h   |  112 ++++++++++---
>  include/linux/memcontrol.h |   50 ++++++
>  include/linux/shrinker.h   |   10 +-
>  include/linux/vmpressure.h |    5 +
>  mm/list_lru.c              |  271 +++++++++++++++++++++++++++---
>  mm/memcontrol.c            |  399 ++++++++++++++++++++++++++++++++++++++------
>  mm/vmpressure.c            |   53 +++++-
>  mm/vmscan.c                |   94 ++++++++---
>  17 files changed, 926 insertions(+), 175 deletions(-)
>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH -mm v15 00/13] kmemcg shrinkers
  2014-02-11 15:15   ` Vladimir Davydov
@ 2014-02-11 16:53     ` Michal Hocko
  -1 siblings, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2014-02-11 16:53 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: dchinner, hannes, akpm, glommer, rientjes, linux-kernel, linux-mm, devel

On Tue 11-02-14 19:15:26, Vladimir Davydov wrote:
> Hi Michal, Johannes, David,
> 
> Could you please take a look at this if you have time? Without your
> review, it'll never get committed.

Yes, it is on my todo list. I could barely catch up with the discussions
on the previous versions and felt that David was quite concerned about
some high-level decisions. I have to check and re-read whether those
concerns are still there.

I am sorry that it takes so long but I am really busy with internal
things recently.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH -mm v15 00/13] kmemcg shrinkers
  2014-02-11 15:15   ` Vladimir Davydov
@ 2014-02-11 20:19     ` Johannes Weiner
  -1 siblings, 0 replies; 44+ messages in thread
From: Johannes Weiner @ 2014-02-11 20:19 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: dchinner, mhocko, akpm, glommer, rientjes, linux-kernel, linux-mm, devel

On Tue, Feb 11, 2014 at 07:15:26PM +0400, Vladimir Davydov wrote:
> Hi Michal, Johannes, David,
> 
> Could you please take a look at this if you have time? Without your
> review, it'll never get committed.

There is simply no review bandwidth for new features as long as we are
fixing fundamental bugs in memcg.

> On 02/05/2014 10:39 PM, Vladimir Davydov wrote:
> > Hi,
> >
> > This is the 15th iteration of Glauber Costa's patch-set implementing slab
> > shrinking on memcg pressure. The main idea is to make the list_lru structure
> > used by most FS shrinkers per-memcg. When adding or removing an element from a
> > list_lru, we use the page information to figure out which memcg it belongs to
> > and relay it to the appropriate list. This allows scanning kmem objects
> > accounted to different memcgs independently.
> >
> > Please note that this patch-set implements slab shrinking only when we hit the
> > user memory limit so that kmem allocations will still fail if we are below the
> > user memory limit, but close to the kmem limit. I am going to fix this in a
> > separate patch-set, but currently it is only worthwhile setting the kmem limit
> > to be greater than the user mem limit just to enable per-memcg slab accounting
> > and reclaim.
> >
> > The patch-set is based on top of v3.14-rc1-mmots-2014-02-04-16-48 (there are
> > some vmscan cleanups that I need committed there) and organized as follows:
> >  - patches 1-4 introduce some minor changes to memcg needed for this set;
> >  - patches 5-7 prepare fs for per-memcg list_lru;
> >  - patch 8 implement kmemcg reclaim core;
> >  - patch 9 make list_lru per-memcg and patch 10 marks sb shrinker memcg-aware;
> >  - patch 10 is trivial - it issues shrinkers on memcg destruction;
> >  - patches 12 and 13 introduce shrinking of dead kmem caches to facilitate
> >    memcg destruction.

In the context of the ongoing discussions about charge reparenting I
was curious how you deal with charges becoming unreclaimable after a
memcg has been offlined.

Patch #11 drops all charged objects at offlining by just invoking
shrink_slab() in a loop until "only a few" (10) objects are remaining.
How long is this going to take?  And why is it okay to destroy these
caches when somebody else might still be using them?

That still leaves you with the free objects that slab caches retain
for allocation efficiency, so now you put all dead memcgs in the
system on a global list, and on a vmpressure event on root_mem_cgroup
you walk the global list and drain the freelist of all remaining
caches.

This is a lot of complexity and scalability problems for less than
desirable behavior.

Please think about how we can properly reparent kmemcg charges during
memcg teardown.  That would simplify your code immensely and help
clean up this unholy mess of css pinning.

Slab caches are already collected in the memcg and on destruction
could be reassigned to the parent.  Kmemcg uncharge from slab freeing
would have to be updated to use the memcg from the cache, not from the
individual page, but I don't see why this wouldn't work right now.
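
A rough sketch of the uncharge change suggested here, i.e. resolving the memcg
from the cache rather than from the individual page (the helper and the field
access below are assumptions for illustration, not the current API):

/* Sketch only: uncharge slab pages against the cache's owner memcg. */
static void memcg_uncharge_slab_sketch(struct kmem_cache *cachep, int order)
{
	struct mem_cgroup *memcg;

	if (is_root_cache(cachep))	/* root caches are not accounted */
		return;
	memcg = cachep->memcg_params->memcg;		/* assumed layout */
	memcg_uncharge_kmem(memcg, PAGE_SIZE << order);	/* assumed helper */
}

With something like this, reparenting a cache would only mean switching its
owner pointer; the individual pages would no longer pin the dead memcg.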

Charged thread stack pages could be reassigned when the task itself is
migrated out of a cgroup.

It would mean that you can't simply use __GFP_KMEMCG and just pin the
css until you can be bothered to return it.  There must be a way for
any memcg charge to be reparented on demand.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH -mm v15 00/13] kmemcg shrinkers
  2014-02-11 20:19     ` Johannes Weiner
@ 2014-02-12 18:05       ` Vladimir Davydov
  -1 siblings, 0 replies; 44+ messages in thread
From: Vladimir Davydov @ 2014-02-12 18:05 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: dchinner, mhocko, akpm, glommer, rientjes, linux-kernel, linux-mm, devel

On 02/12/2014 12:19 AM, Johannes Weiner wrote:
> On Tue, Feb 11, 2014 at 07:15:26PM +0400, Vladimir Davydov wrote:
>> Hi Michal, Johannes, David,
>>
>> Could you please take a look at this if you have time? Without your
>> review, it'll never get committed.
> There is simply no review bandwidth for new features as long as we are
> fixing fundamental bugs in memcg.
>
>> On 02/05/2014 10:39 PM, Vladimir Davydov wrote:
>>> Hi,
>>>
>>> This is the 15th iteration of Glauber Costa's patch-set implementing slab
>>> shrinking on memcg pressure. The main idea is to make the list_lru structure
>>> used by most FS shrinkers per-memcg. When adding or removing an element from a
>>> list_lru, we use the page information to figure out which memcg it belongs to
>>> and relay it to the appropriate list. This allows scanning kmem objects
>>> accounted to different memcgs independently.
>>>
>>> Please note that this patch-set implements slab shrinking only when we hit the
>>> user memory limit so that kmem allocations will still fail if we are below the
>>> user memory limit, but close to the kmem limit. I am going to fix this in a
>>> separate patch-set, but currently it is only worthwhile setting the kmem limit
>>> to be greater than the user mem limit just to enable per-memcg slab accounting
>>> and reclaim.
>>>
>>> The patch-set is based on top of v3.14-rc1-mmots-2014-02-04-16-48 (there are
>>> some vmscan cleanups that I need committed there) and organized as follows:
>>>  - patches 1-4 introduce some minor changes to memcg needed for this set;
>>>  - patches 5-7 prepare fs for per-memcg list_lru;
>>>  - patch 8 implement kmemcg reclaim core;
>>>  - patch 9 make list_lru per-memcg and patch 10 marks sb shrinker memcg-aware;
>>>  - patch 10 is trivial - it issues shrinkers on memcg destruction;
>>>  - patches 12 and 13 introduce shrinking of dead kmem caches to facilitate
>>>    memcg destruction.
> In the context of the ongoing discussions about charge reparenting I
> was curious how you deal with charges becoming unreclaimable after a
> memcg has been offlined.
>
> Patch #11 drops all charged objects at offlining by just invoking
> shrink_slab() in a loop until "only a few" (10) objects are remaining.
> How long is this going to take?  And why is it okay to destroy these
> caches when somebody else might still be using them?

IMHO, on container destruction we have to drop as many objects accounted
to this container as we can, because otherwise any container will be
able to get access to any number of unaccounted objects by fetching them
and then rebooting.

> That still leaves you with the free objects that slab caches retain
> for allocation efficiency, so now you put all dead memcgs in the
> system on a global list, and on a vmpressure event on root_mem_cgroup
> you walk the global list and drain the freelist of all remaining
> caches.
>
> This is a lot of complexity and scalability problems for less than
> desirable behavior.
>
> Please think about how we can properly reparent kmemcg charges during
> memcg teardown.  That would simplify your code immensely and help
> clean up this unholy mess of css pinning.
>
> Slab caches are already collected in the memcg and on destruction
> could be reassigned to the parent.  Kmemcg uncharge from slab freeing
> would have to be updated to use the memcg from the cache, not from the
> individual page, but I don't see why this wouldn't work right now.

I don't think I understand what you mean by reassigning slab caches to
the parent.

If you mean moving all pages (slabs) from the cache of the memcg being
destroyed to the corresponding root cache (or the parent memcg's cache)
and then destroying the memcg's cache, I don't think this is feasible,
because the slub free fast path is lockless, so AFAIU we can't remove a
partial slab from a cache w/o risking a race with kmem_cache_free.

If you mean clearing all pointers from the memcg's cache to the memcg
(changing them to the parent or root memcg), then AFAIU this won't solve
the problem with "dangling" caches - we will still have to shrink them
on vmpressure. So although this would allow us to put the reference to
the memcg from kmem caches on memcg's death, it wouldn't simplify the
code at all, in fact, it would even make it more complicated, because we
would have to handle various corner cases like reparenting vs
list_lru_{add,remove}.

> Charged thread stack pages could be reassigned when the task itself is
> migrated out of a cgroup.

Thread info pages are only a part of the problem. If a process kmalloc's
an object of size >= KMALLOC_MAX_CACHE_SIZE, it will be given a compound
page accounted to kmemcg, and we won't be able to find this page given
the memcg it is accounted to (except for walking the whole page range).
Thus we will have to organize those pages in per-memcg lists, won't we?
Again, even more complexity.
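
For reference, a condensed sketch of the fallback described in the previous
paragraph (not the actual slub code; kmalloc_large() and KMALLOC_MAX_CACHE_SIZE
are the real names, everything else is simplified):

/* Simplified: how a large kmalloc bypasses the kmalloc caches. */
static __always_inline void *kmalloc_sketch(size_t size, gfp_t flags)
{
	if (size > KMALLOC_MAX_CACHE_SIZE)
		/*
		 * A compound page straight from the page allocator:
		 * charged to kmemcg, but not reachable from any
		 * per-memcg cache or list_lru we could walk later.
		 */
		return kmalloc_large(size, flags);
	return __kmalloc(size, flags);
}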

Although I agree with you that it would be nice to reparent kmem on
memcg destruction, currently I don't see a way to implement this w/o
significantly complicating the code, but I keep thinking.

Thanks.

> It would mean that you can't simply use __GFP_KMEMCG and just pin the
> css until you can be bothered to return it.  There must be a way for
> any memcg charge to be reparented on demand.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH -mm v15 00/13] kmemcg shrinkers
  2014-02-12 18:05       ` Vladimir Davydov
@ 2014-02-12 22:01         ` Johannes Weiner
  -1 siblings, 0 replies; 44+ messages in thread
From: Johannes Weiner @ 2014-02-12 22:01 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: dchinner, mhocko, akpm, glommer, rientjes, linux-kernel, linux-mm, devel

On Wed, Feb 12, 2014 at 10:05:43PM +0400, Vladimir Davydov wrote:
> On 02/12/2014 12:19 AM, Johannes Weiner wrote:
> > On Tue, Feb 11, 2014 at 07:15:26PM +0400, Vladimir Davydov wrote:
> >> Hi Michal, Johannes, David,
> >>
> >> Could you please take a look at this if you have time? Without your
> >> review, it'll never get committed.
> > There is simply no review bandwidth for new features as long as we are
> > fixing fundamental bugs in memcg.
> >
> >> On 02/05/2014 10:39 PM, Vladimir Davydov wrote:
> >>> Hi,
> >>>
> >>> This is the 15th iteration of Glauber Costa's patch-set implementing slab
> >>> shrinking on memcg pressure. The main idea is to make the list_lru structure
> >>> used by most FS shrinkers per-memcg. When adding or removing an element from a
> >>> list_lru, we use the page information to figure out which memcg it belongs to
> >>> and relay it to the appropriate list. This allows scanning kmem objects
> >>> accounted to different memcgs independently.
> >>>
> >>> Please note that this patch-set implements slab shrinking only when we hit the
> >>> user memory limit so that kmem allocations will still fail if we are below the
> >>> user memory limit, but close to the kmem limit. I am going to fix this in a
> >>> separate patch-set, but currently it is only worthwhile setting the kmem limit
> >>> to be greater than the user mem limit just to enable per-memcg slab accounting
> >>> and reclaim.
> >>>
> >>> The patch-set is based on top of v3.14-rc1-mmots-2014-02-04-16-48 (there are
> >>> some vmscan cleanups that I need committed there) and organized as follows:
> >>>  - patches 1-4 introduce some minor changes to memcg needed for this set;
> >>>  - patches 5-7 prepare fs for per-memcg list_lru;
> >>>  - patch 8 implement kmemcg reclaim core;
> >>>  - patch 9 make list_lru per-memcg and patch 10 marks sb shrinker memcg-aware;
> >>>  - patch 10 is trivial - it issues shrinkers on memcg destruction;
> >>>  - patches 12 and 13 introduce shrinking of dead kmem caches to facilitate
> >>>    memcg destruction.
> > In the context of the ongoing discussions about charge reparenting I
> > was curious how you deal with charges becoming unreclaimable after a
> > memcg has been offlined.
> >
> > Patch #11 drops all charged objects at offlining by just invoking
> > shrink_slab() in a loop until "only a few" (10) objects are remaining.
> > How long is this going to take?  And why is it okay to destroy these
> > caches when somebody else might still be using them?
> 
> IMHO, on container destruction we have to drop as many objects accounted
> to this container as we can, because otherwise any container will be
> able to get access to any number of unaccounted objects by fetching them
> and then rebooting.

They're accounted to and subject to the limit of the parent.  I don't
see how this is different than page cache.

> > That still leaves you with the free objects that slab caches retain
> > for allocation efficiency, so now you put all dead memcgs in the
> > system on a global list, and on a vmpressure event on root_mem_cgroup
> > you walk the global list and drain the freelist of all remaining
> > caches.
> >
> > This is a lot of complexity and scalability problems for less than
> > desirable behavior.
> >
> > Please think about how we can properly reparent kmemcg charges during
> > memcg teardown.  That would simplify your code immensely and help
> > clean up this unholy mess of css pinning.
> >
> > Slab caches are already collected in the memcg and on destruction
> > could be reassigned to the parent.  Kmemcg uncharge from slab freeing
> > would have to be updated to use the memcg from the cache, not from the
> > individual page, but I don't see why this wouldn't work right now.
> 
> I don't think I understand what you mean by reassigning slab caches to
> the parent.
>
> If you mean moving all pages (slabs) from the cache of the memcg being
> destroyed to the corresponding root cache (or the parent memcg's cache)
> and then destroying the memcg's cache, I don't think this is feasible,
> because slub free's fast path is lockless, so AFAIU we can't remove a
> partial slab from a cache w/o risking to race with kmem_cache_free.
> 
> If you mean clearing all pointers from the memcg's cache to the memcg
> (changing them to the parent or root memcg), then AFAIU this won't solve
> the problem with "dangling" caches - we will still have to shrink them
> on vmpressure. So although this would allow us to put the reference to
> the memcg from kmem caches on memcg's death, it wouldn't simplify the
> code at all, in fact, it would even make it more complicated, because we
> would have to handle various corner cases like reparenting vs
> list_lru_{add,remove}.

I think we have different concepts of what's complicated.  There is an
existing model of what to do with left-over cache memory when a cgroup
is destroyed, which is reparenting.  The rough steps will be the same,
the object lifetime will be the same, the css refcounting will be the
same, the user-visible behavior will be the same.  Any complexity from
charge vs. reparent races will be contained to a few lines of code.

Weird refcounting tricks during offline, trashing kmem caches instead
of moving them to the parent like other memory, a global list of dead
memcgs and sudden freelist thrashing on a vmpressure event, that's what
adds complexity and what makes this code unpredictable, fragile, and
insanely hard to work with.  It's not acceptable.

By reparenting I meant reassigning the memcg cache parameter pointer
from the slab cache such that it points to the parent.  This should be
an atomic operation.  All css lookups already require RCU (I think slab
does not follow this yet because we guarantee that css reference, but
it should be changed).  So switch the cache param pointer, insert an
RCU grace period to wait for all the ongoing charges and uncharges until
nobody sees the memcg anymore, then safely reparent all the remaining
memcg objects to the parent.  Maybe individually, maybe we can just
splice the lists to the parent's list_lru lists.
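
A rough sketch of the flow described above, assuming the cache parameter's
memcg pointer is RCU-protected; the function, the step ordering and the
locking (elided here) are illustrative, not an existing implementation:

/* Sketch only: reparent kmem state when @memcg is taken offline. */
static void memcg_reparent_kmem_sketch(struct mem_cgroup *memcg)
{
	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
	struct memcg_cache_params *params;

	if (!parent)
		parent = root_mem_cgroup;

	/* 1. Repoint each per-memcg cache at the parent. */
	list_for_each_entry(params, &memcg->memcg_slab_caches, list)
		rcu_assign_pointer(params->memcg, parent);

	/* 2. Hand the cache ownership records to the parent as well. */
	list_splice_init(&memcg->memcg_slab_caches,
			 &parent->memcg_slab_caches);

	/* 3. Wait until no charge/uncharge path can see the old memcg. */
	synchronize_rcu();

	/*
	 * 4. Whatever is left (e.g. per-memcg list_lru entries) can now
	 *    be moved to the parent, individually or by splicing lists.
	 */
}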

I'm not sure I understand how the dangling cache problem pertains to
this, isn't this an entirely separate issue?

> > Charged thread stack pages could be reassigned when the task itself is
> > migrated out of a cgroup.
> 
> Thread info pages are only a part of the problem. If a process kmalloc's
> an object of size >= KMALLOC_MAX_CACHE_SIZE, it will be given a compound
> page accounted to kmemcg, and we won't be able to find this page given
> the memcg it is accounted to (except for walking the whole page range).
> Thus we will have to organize those pages in per-memcg lists, won't we?
> Again, even more complexity.

Why do we track them in the first place?  We don't track any random
page allocation, so we shouldn't track kmalloc() that falls back to the
page allocator.  In fact we shouldn't track any random slab allocation.

The types of allocations we primarily want to track are the ones that
directly scale with user behavior.  This is a finite set, which on a
global level is covered mostly by ulimits.  After all, an unprivileged
user interfering with other users is not a new problem and existed long
before memcg.

It was a mistake to provide __GFP_KMEMCG and allow charging any random
allocation, without giving memcg the means to actually manage that
memory.  I don't see that such flexibility is even needed, and it clearly
hurts us now.  It was a choice we made to keep things simple in the
beginning, before we knew how all this is going to turn out.  We should
rectify this mistake before building on top of it.

Here is the much bigger issue behind this:

Memcg should be a thin layer of accounting and limiting between the VM
and cgroup core code, but look at the line count.  It's more code than
all of the page reclaim logic combined, including the page replacement
algorithm, rmap, LRU list handling - all of which already include a
good deal of memcg specifics.  It's the same size as the scheduler
core, which includes the entire cpu cgroup controller.  And all this
other code is of better quality and has more eyes on it than memcg.

Please demonstrate that you try to see the bigger picture behind memcg
and make an effort to keep things simple beyond the code you introduce
and the niche you care about, otherwise I'm not willing to take any
patches from you that don't straight-up delete stuff.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH -mm v15 00/13] kmemcg shrinkers
  2014-02-12 22:01         ` Johannes Weiner
@ 2014-02-13 17:33           ` Vladimir Davydov
  -1 siblings, 0 replies; 44+ messages in thread
From: Vladimir Davydov @ 2014-02-13 17:33 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: dchinner, mhocko, akpm, glommer, rientjes, linux-kernel, linux-mm, devel

On 02/13/2014 02:01 AM, Johannes Weiner wrote:
> On Wed, Feb 12, 2014 at 10:05:43PM +0400, Vladimir Davydov wrote:
>> On 02/12/2014 12:19 AM, Johannes Weiner wrote:
>>> On Tue, Feb 11, 2014 at 07:15:26PM +0400, Vladimir Davydov wrote:
>>>> Hi Michal, Johannes, David,
>>>>
>>>> Could you please take a look at this if you have time? Without your
>>>> review, it'll never get committed.
>>> There is simply no review bandwidth for new features as long as we are
>>> fixing fundamental bugs in memcg.
>>>
>>>> On 02/05/2014 10:39 PM, Vladimir Davydov wrote:
>>>>> Hi,
>>>>>
>>>>> This is the 15th iteration of Glauber Costa's patch-set implementing slab
>>>>> shrinking on memcg pressure. The main idea is to make the list_lru structure
>>>>> used by most FS shrinkers per-memcg. When adding or removing an element from a
>>>>> list_lru, we use the page information to figure out which memcg it belongs to
>>>>> and relay it to the appropriate list. This allows scanning kmem objects
>>>>> accounted to different memcgs independently.
>>>>>
>>>>> Please note that this patch-set implements slab shrinking only when we hit the
>>>>> user memory limit so that kmem allocations will still fail if we are below the
>>>>> user memory limit, but close to the kmem limit. I am going to fix this in a
>>>>> separate patch-set, but currently it is only worthwhile setting the kmem limit
>>>>> to be greater than the user mem limit just to enable per-memcg slab accounting
>>>>> and reclaim.
>>>>>
>>>>> The patch-set is based on top of v3.14-rc1-mmots-2014-02-04-16-48 (there are
>>>>> some vmscan cleanups that I need committed there) and organized as follows:
>>>>>  - patches 1-4 introduce some minor changes to memcg needed for this set;
>>>>>  - patches 5-7 prepare fs for per-memcg list_lru;
>>>>>  - patch 8 implement kmemcg reclaim core;
>>>>>  - patch 9 make list_lru per-memcg and patch 10 marks sb shrinker memcg-aware;
>>>>>  - patch 10 is trivial - it issues shrinkers on memcg destruction;
>>>>>  - patches 12 and 13 introduce shrinking of dead kmem caches to facilitate
>>>>>    memcg destruction.
>>> In the context of the ongoing discussions about charge reparenting I
>>> was curious how you deal with charges becoming unreclaimable after a
>>> memcg has been offlined.
>>>
>>> Patch #11 drops all charged objects at offlining by just invoking
>>> shrink_slab() in a loop until "only a few" (10) objects are remaining.
>>> How long is this going to take?  And why is it okay to destroy these
>>> caches when somebody else might still be using them?
>> IMHO, on container destruction we have to drop as many objects accounted
>> to this container as we can, because otherwise any container will be
>> able to get access to any number of unaccounted objects by fetching them
>> and then rebooting.
> They're accounted to and subject to the limit of the parent.  I don't
> see how this is different than page cache.
>
>>> That still leaves you with the free objects that slab caches retain
>>> for allocation efficiency, so now you put all dead memcgs in the
>>> system on a global list, and on a vmpressure event on root_mem_cgroup
>>> you walk the global list and drain the freelist of all remaining
>>> caches.
>>>
>>> This is a lot of complexity and scalability problems for less than
>>> desirable behavior.
>>>
>>> Please think about how we can properly reparent kmemcg charges during
>>> memcg teardown.  That would simplify your code immensely and help
>>> clean up this unholy mess of css pinning.
>>>
>>> Slab caches are already collected in the memcg and on destruction
>>> could be reassigned to the parent.  Kmemcg uncharge from slab freeing
>>> would have to be updated to use the memcg from the cache, not from the
>>> individual page, but I don't see why this wouldn't work right now.
>> I don't think I understand what you mean by reassigning slab caches to
>> the parent.
>>
>> If you mean moving all pages (slabs) from the cache of the memcg being
>> destroyed to the corresponding root cache (or the parent memcg's cache)
>> and then destroying the memcg's cache, I don't think this is feasible,
>> because slub free's fast path is lockless, so AFAIU we can't remove a
>> partial slab from a cache w/o risking to race with kmem_cache_free.
>>
>> If you mean clearing all pointers from the memcg's cache to the memcg
>> (changing them to the parent or root memcg), then AFAIU this won't solve
>> the problem with "dangling" caches - we will still have to shrink them
>> on vmpressure. So although this would allow us to put the reference to
>> the memcg from kmem caches on memcg's death, it wouldn't simplify the
>> code at all, in fact, it would even make it more complicated, because we
>> would have to handle various corner cases like reparenting vs
>> list_lru_{add,remove}.
> I think we have different concepts of what's complicated.  There is an
> existing model of what to do with left-over cache memory when a cgroup
> is destroyed, which is reparenting.  The rough steps will be the same,
> the object lifetime will be the same, the css refcounting will be the
> same, the user-visible behavior will be the same.  Any complexity from
> charge vs. reparent races will be contained to a few lines of code.
>
> Weird refcounting tricks during offline, trashing kmem caches instead
> of moving them to the parent like other memory, a global list of dead
> memcgs and sudden freelist thrashing on a vmpressure event, that's what
> adds complexity and what makes this code unpredictable, fragile, and
> insanely hard to work with.  It's not acceptable.
>
> By reparenting I meant reassigning the memcg cache parameter pointer
> from the slab cache such that it points to the parent.  This should be
> an atomic operation.  All css lookups already require RCU (I think slab
> does not follow this yet because we guarantee that css reference, but
> it should be changed).  So switch the cache param pointer, insert an
> RCU graceperiod to wait for all the ongoing charges and uncharges until
> nobody sees the memcg anymore, then safely reparent all the remaining
> memcg objects to the parent.  Maybe individually, maybe we can just
> splice the lists to the parent's list_lru lists.

But what should we do with objects that do not reside on any list_lru?
How can we reparent them?

Let's forget about per-memcg list_lru for a while, because
implementing kmem reparenting requires reworking the existing code
that handles per-memcg kmem caches. I would really appreciate it if
you could share your vision of how we should get rid of per-memcg
caches on memcg destruction. Currently they are left hanging around
until all of the memcg's objects are freed. Should we try to move
individual slab pages to the parent's cache and then destroy the
memcg's cache? Or perhaps you mean that all accounted kmem objects
should be tracked in the memcg by some means (list_lru in the case of
the dcache/icache)?

Please excuse me if I annoy you, but before trying to change anything
I'd like to ensure I understand you right.

Thank you.

> I'm not sure I understand how the dangling cache problem pertains to
> this, isn't this an entirely separate issue?
>
>>> Charged thread stack pages could be reassigned when the task itself is
>>> migrated out of a cgroup.
>> Thread info pages are only a part of the problem. If a process kmalloc's
>> an object of size >= KMALLOC_MAX_CACHE_SIZE, it will be given a compound
>> page accounted to kmemcg, and we won't be able to find this page given
>> the memcg it is accounted to (except for walking the whole page range).
>> Thus we will have to organize those pages in per-memcg lists, won't we?
>> Again, even more complexity.
> Why do we track them in the first place?  We don't track any random
> page allocation, so we shouldn't track kmalloc() that falls back to the
> page allocator.  In fact we shouldn't track any random slab allocation.
>
> The types of allocations we primarily want to track are the ones that
> directly scale with user behavior.  This is a finite set, which on a
> global level is covered mostly by ulimits.  After all, an unprivileged
> user interfering with other users is not a new problem and existed long
> before memcg.
>
> It was a mistake to provide __GFP_KMEMCG and allow charging any random
> allocation, without giving memcg the means to actually manage that
> memory.  I don't see that such flexibility even needed, and it clearly
> hurts us now.  It was a choice we made to keep things simple in the
> beginning, before we knew how all this is going to turn out.  We should
> rectify this mistake before building on top of it.
>
> Here is the much bigger issue behind this:
>
> Memcg should be a thin layer of accounting and limiting between the VM
> and cgroup core code, but look at the line count.  It's more code than
> all of the page reclaim logic combined, including the page replacement
> algorithm, rmap, LRU list handling - all of which already include a
> good deal of memcg specifics.  It's the same size as the scheduler
> core, which includes the entire cpu cgroup controller.  And all this
> other code is of better quality and has more eyes on it than memcg.
>
> Please demonstrate that you try to see the bigger picture behind memcg
> and make an effort to keep things simple beyond the code you introduce
> and the niche you care about, otherwise I'm not willing to take any
> patches from you that don't straight-up delete stuff.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH -mm v15 00/13] kmemcg shrinkers
  2014-02-13 17:33           ` Vladimir Davydov
@ 2014-02-13 21:20             ` Johannes Weiner
  -1 siblings, 0 replies; 44+ messages in thread
From: Johannes Weiner @ 2014-02-13 21:20 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: dchinner, mhocko, akpm, glommer, rientjes, linux-kernel, linux-mm, devel

On Thu, Feb 13, 2014 at 09:33:32PM +0400, Vladimir Davydov wrote:
> On 02/13/2014 02:01 AM, Johannes Weiner wrote:
> > On Wed, Feb 12, 2014 at 10:05:43PM +0400, Vladimir Davydov wrote:
> >> On 02/12/2014 12:19 AM, Johannes Weiner wrote:
> >>> On Tue, Feb 11, 2014 at 07:15:26PM +0400, Vladimir Davydov wrote:
> >>>> Hi Michal, Johannes, David,
> >>>>
> >>>> Could you please take a look at this if you have time? Without your
> >>>> review, it'll never get committed.
> >>> There is simply no review bandwidth for new features as long as we are
> >>> fixing fundamental bugs in memcg.
> >>>
> >>>> On 02/05/2014 10:39 PM, Vladimir Davydov wrote:
> >>>>> Hi,
> >>>>>
> >>>>> This is the 15th iteration of Glauber Costa's patch-set implementing slab
> >>>>> shrinking on memcg pressure. The main idea is to make the list_lru structure
> >>>>> used by most FS shrinkers per-memcg. When adding or removing an element from a
> >>>>> list_lru, we use the page information to figure out which memcg it belongs to
> >>>>> and relay it to the appropriate list. This allows scanning kmem objects
> >>>>> accounted to different memcgs independently.
> >>>>>
> >>>>> Please note that this patch-set implements slab shrinking only when we hit the
> >>>>> user memory limit so that kmem allocations will still fail if we are below the
> >>>>> user memory limit, but close to the kmem limit. I am going to fix this in a
> >>>>> separate patch-set, but currently it is only worthwhile setting the kmem limit
> >>>>> to be greater than the user mem limit just to enable per-memcg slab accounting
> >>>>> and reclaim.
> >>>>>
> >>>>> The patch-set is based on top of v3.14-rc1-mmots-2014-02-04-16-48 (there are
> >>>>> some vmscan cleanups that I need committed there) and organized as follows:
> >>>>>  - patches 1-4 introduce some minor changes to memcg needed for this set;
> >>>>>  - patches 5-7 prepare fs for per-memcg list_lru;
> >>>>>  - patch 8 implement kmemcg reclaim core;
> >>>>>  - patch 9 make list_lru per-memcg and patch 10 marks sb shrinker memcg-aware;
> >>>>>  - patch 10 is trivial - it issues shrinkers on memcg destruction;
> >>>>>  - patches 12 and 13 introduce shrinking of dead kmem caches to facilitate
> >>>>>    memcg destruction.
> >>> In the context of the ongoing discussions about charge reparenting I
> >>> was curious how you deal with charges becoming unreclaimable after a
> >>> memcg has been offlined.
> >>>
> >>> Patch #11 drops all charged objects at offlining by just invoking
> >>> shrink_slab() in a loop until "only a few" (10) objects are remaining.
> >>> How long is this going to take?  And why is it okay to destroy these
> >>> caches when somebody else might still be using them?
> >> IMHO, on container destruction we have to drop as many objects accounted
> >> to this container as we can, because otherwise any container will be
> >> able to get access to any number of unaccounted objects by fetching them
> >> and then rebooting.
> > They're accounted to and subject to the limit of the parent.  I don't
> > see how this is different than page cache.
> >
> >>> That still leaves you with the free objects that slab caches retain
> >>> for allocation efficiency, so now you put all dead memcgs in the
> >>> system on a global list, and on a vmpressure event on root_mem_cgroup
> >>> you walk the global list and drain the freelist of all remaining
> >>> caches.
> >>>
> >>> This is a lot of complexity and scalability problems for less than
> >>> desirable behavior.
> >>>
> >>> Please think about how we can properly reparent kmemcg charges during
> >>> memcg teardown.  That would simplify your code immensely and help
> >>> clean up this unholy mess of css pinning.
> >>>
> >>> Slab caches are already collected in the memcg and on destruction
> >>> could be reassigned to the parent.  Kmemcg uncharge from slab freeing
> >>> would have to be updated to use the memcg from the cache, not from the
> >>> individual page, but I don't see why this wouldn't work right now.
> >> I don't think I understand what you mean by reassigning slab caches to
> >> the parent.
> >>
> >> If you mean moving all pages (slabs) from the cache of the memcg being
> >> destroyed to the corresponding root cache (or the parent memcg's cache)
> >> and then destroying the memcg's cache, I don't think this is feasible,
> >> because slub free's fast path is lockless, so AFAIU we can't remove a
> >> partial slab from a cache w/o risking to race with kmem_cache_free.
> >>
> >> If you mean clearing all pointers from the memcg's cache to the memcg
> >> (changing them to the parent or root memcg), then AFAIU this won't solve
> >> the problem with "dangling" caches - we will still have to shrink them
> >> on vmpressure. So although this would allow us to put the reference to
> >> the memcg from kmem caches on memcg's death, it wouldn't simplify the
> >> code at all, in fact, it would even make it more complicated, because we
> >> would have to handle various corner cases like reparenting vs
> >> list_lru_{add,remove}.
> > I think we have different concepts of what's complicated.  There is an
> > existing model of what to do with left-over cache memory when a cgroup
> > is destroyed, which is reparenting.  The rough steps will be the same,
> > the object lifetime will be the same, the css refcounting will be the
> > same, the user-visible behavior will be the same.  Any complexity from
> > charge vs. reparent races will be contained to a few lines of code.
> >
> > Weird refcounting tricks during offline, trashing kmem caches instead
> > of moving them to the parent like other memory, a global list of dead
> > memcgs and sudden freelist thrashing on a vmpressure event, that's what
> > adds complexity and what makes this code unpredictable, fragile, and
> > insanely hard to work with.  It's not acceptable.
> >
> > By reparenting I meant reassigning the memcg cache parameter pointer
> > from the slab cache such that it points to the parent.  This should be
> > an atomic operation.  All css lookups already require RCU (I think slab
> > does not follow this yet because we guarantee that css reference, but
> > it should be changed).  So switch the cache param pointer, insert an
> > RCU graceperiod to wait for all the ongoing charges and uncharges until
> > nobody sees the memcg anymore, then safely reparent all the remaining
> > memcg objects to the parent.  Maybe individually, maybe we can just
> > splice the lists to the parent's list_lru lists.
> 
> But what should we do with objects that do not reside on any list_lru?
> How can we reparent them?

If there are no actual list_lru objects, we only have to make sure
that any allocations are properly uncharged against the parent when
they get freed later.

If the slab freeing path uncharged against the per-memcg cache's
backpointer (s->memcg_params->memcg) instead of the per-page memcg
association, then we could reparent whole caches with a single pointer
update, without touching each individual slab page.  The kmemcg
interface for slab would have to be reworked to stop using
lookup_page_cgroup() and instead have slab pass s->memcg_params->memcg.
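
A minimal sketch of such a hook, assuming is_root_cache() and
memcg_uncharge_kmem() keep their current roles and that the
backpointer is RCU-protected as described above (a guess at the
reworked interface, not the actual code):

/*
 * Sketch: uncharge a slab page against the cache's memcg backpointer
 * instead of the page's page_cgroup.  The backpointer may already have
 * been switched to the parent by reparenting; both cases are fine.
 */
static void memcg_uncharge_slab(struct kmem_cache *s, int order)
{
	struct mem_cgroup *memcg;

	if (is_root_cache(s))
		return;

	rcu_read_lock();
	memcg = rcu_dereference(s->memcg_params->memcg);
	memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
	rcu_read_unlock();
}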

Once that is in place, css_free() can walk memcg->memcg_slab_caches
and move all the items to the parent's memcg_slab_caches, and while
doing that change the memcg pointer of each item, memcg_params->memcg,
to point to the parent.  The cache is now owned by the parent and will
stay alive until the last page is freed.
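
Roughly, that walk could look like the sketch below (the
memcg_slab_caches and ->list names are taken from this thread, the
locking is only hinted at, memcg_params_to_cache() is assumed to be
the existing helper mapping params back to its cache, and
kmem_cache_shrink() stands in for whatever freelist purging ends up
being used):

/*
 * Sketch of the css_free()-time walk: hand every per-memcg cache over
 * to the parent while no new allocations can arrive, and drain the
 * freelists on the way.  Assumes both lists are protected by the
 * slab_caches mutexes (not shown).
 */
static void memcg_pass_kmem_to_parent(struct mem_cgroup *memcg)
{
	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
	struct memcg_cache_params *params, *tmp;

	list_for_each_entry_safe(params, tmp, &memcg->memcg_slab_caches, list) {
		/* No tasks left in the group: a good time to purge freelists. */
		kmem_cache_shrink(memcg_params_to_cache(params));

		params->memcg = parent;		/* single pointer update */
		list_move(&params->list, &parent->memcg_slab_caches);
	}
}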

There won't be any new allocations in these caches because there are
no tasks in the group anymore, so no races from that side, and the
perfect time to shrink the freelists.

There will be racing frees of outstanding allocations, but we can deal
with that.  Frees use page->slab_cache to look up the proper per-memcg
cache, which may or may not have been reparented at this time.  If it
has not, the cache's memcg backpointer (s->memcg_params->memcg) still
points to the dying child (RCU protected), so the free uncharges both
the child's res_counter and the parent's.  If the backpointer IS
already pointing to the parent, the free uncharges the parent's
res_counter without the child's, but we don't care; the child is dying
anyway.

If there are no list_lru tracked objects, we are done at this point.
The css can be freed, the freelists have been purged, and any pages
still in the cache will get unaccounted properly from the parent.

If there are list_lru objects, they have to be moved to the parent's
list_lru so that they can be reclaimed properly on memory pressure.
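
Purely as an illustration of that last step (the per-memcg list_lru
layout below, an array of per-memcg, per-node lists, is a guess at
what this series introduces, not its actual interface):

/*
 * Hypothetical sketch: splice a dead memcg's list_lru objects onto the
 * parent's lists.  Assumes lru->memcg_lrus[idx] is a per-memcg array
 * of per-node list_lru_node (lock, list, nr_items), which is only a
 * guess at the layout added by this series.
 */
static void list_lru_reparent(struct list_lru *lru, int child_idx,
			      int parent_idx)
{
	int nid;

	for_each_node(nid) {
		struct list_lru_node *child =
			&lru->memcg_lrus[child_idx]->node[nid];
		struct list_lru_node *parent =
			&lru->memcg_lrus[parent_idx]->node[nid];

		spin_lock(&parent->lock);
		spin_lock_nested(&child->lock, SINGLE_DEPTH_NESTING);
		list_splice_init(&child->list, &parent->list);
		parent->nr_items += child->nr_items;
		child->nr_items = 0;
		spin_unlock(&child->lock);
		spin_unlock(&parent->lock);
	}
}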

Does this make sense?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH -mm v15 00/13] kmemcg shrinkers
  2014-02-13 21:20             ` Johannes Weiner
@ 2014-02-14 19:09               ` Vladimir Davydov
  -1 siblings, 0 replies; 44+ messages in thread
From: Vladimir Davydov @ 2014-02-14 19:09 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: dchinner, mhocko, akpm, glommer, rientjes, linux-kernel, linux-mm, devel

On 02/14/2014 01:20 AM, Johannes Weiner wrote:
> On Thu, Feb 13, 2014 at 09:33:32PM +0400, Vladimir Davydov wrote:
>> On 02/13/2014 02:01 AM, Johannes Weiner wrote:
>>> On Wed, Feb 12, 2014 at 10:05:43PM +0400, Vladimir Davydov wrote:
>>>> On 02/12/2014 12:19 AM, Johannes Weiner wrote:
>>>>> On Tue, Feb 11, 2014 at 07:15:26PM +0400, Vladimir Davydov wrote:
>>>>>> Hi Michal, Johannes, David,
>>>>>>
>>>>>> Could you please take a look at this if you have time? Without your
>>>>>> review, it'll never get committed.
>>>>> There is simply no review bandwidth for new features as long as we are
>>>>> fixing fundamental bugs in memcg.
>>>>>
>>>>>> On 02/05/2014 10:39 PM, Vladimir Davydov wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> This is the 15th iteration of Glauber Costa's patch-set implementing slab
>>>>>>> shrinking on memcg pressure. The main idea is to make the list_lru structure
>>>>>>> used by most FS shrinkers per-memcg. When adding or removing an element from a
>>>>>>> list_lru, we use the page information to figure out which memcg it belongs to
>>>>>>> and relay it to the appropriate list. This allows scanning kmem objects
>>>>>>> accounted to different memcgs independently.
>>>>>>>
>>>>>>> Please note that this patch-set implements slab shrinking only when we hit the
>>>>>>> user memory limit so that kmem allocations will still fail if we are below the
>>>>>>> user memory limit, but close to the kmem limit. I am going to fix this in a
>>>>>>> separate patch-set, but currently it is only worthwhile setting the kmem limit
>>>>>>> to be greater than the user mem limit just to enable per-memcg slab accounting
>>>>>>> and reclaim.
>>>>>>>
>>>>>>> The patch-set is based on top of v3.14-rc1-mmots-2014-02-04-16-48 (there are
>>>>>>> some vmscan cleanups that I need committed there) and organized as follows:
>>>>>>>  - patches 1-4 introduce some minor changes to memcg needed for this set;
>>>>>>>  - patches 5-7 prepare fs for per-memcg list_lru;
>>>>>>>  - patch 8 implement kmemcg reclaim core;
>>>>>>>  - patch 9 make list_lru per-memcg and patch 10 marks sb shrinker memcg-aware;
>>>>>>>  - patch 10 is trivial - it issues shrinkers on memcg destruction;
>>>>>>>  - patches 12 and 13 introduce shrinking of dead kmem caches to facilitate
>>>>>>>    memcg destruction.
>>>>> In the context of the ongoing discussions about charge reparenting I
>>>>> was curious how you deal with charges becoming unreclaimable after a
>>>>> memcg has been offlined.
>>>>>
>>>>> Patch #11 drops all charged objects at offlining by just invoking
>>>>> shrink_slab() in a loop until "only a few" (10) objects are remaining.
>>>>> How long is this going to take?  And why is it okay to destroy these
>>>>> caches when somebody else might still be using them?
>>>> IMHO, on container destruction we have to drop as many objects accounted
>>>> to this container as we can, because otherwise any container will be
>>>> able to get access to any number of unaccounted objects by fetching them
>>>> and then rebooting.
>>> They're accounted to and subject to the limit of the parent.  I don't
>>> see how this is different than page cache.
>>>
>>>>> That still leaves you with the free objects that slab caches retain
>>>>> for allocation efficiency, so now you put all dead memcgs in the
>>>>> system on a global list, and on a vmpressure event on root_mem_cgroup
>>>>> you walk the global list and drain the freelist of all remaining
>>>>> caches.
>>>>>
>>>>> This is a lot of complexity and scalability problems for less than
>>>>> desirable behavior.
>>>>>
>>>>> Please think about how we can properly reparent kmemcg charges during
>>>>> memcg teardown.  That would simplify your code immensely and help
>>>>> clean up this unholy mess of css pinning.
>>>>>
>>>>> Slab caches are already collected in the memcg and on destruction
>>>>> could be reassigned to the parent.  Kmemcg uncharge from slab freeing
>>>>> would have to be updated to use the memcg from the cache, not from the
>>>>> individual page, but I don't see why this wouldn't work right now.
>>>> I don't think I understand what you mean by reassigning slab caches to
>>>> the parent.
>>>>
>>>> If you mean moving all pages (slabs) from the cache of the memcg being
>>>> destroyed to the corresponding root cache (or the parent memcg's cache)
>>>> and then destroying the memcg's cache, I don't think this is feasible,
>>>> because slub free's fast path is lockless, so AFAIU we can't remove a
>>>> partial slab from a cache w/o risking to race with kmem_cache_free.
>>>>
>>>> If you mean clearing all pointers from the memcg's cache to the memcg
>>>> (changing them to the parent or root memcg), then AFAIU this won't solve
>>>> the problem with "dangling" caches - we will still have to shrink them
>>>> on vmpressure. So although this would allow us to put the reference to
>>>> the memcg from kmem caches on memcg's death, it wouldn't simplify the
>>>> code at all, in fact, it would even make it more complicated, because we
>>>> would have to handle various corner cases like reparenting vs
>>>> list_lru_{add,remove}.
>>> I think we have different concepts of what's complicated.  There is an
>>> existing model of what to do with left-over cache memory when a cgroup
>>> is destroyed, which is reparenting.  The rough steps will be the same,
>>> the object lifetime will be the same, the css refcounting will be the
>>> same, the user-visible behavior will be the same.  Any complexity from
>>> charge vs. reparent races will be contained to a few lines of code.
>>>
>>> Weird refcounting tricks during offline, trashing kmem caches instead
>>> of moving them to the parent like other memory, a global list of dead
>>> memcgs and sudden freelist thrashing on a vmpressure event, that's what
>>> adds complexity and what makes this code unpredictable, fragile, and
>>> insanely hard to work with.  It's not acceptable.
>>>
>>> By reparenting I meant reassigning the memcg cache parameter pointer
>>> from the slab cache such that it points to the parent.  This should be
>>> an atomic operation.  All css lookups already require RCU (I think slab
>>> does not follow this yet because we guarantee that css reference, but
>>> it should be changed).  So switch the cache param pointer, insert an
>>> RCU graceperiod to wait for all the ongoing charges and uncharges until
>>> nobody sees the memcg anymore, then safely reparent all the remaining
>>> memcg objects to the parent.  Maybe individually, maybe we can just
>>> splice the lists to the parent's list_lru lists.
>> But what should we do with objects that do not reside on any list_lru?
>> How can we reparent them?
> If there are no actual list_lru objects, we only have to make sure
> that any allocations are properly uncharged against the parent when
> they get freed later.
>
> If the slab freeing path would uncharge against the per-memcg cache's
> backpointer (s->memcg_params->memcg) instead of the per-page memcg
> association, then we could reparent whole caches with a single pointer
> update, without touching each individual slab page.  The kmemcg
> interface for slab would have to be reworked to not use
> lookup_page_cgroup() but have slab pass s->memcg_params->memcg.
>
> Once that is in place, css_free() can walk memcg->memcg_slab_caches
> and move all the items to the parent's memcg_slab_caches, and while
> doing that change the memcg pointer of each item, memcg_params->memcg,
> to point to the parent.  The cache is now owned by the parent and will
> stay alive until the last page is freed.
>
> There won't be any new allocations in these caches because there are
> no tasks in the group anymore, so no races from that side, and the
> perfect time to shrink the freelists.
>
> There will be racing frees of outstanding allocations, but we can deal
> with that.  Frees use page->slab_cache to look up the proper per-memcg
> cache, which may or may not have been reparented at this time.  If it
> has not, the cache's memcg backpointer (s->memcg_params->memcg) will
> point to the dying child (rcu protected) and uncharge the child's
> res_counter and the parent's.  If the backpointer IS already pointing
> to the parent, it will uncharge the res_counter of the parent without
> the child's but we don't care, it's dying anyway.
>
> If there are no list_lru tracked objects, we are done at this point.
> The css can be freed, the freelists have been purged, and any pages
> still in the cache will get unaccounted properly from the parent.
>
> If there are list_lru objects, they have to be moved to the parent's
> list_lru so that they can be reclaimed properly on memory pressure.
>
> Does this make sense?

Definitely. Thank you very much for such a detailed explanation! I guess
I'll try to implement this.

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2014-02-14 19:09 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-02-05 18:39 [PATCH -mm v15 00/13] kmemcg shrinkers Vladimir Davydov
2014-02-05 18:39 ` [PATCH -mm v15 01/13] memcg: make cache index determination more robust Vladimir Davydov
2014-02-05 18:39 ` [PATCH -mm v15 02/13] memcg: consolidate callers of memcg_cache_id Vladimir Davydov
2014-02-05 18:39 ` [PATCH -mm v15 03/13] memcg: move initialization to memcg creation Vladimir Davydov
2014-02-05 18:39 ` [PATCH -mm v15 04/13] memcg: make for_each_mem_cgroup macros public Vladimir Davydov
2014-02-05 18:39 ` [PATCH -mm v15 05/13] list_lru, shrinkers: introduce list_lru_shrink_{count,walk} Vladimir Davydov
2014-02-05 18:39 ` [PATCH -mm v15 06/13] fs: consolidate {nr,free}_cached_objects args in shrink_control Vladimir Davydov
2014-02-05 18:39 ` [PATCH -mm v15 07/13] fs: do not call destroy_super() in atomic context Vladimir Davydov
2014-02-05 18:39 ` [PATCH -mm v15 08/13] vmscan: shrink slab on memcg pressure Vladimir Davydov
2014-02-05 18:39 ` [PATCH -mm v15 09/13] list_lru: add per-memcg lists Vladimir Davydov
2014-02-05 18:39 ` [PATCH -mm v15 10/13] fs: make shrinker memcg aware Vladimir Davydov
2014-02-05 18:39 ` [PATCH -mm v15 11/13] memcg: flush memcg items upon memcg destruction Vladimir Davydov
2014-02-05 18:39 ` [PATCH -mm v15 12/13] vmpressure: in-kernel notifications Vladimir Davydov
2014-02-05 18:39 ` [PATCH -mm v15 13/13] memcg: reap dead memcgs upon global memory pressure Vladimir Davydov
2014-02-11 15:15 ` [PATCH -mm v15 00/13] kmemcg shrinkers Vladimir Davydov
2014-02-11 16:53   ` Michal Hocko
2014-02-11 20:19   ` Johannes Weiner
2014-02-12 18:05     ` Vladimir Davydov
2014-02-12 22:01       ` Johannes Weiner
2014-02-13 17:33         ` Vladimir Davydov
2014-02-13 21:20           ` Johannes Weiner
2014-02-14 19:09             ` Vladimir Davydov
