* [PATCH v13 00/16] kmemcg shrinkers
@ 2013-12-09  8:05 ` Vladimir Davydov
  0 siblings, 0 replies; 81+ messages in thread
From: Vladimir Davydov @ 2013-12-09  8:05 UTC (permalink / raw)
  To: dchinner, hannes, mhocko, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, glommer, vdavydov

Hi,

This is the 13th iteration of Glauber Costa's patch-set implementing slab
shrinking on memcg pressure. The main idea is to make the list_lru structure
used by most FS shrinkers per-memcg. When adding or removing an element from a
list_lru, we use the page information to figure out which memcg it belongs to
and relay it to the appropriate list. This allows scanning kmem objects
accounted to different memcgs independently.
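
To illustrate the idea, here is a minimal standalone C sketch (toy names and a
fixed-size id array, not the kernel's list_lru API): each LRU keeps one list
per memcg id, an element is routed to the list of the memcg it is accounted
to, and each memcg's objects can then be walked independently.

	#include <stddef.h>
	#include <stdio.h>

	#define NR_MEMCGS 4	/* illustration only; the kernel grows this dynamically */

	struct toy_item {
		struct toy_item *next;
		int memcg_id;	/* stands in for deriving the memcg from the page */
	};

	/* toy "list_lru": one list head per memcg */
	struct toy_list_lru {
		struct toy_item *head[NR_MEMCGS];
	};

	static void toy_lru_add(struct toy_list_lru *lru, struct toy_item *item)
	{
		/* route the element to the list of the memcg it is accounted to */
		item->next = lru->head[item->memcg_id];
		lru->head[item->memcg_id] = item;
	}

	/* count only the objects accounted to one memcg */
	static int toy_lru_count(const struct toy_list_lru *lru, int memcg_id)
	{
		int n = 0;
		const struct toy_item *item;

		for (item = lru->head[memcg_id]; item; item = item->next)
			n++;
		return n;
	}

	int main(void)
	{
		struct toy_list_lru lru = { { NULL } };
		struct toy_item a = { NULL, 0 }, b = { NULL, 1 }, c = { NULL, 1 };

		toy_lru_add(&lru, &a);
		toy_lru_add(&lru, &b);
		toy_lru_add(&lru, &c);
		printf("memcg 0: %d, memcg 1: %d\n",
		       toy_lru_count(&lru, 0), toy_lru_count(&lru, 1));
		return 0;
	}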

Please note that, in contrast to previous versions, this patch-set implements
slab shrinking only when we hit the user memory limit, so kmem allocations will
still fail if we are below the user memory limit but close to the kmem limit.
This is because the implementation of kmem-only reclaim was rather incomplete -
we had to fail GFP_NOFS allocations, since everything we could reclaim was FS
data. I will try to improve this and send it in a separate patch-set; for now,
it is only worthwhile setting the kmem limit greater than the user memory limit
just to enable per-memcg slab accounting and reclaim.

The patch-set is based on top of Linux-3.13-rc3 and organized as follows:
 - patches 1-9 prepare vmscan, memcontrol, and list_lru for kmemcg reclaim;
 - patch 10 implements the kmemcg reclaim core;
 - patch 11 makes the list_lru struct per-memcg and patch 12 marks all
   list_lru-based shrinkers as memcg-aware;
 - patches 13-16 slightly improve memcontrol behavior regarding mem reclaim.

Changes in v13:
 - fix NUMA-unaware shrinkers not being called when node 0 is not set in the
   nodemask;
 - rework the list_lru API to require a shrink_control;
 - make list_lru handle memcgs automatically, without introducing a separate
   struct;
 - simplify the walk over all memcg LRUs of a list_lru;
 - clean up shrink_slab();
 - remove kmem-only reclaim, as explained above.

Previous iterations of this patch-set can be found here:
 - https://lkml.org/lkml/2013/12/2/141
 - https://lkml.org/lkml/2013/11/25/214

Thanks.

Glauber Costa (7):
  memcg: make cache index determination more robust
  memcg: consolidate callers of memcg_cache_id
  memcg: move initialization to memcg creation
  vmscan: take at least one pass with shrinkers
  vmpressure: in-kernel notifications
  memcg: reap dead memcgs upon global memory pressure
  memcg: flush memcg items upon memcg destruction

Vladimir Davydov (9):
  memcg: move memcg_caches_array_size() function
  vmscan: move call to shrink_slab() to shrink_zones()
  vmscan: remove shrink_control arg from do_try_to_free_pages()
  vmscan: call NUMA-unaware shrinkers irrespective of nodemask
  mm: list_lru: require shrink_control in count, walk functions
  fs: consolidate {nr,free}_cached_objects args in shrink_control
  vmscan: shrink slab on memcg pressure
  mm: list_lru: add per-memcg lists
  fs: mark list_lru based shrinkers memcg aware

 fs/dcache.c                |   17 +-
 fs/gfs2/quota.c            |   10 +-
 fs/inode.c                 |    8 +-
 fs/internal.h              |    9 +-
 fs/super.c                 |   24 ++-
 fs/xfs/xfs_buf.c           |   16 +-
 fs/xfs/xfs_qm.c            |    8 +-
 fs/xfs/xfs_super.c         |    6 +-
 include/linux/fs.h         |    6 +-
 include/linux/list_lru.h   |   83 ++++++----
 include/linux/memcontrol.h |   35 ++++
 include/linux/shrinker.h   |   10 +-
 include/linux/vmpressure.h |    5 +
 mm/list_lru.c              |  263 +++++++++++++++++++++++++++---
 mm/memcontrol.c            |  384 +++++++++++++++++++++++++++++++++++++++-----
 mm/vmpressure.c            |   53 +++++-
 mm/vmscan.c                |  172 +++++++++++++-------
 17 files changed, 888 insertions(+), 221 deletions(-)

-- 
1.7.10.4


^ permalink raw reply	[flat|nested] 81+ messages in thread

* [PATCH v13 01/16] memcg: make cache index determination more robust
  2013-12-09  8:05 ` Vladimir Davydov
@ 2013-12-09  8:05   ` Vladimir Davydov
  -1 siblings, 0 replies; 81+ messages in thread
From: Vladimir Davydov @ 2013-12-09  8:05 UTC (permalink / raw)
  To: dchinner, hannes, mhocko, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, glommer,
	vdavydov, Balbir Singh, KAMEZAWA Hiroyuki

From: Glauber Costa <glommer@openvz.org>

I caught myself doing something like the following outside memcg core:

	memcg_id = -1;
	if (memcg && memcg_kmem_is_active(memcg))
		memcg_id = memcg_cache_id(memcg);

to be able to handle all possible memcgs in a sane manner. In particular, the
root cache will have kmemcg_id = -1 (just because we don't call memcg_kmem_init
for the root cache, since it is not limitable). We have always coped with that
by making sure we sanitize which cache is passed to memcg_cache_id. Although
this example is given for root, what we really need to know is whether or not a
cache is kmem active.

But outside the memcg core, testing for root, for instance, is not trivial,
since we don't export mem_cgroup_is_root. I ended up realizing that these tests
really belong inside memcg_cache_id. This patch moves a similar but stronger
test inside memcg_cache_id and makes sure it always returns a meaningful value.
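
For illustration, a minimal standalone sketch of the caller pattern this
change enables (toy types and names, not the kernel's; the helper body mirrors
the hunk below):

	#include <stdbool.h>
	#include <stdio.h>

	struct toy_memcg { int kmemcg_id; bool kmem_active; };

	/* after this patch, the sanitizing lives inside the helper itself */
	static int toy_memcg_cache_id(const struct toy_memcg *memcg)
	{
		if (!memcg || !memcg->kmem_active)
			return -1;
		return memcg->kmemcg_id;
	}

	int main(void)
	{
		const struct toy_memcg *root = NULL;	/* or any non-kmem-active memcg */
		struct toy_memcg kmem_limited = { 3, true };

		/* callers no longer need the "if (memcg && memcg_kmem_is_active(memcg))" dance */
		printf("root: %d, kmem-limited: %d\n",
		       toy_memcg_cache_id(root), toy_memcg_cache_id(&kmem_limited));
		return 0;
	}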

Signed-off-by: Glauber Costa <glommer@openvz.org>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f1a0ae6..02b5176 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3073,7 +3073,9 @@ void memcg_cache_list_add(struct mem_cgroup *memcg, struct kmem_cache *cachep)
  */
 int memcg_cache_id(struct mem_cgroup *memcg)
 {
-	return memcg ? memcg->kmemcg_id : -1;
+	if (!memcg || !memcg_can_account_kmem(memcg))
+		return -1;
+	return memcg->kmemcg_id;
 }
 
 /*
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH v13 02/16] memcg: consolidate callers of memcg_cache_id
  2013-12-09  8:05 ` Vladimir Davydov
@ 2013-12-09  8:05   ` Vladimir Davydov
  -1 siblings, 0 replies; 81+ messages in thread
From: Vladimir Davydov @ 2013-12-09  8:05 UTC (permalink / raw)
  To: dchinner, hannes, mhocko, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, glommer,
	vdavydov, Balbir Singh, KAMEZAWA Hiroyuki

From: Glauber Costa <glommer@openvz.org>

Each caller of memcg_cache_id ends up sanitizing its parameters in its own way.
Now that memcg_cache_id itself is more robust, we can consolidate this.

Also, as suggested by Michal, a special helper, memcg_cache_idx, is used when
the result is expected to be used directly as an array index, to make sure we
never access the array at a negative index.
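
For illustration only, a standalone sketch of the intended division of labour
between the two helpers (simplified types; assert() stands in for BUG_ON, and
all names here are made up):

	#include <assert.h>
	#include <stdbool.h>
	#include <stdio.h>

	struct toy_memcg { int kmemcg_id; bool kmem_active; };

	/* may legitimately return -1 for memcgs that do not account kmem */
	static int toy_cache_id(const struct toy_memcg *memcg)
	{
		if (!memcg || !memcg->kmem_active)
			return -1;
		return memcg->kmemcg_id;
	}

	/* only for callers that use the result directly as an array index */
	static int toy_cache_idx(const struct toy_memcg *memcg)
	{
		int ret = toy_cache_id(memcg);

		assert(ret >= 0);	/* the kernel would BUG_ON(ret < 0) here */
		return ret;
	}

	int main(void)
	{
		struct toy_memcg kmem = { 2, true };
		const char *caches[4] = { "c0", "c1", "c2", "c3" };

		/* generic callers check for -1; array users go through toy_cache_idx() */
		printf("id=%d cache=%s\n",
		       toy_cache_id(&kmem), caches[toy_cache_idx(&kmem)]);
		return 0;
	}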

Signed-off-by: Glauber Costa <glommer@openvz.org>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   49 +++++++++++++++++++++++++++++--------------------
 1 file changed, 29 insertions(+), 20 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 02b5176..144cb4c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2960,6 +2960,30 @@ static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
 }
 
 /*
+ * helper for acessing a memcg's index. It will be used as an index in the
+ * child cache array in kmem_cache, and also to derive its name. This function
+ * will return -1 when this is not a kmem-limited memcg.
+ */
+int memcg_cache_id(struct mem_cgroup *memcg)
+{
+	if (!memcg || !memcg_can_account_kmem(memcg))
+		return -1;
+	return memcg->kmemcg_id;
+}
+
+/*
+ * This helper around memcg_cache_id is not intented for use outside memcg
+ * core. It is meant for places where the cache id is used directly as an array
+ * index
+ */
+static int memcg_cache_idx(struct mem_cgroup *memcg)
+{
+	int ret = memcg_cache_id(memcg);
+	BUG_ON(ret < 0);
+	return ret;
+}
+
+/*
  * This is a bit cumbersome, but it is rarely used and avoids a backpointer
  * in the memcg_cache_params struct.
  */
@@ -2969,7 +2993,7 @@ static struct kmem_cache *memcg_params_to_cache(struct memcg_cache_params *p)
 
 	VM_BUG_ON(p->is_root_cache);
 	cachep = p->root_cache;
-	return cache_from_memcg_idx(cachep, memcg_cache_id(p->memcg));
+	return cache_from_memcg_idx(cachep, memcg_cache_idx(p->memcg));
 }
 
 #ifdef CONFIG_SLABINFO
@@ -3067,18 +3091,6 @@ void memcg_cache_list_add(struct mem_cgroup *memcg, struct kmem_cache *cachep)
 }
 
 /*
- * helper for acessing a memcg's index. It will be used as an index in the
- * child cache array in kmem_cache, and also to derive its name. This function
- * will return -1 when this is not a kmem-limited memcg.
- */
-int memcg_cache_id(struct mem_cgroup *memcg)
-{
-	if (!memcg || !memcg_can_account_kmem(memcg))
-		return -1;
-	return memcg->kmemcg_id;
-}
-
-/*
  * This ends up being protected by the set_limit mutex, during normal
  * operation, because that is its main call site.
  *
@@ -3240,7 +3252,7 @@ void memcg_release_cache(struct kmem_cache *s)
 		goto out;
 
 	memcg = s->memcg_params->memcg;
-	id  = memcg_cache_id(memcg);
+	id = memcg_cache_idx(memcg);
 
 	root = s->memcg_params->root_cache;
 	root->memcg_params->memcg_caches[id] = NULL;
@@ -3403,9 +3415,7 @@ static struct kmem_cache *memcg_create_kmem_cache(struct mem_cgroup *memcg,
 	struct kmem_cache *new_cachep;
 	int idx;
 
-	BUG_ON(!memcg_can_account_kmem(memcg));
-
-	idx = memcg_cache_id(memcg);
+	idx = memcg_cache_idx(memcg);
 
 	mutex_lock(&memcg_cache_mutex);
 	new_cachep = cache_from_memcg_idx(cachep, idx);
@@ -3578,10 +3588,9 @@ struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep,
 	rcu_read_lock();
 	memcg = mem_cgroup_from_task(rcu_dereference(current->mm->owner));
 
-	if (!memcg_can_account_kmem(memcg))
-		goto out;
-
 	idx = memcg_cache_id(memcg);
+	if (idx < 0)
+		goto out;
 
 	/*
 	 * barrier to mare sure we're always seeing the up to date value.  The
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH v13 03/16] memcg: move initialization to memcg creation
  2013-12-09  8:05 ` Vladimir Davydov
@ 2013-12-09  8:05   ` Vladimir Davydov
  -1 siblings, 0 replies; 81+ messages in thread
From: Vladimir Davydov @ 2013-12-09  8:05 UTC (permalink / raw)
  To: dchinner, hannes, mhocko, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, glommer,
	vdavydov, Balbir Singh, KAMEZAWA Hiroyuki

From: Glauber Costa <glommer@openvz.org>

Those structures are only used for memcgs that are effectively using
kmemcg. However, in a later patch I intend to scan that list unconditionally
(an empty list meaning no kmem caches are present), which simplifies the code
a lot.

So move the initialization to memcg creation.
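
As a standalone illustration of why unconditional scanning wants unconditional
initialization (toy list, not the kernel's list API; everything here is made
up for the example):

	#include <stdio.h>

	struct toy_node { struct toy_node *next; };

	/* toy "slab caches" list: usable only once it has been initialized */
	struct toy_cache_list {
		struct toy_node *head;
		int initialized;
	};

	/* safe to call for every memcg only if every memcg initializes the list */
	static int toy_count(const struct toy_cache_list *list)
	{
		int n = 0;
		const struct toy_node *node;

		if (!list->initialized)
			return -1;	/* walking it would be undefined behaviour */
		for (node = list->head; node; node = node->next)
			n++;
		return n;
	}

	int main(void)
	{
		/* initialized at creation: empty simply means "no kmem caches" */
		struct toy_cache_list at_creation = { NULL, 1 };
		/* initialized only on kmem activation: callers must check first */
		struct toy_cache_list on_activation = { NULL, 0 };

		printf("%d %d\n", toy_count(&at_creation), toy_count(&on_activation));
		return 0;
	}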

Signed-off-by: Glauber Costa <glommer@openvz.org>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 144cb4c..3a4e2f8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3122,8 +3122,6 @@ int memcg_update_cache_sizes(struct mem_cgroup *memcg)
 	}
 
 	memcg->kmemcg_id = num;
-	INIT_LIST_HEAD(&memcg->memcg_slab_caches);
-	mutex_init(&memcg->slab_caches_mutex);
 	return 0;
 }
 
@@ -5909,6 +5907,8 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 {
 	int ret;
 
+	INIT_LIST_HEAD(&memcg->memcg_slab_caches);
+	mutex_init(&memcg->slab_caches_mutex);
 	memcg->kmemcg_id = -1;
 	ret = memcg_propagate_kmem(memcg);
 	if (ret)
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH v13 04/16] memcg: move memcg_caches_array_size() function
  2013-12-09  8:05 ` Vladimir Davydov
@ 2013-12-09  8:05   ` Vladimir Davydov
  -1 siblings, 0 replies; 81+ messages in thread
From: Vladimir Davydov @ 2013-12-09  8:05 UTC (permalink / raw)
  To: dchinner, hannes, mhocko, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, glommer,
	vdavydov, Balbir Singh, KAMEZAWA Hiroyuki

I need to move this function up a bit, and I am doing it in a separate patch
just to reduce churn in the patch that needs it.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3a4e2f8..220b463 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2983,6 +2983,21 @@ static int memcg_cache_idx(struct mem_cgroup *memcg)
 	return ret;
 }
 
+static size_t memcg_caches_array_size(int num_groups)
+{
+	ssize_t size;
+	if (num_groups <= 0)
+		return 0;
+
+	size = 2 * num_groups;
+	if (size < MEMCG_CACHES_MIN_SIZE)
+		size = MEMCG_CACHES_MIN_SIZE;
+	else if (size > MEMCG_CACHES_MAX_SIZE)
+		size = MEMCG_CACHES_MAX_SIZE;
+
+	return size;
+}
+
 /*
  * This is a bit cumbersome, but it is rarely used and avoids a backpointer
  * in the memcg_cache_params struct.
@@ -3125,21 +3140,6 @@ int memcg_update_cache_sizes(struct mem_cgroup *memcg)
 	return 0;
 }
 
-static size_t memcg_caches_array_size(int num_groups)
-{
-	ssize_t size;
-	if (num_groups <= 0)
-		return 0;
-
-	size = 2 * num_groups;
-	if (size < MEMCG_CACHES_MIN_SIZE)
-		size = MEMCG_CACHES_MIN_SIZE;
-	else if (size > MEMCG_CACHES_MAX_SIZE)
-		size = MEMCG_CACHES_MAX_SIZE;
-
-	return size;
-}
-
 /*
  * We should update the current array size iff all caches updates succeed. This
  * can only be done from the slab side. The slab mutex needs to be held when
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH v13 05/16] vmscan: move call to shrink_slab() to shrink_zones()
  2013-12-09  8:05 ` Vladimir Davydov
@ 2013-12-09  8:05   ` Vladimir Davydov
  -1 siblings, 0 replies; 81+ messages in thread
From: Vladimir Davydov @ 2013-12-09  8:05 UTC (permalink / raw)
  To: dchinner, hannes, mhocko, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, glommer,
	vdavydov, Mel Gorman, Rik van Riel

This reduces the indentation level of do_try_to_free_pages() and removes the
extra loop over all eligible zones that counts the number of on-LRU pages.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c |   57 ++++++++++++++++++++++++++-------------------------------
 1 file changed, 26 insertions(+), 31 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index eea668d..035ab3a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2273,13 +2273,17 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
  * the caller that it should consider retrying the allocation instead of
  * further reclaim.
  */
-static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
+static bool shrink_zones(struct zonelist *zonelist,
+			 struct scan_control *sc,
+			 struct shrink_control *shrink)
 {
 	struct zoneref *z;
 	struct zone *zone;
 	unsigned long nr_soft_reclaimed;
 	unsigned long nr_soft_scanned;
+	unsigned long lru_pages = 0;
 	bool aborted_reclaim = false;
+	struct reclaim_state *reclaim_state = current->reclaim_state;
 
 	/*
 	 * If the number of buffer_heads in the machine exceeds the maximum
@@ -2289,6 +2293,8 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	if (buffer_heads_over_limit)
 		sc->gfp_mask |= __GFP_HIGHMEM;
 
+	nodes_clear(shrink->nodes_to_scan);
+
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 					gfp_zone(sc->gfp_mask), sc->nodemask) {
 		if (!populated_zone(zone))
@@ -2300,6 +2306,10 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 		if (global_reclaim(sc)) {
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 				continue;
+
+			lru_pages += zone_reclaimable_pages(zone);
+			node_set(zone_to_nid(zone), shrink->nodes_to_scan);
+
 			if (sc->priority != DEF_PRIORITY &&
 			    !zone_reclaimable(zone))
 				continue;	/* Let kswapd poll it */
@@ -2336,6 +2346,20 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 		shrink_zone(zone, sc);
 	}
 
+	/*
+	 * Don't shrink slabs when reclaiming memory from over limit
+	 * cgroups but do shrink slab at least once when aborting
+	 * reclaim for compaction to avoid unevenly scanning file/anon
+	 * LRU pages over slab pages.
+	 */
+	if (global_reclaim(sc)) {
+		shrink_slab(shrink, sc->nr_scanned, lru_pages);
+		if (reclaim_state) {
+			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
+			reclaim_state->reclaimed_slab = 0;
+		}
+	}
+
 	return aborted_reclaim;
 }
 
@@ -2380,9 +2404,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 					struct shrink_control *shrink)
 {
 	unsigned long total_scanned = 0;
-	struct reclaim_state *reclaim_state = current->reclaim_state;
-	struct zoneref *z;
-	struct zone *zone;
 	unsigned long writeback_threshold;
 	bool aborted_reclaim;
 
@@ -2395,34 +2416,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
 				sc->priority);
 		sc->nr_scanned = 0;
-		aborted_reclaim = shrink_zones(zonelist, sc);
-
-		/*
-		 * Don't shrink slabs when reclaiming memory from over limit
-		 * cgroups but do shrink slab at least once when aborting
-		 * reclaim for compaction to avoid unevenly scanning file/anon
-		 * LRU pages over slab pages.
-		 */
-		if (global_reclaim(sc)) {
-			unsigned long lru_pages = 0;
+		aborted_reclaim = shrink_zones(zonelist, sc, shrink);
 
-			nodes_clear(shrink->nodes_to_scan);
-			for_each_zone_zonelist(zone, z, zonelist,
-					gfp_zone(sc->gfp_mask)) {
-				if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
-					continue;
-
-				lru_pages += zone_reclaimable_pages(zone);
-				node_set(zone_to_nid(zone),
-					 shrink->nodes_to_scan);
-			}
-
-			shrink_slab(shrink, sc->nr_scanned, lru_pages);
-			if (reclaim_state) {
-				sc->nr_reclaimed += reclaim_state->reclaimed_slab;
-				reclaim_state->reclaimed_slab = 0;
-			}
-		}
 		total_scanned += sc->nr_scanned;
 		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
 			goto out;
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH v13 06/16] vmscan: remove shrink_control arg from do_try_to_free_pages()
  2013-12-09  8:05 ` Vladimir Davydov
@ 2013-12-09  8:05   ` Vladimir Davydov
  -1 siblings, 0 replies; 81+ messages in thread
From: Vladimir Davydov @ 2013-12-09  8:05 UTC (permalink / raw)
  To: dchinner, hannes, mhocko, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, glommer,
	vdavydov, Mel Gorman, Rik van Riel

There is no need to pass a shrink_control struct from try_to_free_pages() and
friends down to do_try_to_free_pages() and then to shrink_zones(), because it
is only used in shrink_zones(), and the only field initialized at the top
level is gfp_mask, which is always equal to scan_control.gfp_mask. So let's
move the shrink_control initialization into shrink_zones().

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c |   32 ++++++++++++--------------------
 1 file changed, 12 insertions(+), 20 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 035ab3a..33b356e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2274,8 +2274,7 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
  * further reclaim.
  */
 static bool shrink_zones(struct zonelist *zonelist,
-			 struct scan_control *sc,
-			 struct shrink_control *shrink)
+			 struct scan_control *sc)
 {
 	struct zoneref *z;
 	struct zone *zone;
@@ -2284,6 +2283,9 @@ static bool shrink_zones(struct zonelist *zonelist,
 	unsigned long lru_pages = 0;
 	bool aborted_reclaim = false;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
+	struct shrink_control shrink = {
+		.gfp_mask = sc->gfp_mask,
+	};
 
 	/*
 	 * If the number of buffer_heads in the machine exceeds the maximum
@@ -2293,7 +2295,7 @@ static bool shrink_zones(struct zonelist *zonelist,
 	if (buffer_heads_over_limit)
 		sc->gfp_mask |= __GFP_HIGHMEM;
 
-	nodes_clear(shrink->nodes_to_scan);
+	nodes_clear(shrink.nodes_to_scan);
 
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 					gfp_zone(sc->gfp_mask), sc->nodemask) {
@@ -2308,7 +2310,7 @@ static bool shrink_zones(struct zonelist *zonelist,
 				continue;
 
 			lru_pages += zone_reclaimable_pages(zone);
-			node_set(zone_to_nid(zone), shrink->nodes_to_scan);
+			node_set(zone_to_nid(zone), shrink.nodes_to_scan);
 
 			if (sc->priority != DEF_PRIORITY &&
 			    !zone_reclaimable(zone))
@@ -2353,7 +2355,7 @@ static bool shrink_zones(struct zonelist *zonelist,
 	 * LRU pages over slab pages.
 	 */
 	if (global_reclaim(sc)) {
-		shrink_slab(shrink, sc->nr_scanned, lru_pages);
+		shrink_slab(&shrink, sc->nr_scanned, lru_pages);
 		if (reclaim_state) {
 			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
 			reclaim_state->reclaimed_slab = 0;
@@ -2400,8 +2402,7 @@ static bool all_unreclaimable(struct zonelist *zonelist,
  * 		else, the number of pages reclaimed
  */
 static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
-					struct scan_control *sc,
-					struct shrink_control *shrink)
+					  struct scan_control *sc)
 {
 	unsigned long total_scanned = 0;
 	unsigned long writeback_threshold;
@@ -2416,7 +2417,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
 				sc->priority);
 		sc->nr_scanned = 0;
-		aborted_reclaim = shrink_zones(zonelist, sc, shrink);
+		aborted_reclaim = shrink_zones(zonelist, sc);
 
 		total_scanned += sc->nr_scanned;
 		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
@@ -2579,9 +2580,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 		.target_mem_cgroup = NULL,
 		.nodemask = nodemask,
 	};
-	struct shrink_control shrink = {
-		.gfp_mask = sc.gfp_mask,
-	};
 
 	/*
 	 * Do not enter reclaim if fatal signal was delivered while throttled.
@@ -2595,7 +2593,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 				sc.may_writepage,
 				gfp_mask);
 
-	nr_reclaimed = do_try_to_free_pages(zonelist, &sc, &shrink);
+	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
 
 	trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
 
@@ -2662,9 +2660,6 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 		.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
 				(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
 	};
-	struct shrink_control shrink = {
-		.gfp_mask = sc.gfp_mask,
-	};
 
 	/*
 	 * Unlike direct reclaim via alloc_pages(), memcg's reclaim doesn't
@@ -2679,7 +2674,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 					    sc.may_writepage,
 					    sc.gfp_mask);
 
-	nr_reclaimed = do_try_to_free_pages(zonelist, &sc, &shrink);
+	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
 
 	trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
 
@@ -3335,9 +3330,6 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
 		.order = 0,
 		.priority = DEF_PRIORITY,
 	};
-	struct shrink_control shrink = {
-		.gfp_mask = sc.gfp_mask,
-	};
 	struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask);
 	struct task_struct *p = current;
 	unsigned long nr_reclaimed;
@@ -3347,7 +3339,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
-	nr_reclaimed = do_try_to_free_pages(zonelist, &sc, &shrink);
+	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
 
 	p->reclaim_state = NULL;
 	lockdep_clear_current_reclaim_state();
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH v13 07/16] vmscan: call NUMA-unaware shrinkers irrespective of nodemask
  2013-12-09  8:05 ` Vladimir Davydov
@ 2013-12-09  8:05   ` Vladimir Davydov
  -1 siblings, 0 replies; 81+ messages in thread
From: Vladimir Davydov @ 2013-12-09  8:05 UTC (permalink / raw)
  To: dchinner, hannes, mhocko, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, glommer,
	vdavydov, Mel Gorman, Rik van Riel

If a shrinker is not NUMA-aware, shrink_slab() should call it exactly once
with nid=0, but currently this is not the case: if node 0 is not set in the
nodemask, or if it is not online, we will not call such shrinkers at all. As a
result, some slabs are left untouched under some circumstances. Let us fix it.
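
A standalone model of the intended dispatch (simplified; the node mask, online
map and shrinker structure are stand-ins for the kernel's):

	#include <stdbool.h>
	#include <stdio.h>

	#define NR_NODES 4

	struct toy_shrinker {
		bool numa_aware;
		const char *name;
	};

	/* stand-in for shrink_slab_node(): just report which node we scan */
	static void toy_shrink_node(const struct toy_shrinker *s, int nid)
	{
		printf("%s: scanning nid=%d\n", s->name, nid);
	}

	static void toy_shrink_slab(const struct toy_shrinker *s,
				    const bool nodes_to_scan[NR_NODES],
				    const bool node_online[NR_NODES])
	{
		int nid;

		if (!s->numa_aware) {
			/*
			 * NUMA-unaware shrinkers are called exactly once, with
			 * nid = 0, regardless of the mask or online state.
			 */
			toy_shrink_node(s, 0);
			return;
		}

		for (nid = 0; nid < NR_NODES; nid++)
			if (nodes_to_scan[nid] && node_online[nid])
				toy_shrink_node(s, nid);
	}

	int main(void)
	{
		/* node 0 deliberately absent from the mask */
		bool mask[NR_NODES]   = { false, true, true, false };
		bool online[NR_NODES] = { true,  true, false, true };
		struct toy_shrinker plain = { false, "numa-unaware" };
		struct toy_shrinker numa  = { true,  "numa-aware" };

		toy_shrink_slab(&plain, mask, online);	/* still runs once, nid=0 */
		toy_shrink_slab(&numa, mask, online);	/* runs for nid=1 only */
		return 0;
	}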

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Reported-by: Dave Chinner <dchinner@redhat.com>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c |   19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 33b356e..d98f272 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -352,16 +352,17 @@ unsigned long shrink_slab(struct shrink_control *shrinkctl,
 	}
 
 	list_for_each_entry(shrinker, &shrinker_list, list) {
-		for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
-			if (!node_online(shrinkctl->nid))
-				continue;
-
-			if (!(shrinker->flags & SHRINKER_NUMA_AWARE) &&
-			    (shrinkctl->nid != 0))
-				break;
-
+		if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) {
+			shrinkctl->nid = 0;
 			freed += shrink_slab_node(shrinkctl, shrinker,
-				 nr_pages_scanned, lru_pages);
+					nr_pages_scanned, lru_pages);
+			continue;
+		}
+
+		for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
+			if (node_online(shrinkctl->nid))
+				freed += shrink_slab_node(shrinkctl, shrinker,
+						nr_pages_scanned, lru_pages);
 
 		}
 	}
-- 
1.7.10.4


* [PATCH v13 08/16] mm: list_lru: require shrink_control in count, walk functions
  2013-12-09  8:05 ` Vladimir Davydov
@ 2013-12-09  8:05   ` Vladimir Davydov
  0 siblings, 0 replies; 81+ messages in thread
From: Vladimir Davydov @ 2013-12-09  8:05 UTC (permalink / raw)
  To: dchinner, hannes, mhocko, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, glommer,
	vdavydov, Al Viro

To enable targeted reclaim, the list_lru structure distributes its
elements among several LRU lists. Currently there is one LRU per NUMA
node, and elements from different nodes are placed on different LRUs.
As a result, there are two versions of the count and walk functions:

 - list_lru_count, list_lru_walk - count, walk items from all nodes;
 - list_lru_count_node, list_lru_walk_node - count, walk items from a
   particular node specified in an additional argument.

We are going to make the list_lru structure per-memcg in addition to
being per-node. This will allow us to reclaim slab objects not only on
global memory shortage, but also on memcg pressure. If we followed the
current list_lru interface notation, we would have to add a bunch of
new functions taking a memcg and a node as additional arguments, which
would look cumbersome.

To avoid this, we remove the *_node functions and make list_lru_count
and list_lru_walk require a shrink_control argument so that they will
operate only on the NUMA node specified in shrink_control::nid. If the
caller passes NULL instead of a shrink_control object, the functions
will scan all nodes. This looks sane, because targeted list_lru walks are
only used by shrinkers, which always have a shrink_control object.
Furthermore, when we implement targeted memcg shrinking and add the
memcg field to the shrink_control structure, we will not need to change
the existing list_lru interface.

Thanks to David Chinner for the tip.
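
For reference, a minimal sketch of how a list_lru based shrinker uses
the converted interface (the my_* names are hypothetical; the calling
convention follows the gfs2 and xfs hunks below):

  static struct list_lru my_lru;

  static unsigned long my_count(struct shrinker *s,
                                struct shrink_control *sc)
  {
          /* shrinker context: count only the node given by sc->nid */
          return list_lru_count(&my_lru, sc);
  }

  static unsigned long my_scan(struct shrinker *s,
                               struct shrink_control *sc)
  {
          LIST_HEAD(dispose);
          unsigned long freed;

          /* my_isolate is a list_lru_walk_cb that moves items to @dispose */
          freed = list_lru_walk(&my_lru, sc, my_isolate,
                                &dispose, &sc->nr_to_scan);
          my_dispose(&dispose);
          return freed;
  }

  /* outside shrinker context there is no shrink_control:
   * pass NULL to count/walk all nodes */
  static unsigned long my_total(void)
  {
          return list_lru_count(&my_lru, NULL);
  }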

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 fs/dcache.c              |   17 +++++++++--------
 fs/gfs2/quota.c          |    8 ++++----
 fs/inode.c               |    8 ++++----
 fs/internal.h            |    9 +++++----
 fs/super.c               |   14 ++++++--------
 fs/xfs/xfs_buf.c         |   14 ++++++++------
 fs/xfs/xfs_qm.c          |    6 +++---
 include/linux/list_lru.h |   43 ++++++++++---------------------------------
 mm/list_lru.c            |   45 ++++++++++++++++++++++++++++++++++++++++-----
 9 files changed, 89 insertions(+), 75 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 4bdb300..aac0e61 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -972,8 +972,8 @@ dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg)
 /**
  * prune_dcache_sb - shrink the dcache
  * @sb: superblock
- * @nr_to_scan : number of entries to try to free
- * @nid: which node to scan for freeable entities
+ * @sc: shrink control, passed to list_lru_walk()
+ * @nr_to_scan: number of entries to try to free
  *
  * Attempt to shrink the superblock dcache LRU by @nr_to_scan entries. This is
  * done when we need more memory an called from the superblock shrinker
@@ -982,14 +982,14 @@ dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg)
  * This function may fail to free any resources if all the dentries are in
  * use.
  */
-long prune_dcache_sb(struct super_block *sb, unsigned long nr_to_scan,
-		     int nid)
+long prune_dcache_sb(struct super_block *sb, struct shrink_control *sc,
+		     unsigned long nr_to_scan)
 {
 	LIST_HEAD(dispose);
 	long freed;
 
-	freed = list_lru_walk_node(&sb->s_dentry_lru, nid, dentry_lru_isolate,
-				       &dispose, &nr_to_scan);
+	freed = list_lru_walk(&sb->s_dentry_lru, sc, dentry_lru_isolate,
+			      &dispose, &nr_to_scan);
 	shrink_dentry_list(&dispose);
 	return freed;
 }
@@ -1028,9 +1028,10 @@ void shrink_dcache_sb(struct super_block *sb)
 
 	do {
 		LIST_HEAD(dispose);
+		unsigned long nr_to_scan = ULONG_MAX;
 
-		freed = list_lru_walk(&sb->s_dentry_lru,
-			dentry_lru_isolate_shrink, &dispose, UINT_MAX);
+		freed = list_lru_walk(&sb->s_dentry_lru, NULL,
+			dentry_lru_isolate_shrink, &dispose, &nr_to_scan);
 
 		this_cpu_sub(nr_dentry_unused, freed);
 		shrink_dentry_list(&dispose);
diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index 98236d0..f0435da 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -132,8 +132,8 @@ static unsigned long gfs2_qd_shrink_scan(struct shrinker *shrink,
 	if (!(sc->gfp_mask & __GFP_FS))
 		return SHRINK_STOP;
 
-	freed = list_lru_walk_node(&gfs2_qd_lru, sc->nid, gfs2_qd_isolate,
-				   &dispose, &sc->nr_to_scan);
+	freed = list_lru_walk(&gfs2_qd_lru, sc, gfs2_qd_isolate,
+			      &dispose, &sc->nr_to_scan);
 
 	gfs2_qd_dispose(&dispose);
 
@@ -143,7 +143,7 @@ static unsigned long gfs2_qd_shrink_scan(struct shrinker *shrink,
 static unsigned long gfs2_qd_shrink_count(struct shrinker *shrink,
 					  struct shrink_control *sc)
 {
-	return vfs_pressure_ratio(list_lru_count_node(&gfs2_qd_lru, sc->nid));
+	return vfs_pressure_ratio(list_lru_count(&gfs2_qd_lru, sc));
 }
 
 struct shrinker gfs2_qd_shrinker = {
@@ -1504,7 +1504,7 @@ static int gfs2_quota_get_xstate(struct super_block *sb,
 	}
 	fqs->qs_uquota.qfs_nextents = 1; /* unsupported */
 	fqs->qs_gquota = fqs->qs_uquota; /* its the same inode in both cases */
-	fqs->qs_incoredqs = list_lru_count(&gfs2_qd_lru);
+	fqs->qs_incoredqs = list_lru_count(&gfs2_qd_lru, NULL);
 	return 0;
 }
 
diff --git a/fs/inode.c b/fs/inode.c
index 4bcdad3..7c6eda2 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -748,14 +748,14 @@ inode_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg)
  * to trim from the LRU. Inodes to be freed are moved to a temporary list and
  * then are freed outside inode_lock by dispose_list().
  */
-long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan,
-		     int nid)
+long prune_icache_sb(struct super_block *sb, struct shrink_control *sc,
+		     unsigned long nr_to_scan)
 {
 	LIST_HEAD(freeable);
 	long freed;
 
-	freed = list_lru_walk_node(&sb->s_inode_lru, nid, inode_lru_isolate,
-				       &freeable, &nr_to_scan);
+	freed = list_lru_walk(&sb->s_inode_lru, sc, inode_lru_isolate,
+			      &freeable, &nr_to_scan);
 	dispose_list(&freeable);
 	return freed;
 }
diff --git a/fs/internal.h b/fs/internal.h
index 4657424..a6a9627 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -14,6 +14,7 @@ struct file_system_type;
 struct linux_binprm;
 struct path;
 struct mount;
+struct shrink_control;
 
 /*
  * block_dev.c
@@ -107,8 +108,8 @@ extern int open_check_o_direct(struct file *f);
  * inode.c
  */
 extern spinlock_t inode_sb_list_lock;
-extern long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan,
-			    int nid);
+extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc,
+			    unsigned long nr_to_scan);
 extern void inode_add_lru(struct inode *inode);
 
 /*
@@ -125,8 +126,8 @@ extern int invalidate_inodes(struct super_block *, bool);
  */
 extern struct dentry *__d_alloc(struct super_block *, const struct qstr *);
 extern int d_set_mounted(struct dentry *dentry);
-extern long prune_dcache_sb(struct super_block *sb, unsigned long nr_to_scan,
-			    int nid);
+extern long prune_dcache_sb(struct super_block *sb, struct shrink_control *sc,
+			    unsigned long nr_to_scan);
 
 /*
  * read_write.c
diff --git a/fs/super.c b/fs/super.c
index e5f6c2c..a039dba 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -78,8 +78,8 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 	if (sb->s_op->nr_cached_objects)
 		fs_objects = sb->s_op->nr_cached_objects(sb, sc->nid);
 
-	inodes = list_lru_count_node(&sb->s_inode_lru, sc->nid);
-	dentries = list_lru_count_node(&sb->s_dentry_lru, sc->nid);
+	inodes = list_lru_count(&sb->s_inode_lru, sc);
+	dentries = list_lru_count(&sb->s_dentry_lru, sc);
 	total_objects = dentries + inodes + fs_objects + 1;
 
 	/* proportion the scan between the caches */
@@ -90,8 +90,8 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 	 * prune the dcache first as the icache is pinned by it, then
 	 * prune the icache, followed by the filesystem specific caches
 	 */
-	freed = prune_dcache_sb(sb, dentries, sc->nid);
-	freed += prune_icache_sb(sb, inodes, sc->nid);
+	freed = prune_dcache_sb(sb, sc, dentries);
+	freed += prune_icache_sb(sb, sc, inodes);
 
 	if (fs_objects) {
 		fs_objects = mult_frac(sc->nr_to_scan, fs_objects,
@@ -119,10 +119,8 @@ static unsigned long super_cache_count(struct shrinker *shrink,
 		total_objects = sb->s_op->nr_cached_objects(sb,
 						 sc->nid);
 
-	total_objects += list_lru_count_node(&sb->s_dentry_lru,
-						 sc->nid);
-	total_objects += list_lru_count_node(&sb->s_inode_lru,
-						 sc->nid);
+	total_objects += list_lru_count(&sb->s_dentry_lru, sc);
+	total_objects += list_lru_count(&sb->s_inode_lru, sc);
 
 	total_objects = vfs_pressure_ratio(total_objects);
 	drop_super(sb);
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index c7f0b77..5b2a49c 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1508,9 +1508,11 @@ xfs_wait_buftarg(
 	int loop = 0;
 
 	/* loop until there is nothing left on the lru list. */
-	while (list_lru_count(&btp->bt_lru)) {
-		list_lru_walk(&btp->bt_lru, xfs_buftarg_wait_rele,
-			      &dispose, LONG_MAX);
+	while (list_lru_count(&btp->bt_lru, NULL)) {
+		unsigned long nr_to_scan = ULONG_MAX;
+
+		list_lru_walk(&btp->bt_lru, NULL, xfs_buftarg_wait_rele,
+			      &dispose, &nr_to_scan);
 
 		while (!list_empty(&dispose)) {
 			struct xfs_buf *bp;
@@ -1565,8 +1567,8 @@ xfs_buftarg_shrink_scan(
 	unsigned long		freed;
 	unsigned long		nr_to_scan = sc->nr_to_scan;
 
-	freed = list_lru_walk_node(&btp->bt_lru, sc->nid, xfs_buftarg_isolate,
-				       &dispose, &nr_to_scan);
+	freed = list_lru_walk(&btp->bt_lru, sc, xfs_buftarg_isolate,
+			      &dispose, &nr_to_scan);
 
 	while (!list_empty(&dispose)) {
 		struct xfs_buf *bp;
@@ -1585,7 +1587,7 @@ xfs_buftarg_shrink_count(
 {
 	struct xfs_buftarg	*btp = container_of(shrink,
 					struct xfs_buftarg, bt_shrinker);
-	return list_lru_count_node(&btp->bt_lru, sc->nid);
+	return list_lru_count(&btp->bt_lru, sc);
 }
 
 void
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index 14a4996..aaacf8f 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -769,8 +769,8 @@ xfs_qm_shrink_scan(
 	INIT_LIST_HEAD(&isol.buffers);
 	INIT_LIST_HEAD(&isol.dispose);
 
-	freed = list_lru_walk_node(&qi->qi_lru, sc->nid, xfs_qm_dquot_isolate, &isol,
-					&nr_to_scan);
+	freed = list_lru_walk(&qi->qi_lru, sc, xfs_qm_dquot_isolate, &isol,
+			      &nr_to_scan);
 
 	error = xfs_buf_delwri_submit(&isol.buffers);
 	if (error)
@@ -795,7 +795,7 @@ xfs_qm_shrink_count(
 	struct xfs_quotainfo	*qi = container_of(shrink,
 					struct xfs_quotainfo, qi_shrinker);
 
-	return list_lru_count_node(&qi->qi_lru, sc->nid);
+	return list_lru_count(&qi->qi_lru, sc);
 }
 
 /*
diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 3ce5417..34e57af 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -10,6 +10,8 @@
 #include <linux/list.h>
 #include <linux/nodemask.h>
 
+struct shrink_control;
+
 /* list_lru_walk_cb has to always return one of those */
 enum lru_status {
 	LRU_REMOVED,		/* item removed from list */
@@ -66,32 +68,22 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item);
 bool list_lru_del(struct list_lru *lru, struct list_head *item);
 
 /**
- * list_lru_count_node: return the number of objects currently held by @lru
+ * list_lru_count: return the number of objects currently held by @lru
  * @lru: the lru pointer.
- * @nid: the node id to count from.
+ * @sc: if not NULL, count only from node @sc->nid.
  *
  * Always return a non-negative number, 0 for empty lists. There is no
  * guarantee that the list is not updated while the count is being computed.
  * Callers that want such a guarantee need to provide an outer lock.
  */
-unsigned long list_lru_count_node(struct list_lru *lru, int nid);
-static inline unsigned long list_lru_count(struct list_lru *lru)
-{
-	long count = 0;
-	int nid;
-
-	for_each_node_mask(nid, lru->active_nodes)
-		count += list_lru_count_node(lru, nid);
-
-	return count;
-}
+unsigned long list_lru_count(struct list_lru *lru, struct shrink_control *sc);
 
 typedef enum lru_status
 (*list_lru_walk_cb)(struct list_head *item, spinlock_t *lock, void *cb_arg);
 /**
- * list_lru_walk_node: walk a list_lru, isolating and disposing freeable items.
+ * list_lru_walk: walk a list_lru, isolating and disposing freeable items.
  * @lru: the lru pointer.
- * @nid: the node id to scan from.
+ * @sc: if not NULL, scan only from node @sc->nid.
  * @isolate: callback function that is resposible for deciding what to do with
  *  the item currently being scanned
  * @cb_arg: opaque type that will be passed to @isolate
@@ -109,23 +101,8 @@ typedef enum lru_status
  *
  * Return value: the number of objects effectively removed from the LRU.
  */
-unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
-				 list_lru_walk_cb isolate, void *cb_arg,
-				 unsigned long *nr_to_walk);
-
-static inline unsigned long
-list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
-	      void *cb_arg, unsigned long nr_to_walk)
-{
-	long isolated = 0;
-	int nid;
+unsigned long list_lru_walk(struct list_lru *lru, struct shrink_control *sc,
+			    list_lru_walk_cb isolate, void *cb_arg,
+			    unsigned long *nr_to_walk);
 
-	for_each_node_mask(nid, lru->active_nodes) {
-		isolated += list_lru_walk_node(lru, nid, isolate,
-					       cb_arg, &nr_to_walk);
-		if (nr_to_walk <= 0)
-			break;
-	}
-	return isolated;
-}
 #endif /* _LRU_LIST_H */
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 72f9dec..7d4a9c2 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -7,8 +7,9 @@
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/mm.h>
-#include <linux/list_lru.h>
 #include <linux/slab.h>
+#include <linux/shrinker.h>
+#include <linux/list_lru.h>
 
 bool list_lru_add(struct list_lru *lru, struct list_head *item)
 {
@@ -48,7 +49,7 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item)
 }
 EXPORT_SYMBOL_GPL(list_lru_del);
 
-unsigned long
+static unsigned long
 list_lru_count_node(struct list_lru *lru, int nid)
 {
 	unsigned long count = 0;
@@ -61,9 +62,23 @@ list_lru_count_node(struct list_lru *lru, int nid)
 
 	return count;
 }
-EXPORT_SYMBOL_GPL(list_lru_count_node);
 
-unsigned long
+unsigned long list_lru_count(struct list_lru *lru, struct shrink_control *sc)
+{
+	long count = 0;
+	int nid;
+
+	if (sc)
+		return list_lru_count_node(lru, sc->nid);
+
+	for_each_node_mask(nid, lru->active_nodes)
+		count += list_lru_count_node(lru, nid);
+
+	return count;
+}
+EXPORT_SYMBOL_GPL(list_lru_count);
+
+static unsigned long
 list_lru_walk_node(struct list_lru *lru, int nid, list_lru_walk_cb isolate,
 		   void *cb_arg, unsigned long *nr_to_walk)
 {
@@ -112,7 +127,27 @@ restart:
 	spin_unlock(&nlru->lock);
 	return isolated;
 }
-EXPORT_SYMBOL_GPL(list_lru_walk_node);
+
+unsigned long list_lru_walk(struct list_lru *lru, struct shrink_control *sc,
+			    list_lru_walk_cb isolate, void *cb_arg,
+			    unsigned long *nr_to_walk)
+{
+	long isolated = 0;
+	int nid;
+
+	if (sc)
+		return list_lru_walk_node(lru, sc->nid, isolate,
+					  cb_arg, nr_to_walk);
+
+	for_each_node_mask(nid, lru->active_nodes) {
+		isolated += list_lru_walk_node(lru, nid, isolate,
+					       cb_arg, nr_to_walk);
+		if (*nr_to_walk <= 0)
+			break;
+	}
+	return isolated;
+}
+EXPORT_SYMBOL_GPL(list_lru_walk);
 
 int list_lru_init(struct list_lru *lru)
 {
-- 
1.7.10.4


* [PATCH v13 09/16] fs: consolidate {nr,free}_cached_objects args in shrink_control
  2013-12-09  8:05 ` Vladimir Davydov
@ 2013-12-09  8:05   ` Vladimir Davydov
  0 siblings, 0 replies; 81+ messages in thread
From: Vladimir Davydov @ 2013-12-09  8:05 UTC (permalink / raw)
  To: dchinner, hannes, mhocko, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, glommer,
	vdavydov, Al Viro

We are going to make the FS shrinker memcg-aware. To achieve that, we
will have to pass the memcg to scan to the nr_cached_objects and
free_cached_objects VFS methods, which currently take only the NUMA node
to scan. Since the shrink_control structure already holds the node, and
the memcg to scan will be added to it as we introduce memcg-aware
vmscan, let us consolidate the methods' arguments in this structure to
keep things clean.

Thanks to David Chinner for the tip.
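
For reference, a filesystem implementing these hooks with the new
signatures would look roughly as follows (the myfs_* names and helpers
are hypothetical; the prototypes match the xfs conversion below):

  static long myfs_nr_cached_objects(struct super_block *sb,
                                     struct shrink_control *sc)
  {
          /* the node to scan (and later the memcg) comes in via *sc */
          return myfs_count_reclaimable(sb, sc->nid);
  }

  static long myfs_free_cached_objects(struct super_block *sb,
                                       struct shrink_control *sc,
                                       long nr_to_scan)
  {
          return myfs_reclaim(sb, sc->nid, nr_to_scan);
  }

  static const struct super_operations myfs_sops = {
          .nr_cached_objects	= myfs_nr_cached_objects,
          .free_cached_objects	= myfs_free_cached_objects,
  };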

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/super.c         |    8 +++-----
 fs/xfs/xfs_super.c |    6 +++---
 include/linux/fs.h |    6 ++++--
 3 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index a039dba..8f9a81b 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -76,7 +76,7 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 		return SHRINK_STOP;
 
 	if (sb->s_op->nr_cached_objects)
-		fs_objects = sb->s_op->nr_cached_objects(sb, sc->nid);
+		fs_objects = sb->s_op->nr_cached_objects(sb, sc);
 
 	inodes = list_lru_count(&sb->s_inode_lru, sc);
 	dentries = list_lru_count(&sb->s_dentry_lru, sc);
@@ -96,8 +96,7 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 	if (fs_objects) {
 		fs_objects = mult_frac(sc->nr_to_scan, fs_objects,
 								total_objects);
-		freed += sb->s_op->free_cached_objects(sb, fs_objects,
-						       sc->nid);
+		freed += sb->s_op->free_cached_objects(sb, sc, fs_objects);
 	}
 
 	drop_super(sb);
@@ -116,8 +115,7 @@ static unsigned long super_cache_count(struct shrinker *shrink,
 		return 0;
 
 	if (sb->s_op && sb->s_op->nr_cached_objects)
-		total_objects = sb->s_op->nr_cached_objects(sb,
-						 sc->nid);
+		total_objects = sb->s_op->nr_cached_objects(sb, sc);
 
 	total_objects += list_lru_count(&sb->s_dentry_lru, sc);
 	total_objects += list_lru_count(&sb->s_inode_lru, sc);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index f317488..06d155d 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1524,7 +1524,7 @@ xfs_fs_mount(
 static long
 xfs_fs_nr_cached_objects(
 	struct super_block	*sb,
-	int			nid)
+	struct shrink_control	*sc)
 {
 	return xfs_reclaim_inodes_count(XFS_M(sb));
 }
@@ -1532,8 +1532,8 @@ xfs_fs_nr_cached_objects(
 static long
 xfs_fs_free_cached_objects(
 	struct super_block	*sb,
-	long			nr_to_scan,
-	int			nid)
+	struct shrink_control	*sc,
+	long			nr_to_scan)
 {
 	return xfs_reclaim_inodes_nr(XFS_M(sb), nr_to_scan);
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 121f11f..b367d54 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1619,8 +1619,10 @@ struct super_operations {
 	ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
 #endif
 	int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
-	long (*nr_cached_objects)(struct super_block *, int);
-	long (*free_cached_objects)(struct super_block *, long, int);
+	long (*nr_cached_objects)(struct super_block *,
+				  struct shrink_control *);
+	long (*free_cached_objects)(struct super_block *,
+				    struct shrink_control *, long);
 };
 
 /*
-- 
1.7.10.4


* [PATCH v13 10/16] vmscan: shrink slab on memcg pressure
  2013-12-09  8:05 ` Vladimir Davydov
@ 2013-12-09  8:05   ` Vladimir Davydov
  0 siblings, 0 replies; 81+ messages in thread
From: Vladimir Davydov @ 2013-12-09  8:05 UTC (permalink / raw)
  To: dchinner, hannes, mhocko, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, glommer,
	vdavydov, Mel Gorman, Rik van Riel, Al Viro, Balbir Singh,
	KAMEZAWA Hiroyuki

This patch makes the direct reclaim path shrink slab not only on global
memory pressure, but also when we hit the user memory limit of a memcg.
To achieve that, it makes shrink_slab() walk over the memcg hierarchy
and run shrinkers marked as memcg-aware on the target memcg and all its
descendants. The memcg to scan is passed in a shrink_control structure;
memcg-unaware shrinkers are still called only on global memory
pressure, with memcg=NULL. It is up to each shrinker how to organize
the objects it is responsible for so that per-memcg reclaim works.

Both the idea behind this patch and its initial implementation belong
to Glauber Costa.
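
For reference, a memcg-aware shrinker would advertise the new flag and
key its counts off shrink_control::memcg, roughly as in the sketch
below (my_memcg_count, my_scan and the per-memcg bookkeeping are
hypothetical):

  static unsigned long my_count(struct shrinker *s,
                                struct shrink_control *sc)
  {
          /*
           * sc->memcg is the memcg currently being shrunk, or NULL on
           * global pressure; sc->nid selects the node as before.
           */
          return my_memcg_count(sc->memcg, sc->nid);
  }

  static struct shrinker my_shrinker = {
          .count_objects	= my_count,
          .scan_objects	= my_scan,	/* walks the LRUs of sc->memcg */
          .seeks		= DEFAULT_SEEKS,
          .flags		= SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE,
  };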

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h |   22 ++++++++++
 include/linux/shrinker.h   |   10 ++++-
 mm/memcontrol.c            |   37 +++++++++++++++-
 mm/vmscan.c                |  103 +++++++++++++++++++++++++++++++++-----------
 4 files changed, 146 insertions(+), 26 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b3e7a66..c0f24a9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -80,6 +80,9 @@ extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
 
+unsigned long mem_cgroup_zone_reclaimable_pages(struct zone *,
+						struct mem_cgroup *);
+
 /* For coalescing uncharge for reducing memcg' overhead*/
 extern void mem_cgroup_uncharge_start(void);
 extern void mem_cgroup_uncharge_end(void);
@@ -289,6 +292,12 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
 	return &zone->lruvec;
 }
 
+static inline unsigned long mem_cgroup_zone_reclaimable_pages(struct zone *zone,
+							struct mem_cgroup *memcg)
+{
+	return 0;
+}
+
 static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 {
 	return NULL;
@@ -479,6 +488,9 @@ static inline bool memcg_kmem_enabled(void)
 	return static_key_false(&memcg_kmem_enabled_key);
 }
 
+bool memcg_kmem_is_active(struct mem_cgroup *memcg);
+bool memcg_kmem_should_reclaim(struct mem_cgroup *memcg);
+
 /*
  * In general, we'll do everything in our power to not incur in any overhead
  * for non-memcg users for the kmem functions. Not even a function call, if we
@@ -620,6 +632,16 @@ static inline bool memcg_kmem_enabled(void)
 	return false;
 }
 
+static inline bool memcg_kmem_is_active(struct mem_cgroup *memcg)
+{
+	return false;
+}
+
+static inline bool memcg_kmem_should_reclaim(struct mem_cgroup *memcg)
+{
+	return false;
+}
+
 static inline bool
 memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
 {
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 68c0970..ab79b17 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -20,8 +20,15 @@ struct shrink_control {
 
 	/* shrink from these nodes */
 	nodemask_t nodes_to_scan;
+
+	/* shrink from this memory cgroup hierarchy (if not NULL) */
+	struct mem_cgroup *target_mem_cgroup;
+
 	/* current node being shrunk (for NUMA aware shrinkers) */
 	int nid;
+
+	/* current memcg being shrunk (for memcg aware shrinkers) */
+	struct mem_cgroup *memcg;
 };
 
 #define SHRINK_STOP (~0UL)
@@ -63,7 +70,8 @@ struct shrinker {
 #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
 
 /* Flags */
-#define SHRINKER_NUMA_AWARE (1 << 0)
+#define SHRINKER_NUMA_AWARE	(1 << 0)
+#define SHRINKER_MEMCG_AWARE	(1 << 1)
 
 extern int register_shrinker(struct shrinker *);
 extern void unregister_shrinker(struct shrinker *);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 220b463..a3f479b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -358,7 +358,7 @@ static inline void memcg_kmem_set_active(struct mem_cgroup *memcg)
 	set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
 }
 
-static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
+bool memcg_kmem_is_active(struct mem_cgroup *memcg)
 {
 	return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
 }
@@ -1333,6 +1333,26 @@ out:
 	return lruvec;
 }
 
+unsigned long mem_cgroup_zone_reclaimable_pages(struct zone *zone,
+						struct mem_cgroup *memcg)
+{
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+	unsigned long nr = 0;
+	struct mem_cgroup *iter;
+
+	iter = memcg;
+	do {
+		nr += mem_cgroup_zone_nr_lru_pages(iter, nid, zid,
+						   LRU_ALL_FILE);
+		if (do_swap_account)
+			nr += mem_cgroup_zone_nr_lru_pages(iter, nid, zid,
+							   LRU_ALL_ANON);
+		iter = mem_cgroup_iter(memcg, iter, NULL);
+	} while (iter);
+	return nr;
+}
+
 /*
  * Following LRU functions are allowed to be used without PCG_LOCK.
  * Operations are called by routine of global LRU independently from memcg.
@@ -2959,6 +2979,21 @@ static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
 		(memcg->kmem_account_flags & KMEM_ACCOUNTED_MASK);
 }
 
+bool memcg_kmem_should_reclaim(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *iter;
+
+	iter = memcg;
+	do {
+		if (memcg_kmem_is_active(iter)) {
+			mem_cgroup_iter_break(memcg, iter);
+			return true;
+		}
+		iter = mem_cgroup_iter(memcg, iter, NULL);
+	} while (iter);
+	return false;
+}
+
 /*
  * helper for acessing a memcg's index. It will be used as an index in the
  * child cache array in kmem_cache, and also to derive its name. This function
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d98f272..1997813 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -311,6 +311,58 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
 	return freed;
 }
 
+static unsigned long
+run_shrinker(struct shrink_control *shrinkctl, struct shrinker *shrinker,
+	     unsigned long nr_pages_scanned, unsigned long lru_pages)
+{
+	unsigned long freed = 0;
+
+	/*
+	 * If we don't have a target mem cgroup, we scan them all. Otherwise
+	 * we will limit our scan to shrinkers marked as memcg aware.
+	 */
+	if (!(shrinker->flags & SHRINKER_MEMCG_AWARE) &&
+	    shrinkctl->target_mem_cgroup != NULL)
+		return 0;
+
+	/*
+	 * In a hierarchical chain, it might be that not all memcgs are kmem
+	 * active. kmemcg design mandates that when one memcg is active, its
+	 * children will be active as well. But it is perfectly possible that
+	 * its parent is not.
+	 *
+	 * We also need to make sure we scan at least once, for the global
+	 * case. So if we don't have a target memcg, we proceed normally and
+	 * expect to break in the next round.
+	 */
+	shrinkctl->memcg = shrinkctl->target_mem_cgroup;
+	do {
+		if (shrinkctl->memcg && !memcg_kmem_is_active(shrinkctl->memcg))
+			goto next;
+
+		if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) {
+			shrinkctl->nid = 0;
+			freed += shrink_slab_node(shrinkctl, shrinker,
+					nr_pages_scanned, lru_pages);
+			goto next;
+		}
+
+		for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
+			if (node_online(shrinkctl->nid))
+				freed += shrink_slab_node(shrinkctl, shrinker,
+						nr_pages_scanned, lru_pages);
+
+		}
+next:
+		if (!(shrinker->flags & SHRINKER_MEMCG_AWARE))
+			break;
+		shrinkctl->memcg = mem_cgroup_iter(shrinkctl->target_mem_cgroup,
+						   shrinkctl->memcg, NULL);
+	} while (shrinkctl->memcg);
+
+	return freed;
+}
+
 /*
  * Call the shrink functions to age shrinkable caches
  *
@@ -352,20 +404,10 @@ unsigned long shrink_slab(struct shrink_control *shrinkctl,
 	}
 
 	list_for_each_entry(shrinker, &shrinker_list, list) {
-		if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) {
-			shrinkctl->nid = 0;
-			freed += shrink_slab_node(shrinkctl, shrinker,
-					nr_pages_scanned, lru_pages);
-			continue;
-		}
-
-		for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
-			if (node_online(shrinkctl->nid))
-				freed += shrink_slab_node(shrinkctl, shrinker,
-						nr_pages_scanned, lru_pages);
-
-		}
+		freed += run_shrinker(shrinkctl, shrinker,
+				      nr_pages_scanned, lru_pages);
 	}
+
 	up_read(&shrinker_rwsem);
 out:
 	cond_resched();
@@ -2286,6 +2328,7 @@ static bool shrink_zones(struct zonelist *zonelist,
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	struct shrink_control shrink = {
 		.gfp_mask = sc->gfp_mask,
+		.target_mem_cgroup = sc->target_mem_cgroup,
 	};
 
 	/*
@@ -2302,17 +2345,22 @@ static bool shrink_zones(struct zonelist *zonelist,
 					gfp_zone(sc->gfp_mask), sc->nodemask) {
 		if (!populated_zone(zone))
 			continue;
+
+		if (global_reclaim(sc) &&
+		    !cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
+			continue;
+
+		lru_pages += global_reclaim(sc) ?
+				zone_reclaimable_pages(zone) :
+				mem_cgroup_zone_reclaimable_pages(zone,
+						sc->target_mem_cgroup);
+		node_set(zone_to_nid(zone), shrink.nodes_to_scan);
+
 		/*
 		 * Take care memory controller reclaiming has small influence
 		 * to global LRU.
 		 */
 		if (global_reclaim(sc)) {
-			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
-				continue;
-
-			lru_pages += zone_reclaimable_pages(zone);
-			node_set(zone_to_nid(zone), shrink.nodes_to_scan);
-
 			if (sc->priority != DEF_PRIORITY &&
 			    !zone_reclaimable(zone))
 				continue;	/* Let kswapd poll it */
@@ -2350,12 +2398,11 @@ static bool shrink_zones(struct zonelist *zonelist,
 	}
 
 	/*
-	 * Don't shrink slabs when reclaiming memory from over limit
-	 * cgroups but do shrink slab at least once when aborting
-	 * reclaim for compaction to avoid unevenly scanning file/anon
-	 * LRU pages over slab pages.
+	 * Shrink slabs at least once when aborting reclaim for compaction
+	 * to avoid unevenly scanning file/anon LRU pages over slab pages.
 	 */
-	if (global_reclaim(sc)) {
+	if (global_reclaim(sc) ||
+	    memcg_kmem_should_reclaim(sc->target_mem_cgroup)) {
 		shrink_slab(&shrink, sc->nr_scanned, lru_pages);
 		if (reclaim_state) {
 			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
@@ -2649,6 +2696,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 	struct zonelist *zonelist;
 	unsigned long nr_reclaimed;
 	int nid;
+	struct reclaim_state reclaim_state;
 	struct scan_control sc = {
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
@@ -2671,6 +2719,10 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 
 	zonelist = NODE_DATA(nid)->node_zonelists;
 
+	lockdep_set_current_reclaim_state(sc.gfp_mask);
+	reclaim_state.reclaimed_slab = 0;
+	current->reclaim_state = &reclaim_state;
+
 	trace_mm_vmscan_memcg_reclaim_begin(0,
 					    sc.may_writepage,
 					    sc.gfp_mask);
@@ -2679,6 +2731,9 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 
 	trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
 
+	current->reclaim_state = NULL;
+	lockdep_clear_current_reclaim_state();
+
 	return nr_reclaimed;
 }
 #endif
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH v13 11/16] mm: list_lru: add per-memcg lists
  2013-12-09  8:05 ` Vladimir Davydov
@ 2013-12-09  8:05   ` Vladimir Davydov
  -1 siblings, 0 replies; 81+ messages in thread
From: Vladimir Davydov @ 2013-12-09  8:05 UTC (permalink / raw)
  To: dchinner, hannes, mhocko, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, glommer,
	vdavydov, Al Viro, Balbir Singh, KAMEZAWA Hiroyuki

There are several FS shrinkers, including super_block::s_shrink, that
keep reclaimable objects in the list_lru structure. Hence, to turn them
into memcg-aware shrinkers, it is enough to make list_lru per-memcg.

This patch does the trick. It adds an array of LRU lists to the list_lru
structure, one for each kmem-active memcg, and dispatches every item
addition or removal operation to the list corresponding to the memcg the
item is accounted to.

Since we already pass a shrink_control object to the list_lru count and
walk functions to specify the NUMA node to scan, and the target memcg is
held in that structure, there is no need to change the list_lru
interface.

The idea behind the patch, as well as the initial implementation,
belongs to Glauber Costa.
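
As a minimal illustration (this snippet is not part of the patch), a
list_lru based cache keeps using the interface exactly as before; the
per-memcg dispatch happens inside list_lru. The names example_lru,
example_object, example_cache_object and example_count are hypothetical;
list_lru_add(), list_lru_count() and struct shrink_control are the
interface used throughout this series:

	static struct list_lru example_lru;

	/* Insertion is unchanged: list_lru internally picks the LRU of
	 * the memcg the object's page is accounted to, or the global
	 * LRU if the page is not accounted to a kmem-active memcg. */
	static void example_cache_object(struct example_object *obj)
	{
		list_lru_add(&example_lru, &obj->lru_node);
	}

	/* A shrinker's count callback just forwards the shrink_control:
	 * list_lru_count() then counts only objects of sc->memcg on
	 * node sc->nid. */
	static unsigned long
	example_count(struct shrinker *shrink, struct shrink_control *sc)
	{
		return list_lru_count(&example_lru, sc);
	}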

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/list_lru.h   |   44 +++++++-
 include/linux/memcontrol.h |   13 +++
 mm/list_lru.c              |  242 ++++++++++++++++++++++++++++++++++++++------
 mm/memcontrol.c            |  158 ++++++++++++++++++++++++++++-
 4 files changed, 416 insertions(+), 41 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 34e57af..e8add3d 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -28,11 +28,47 @@ struct list_lru_node {
 	long			nr_items;
 } ____cacheline_aligned_in_smp;
 
+struct list_lru_one {
+	struct list_lru_node *node;
+	nodemask_t active_nodes;
+};
+
 struct list_lru {
-	struct list_lru_node	*node;
-	nodemask_t		active_nodes;
+	struct list_lru_one	global;
+#ifdef CONFIG_MEMCG_KMEM
+	/*
+	 * In order to provide ability of scanning objects from different
+	 * memory cgroups independently, we keep a separate LRU list for each
+	 * kmem-active memcg in this array. The array is RCU-protected and
+	 * indexed by memcg_cache_id().
+	 */
+	struct list_lru_one	**memcg;
+	/*
+	 * Every time a kmem-active memcg is created or destroyed, we have to
+	 * update the array of per-memcg LRUs in each list_lru structure. To
+	 * achieve that, we keep all list_lru structures in the all_memcg_lrus
+	 * list.
+	 */
+	struct list_head	list;
+	/*
+	 * Since the array of per-memcg LRUs is RCU-protected, we can only free
+	 * it after a call to synchronize_rcu(). To avoid multiple calls to
+	 * synchronize_rcu() when a lot of LRUs get updated at the same time,
+	 * which is a typical scenario, we will store the pointer to the
+	 * previous version of the array in the memcg_old field for each
+	 * list_lru structure, and then free them all at once after a single
+	 * call to synchronize_rcu().
+	 */
+	void *memcg_old;
+#endif /* CONFIG_MEMCG_KMEM */
 };
 
+#ifdef CONFIG_MEMCG_KMEM
+int list_lru_memcg_alloc(struct list_lru *lru, int memcg_id);
+void list_lru_memcg_free(struct list_lru *lru, int memcg_id);
+int list_lru_grow_memcg(struct list_lru *lru, size_t new_array_size);
+#endif
+
 void list_lru_destroy(struct list_lru *lru);
 int list_lru_init(struct list_lru *lru);
 
@@ -70,7 +106,7 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item);
 /**
  * list_lru_count: return the number of objects currently held by @lru
  * @lru: the lru pointer.
- * @sc: if not NULL, count only from node @sc->nid.
+ * @sc: if not NULL, count only from node @sc->nid and memcg @sc->memcg.
  *
  * Always return a non-negative number, 0 for empty lists. There is no
  * guarantee that the list is not updated while the count is being computed.
@@ -83,7 +119,7 @@ typedef enum lru_status
 /**
  * list_lru_walk: walk a list_lru, isolating and disposing freeable items.
  * @lru: the lru pointer.
- * @sc: if not NULL, scan only from node @sc->nid.
+ * @sc: if not NULL, scan only from node @sc->nid and memcg @sc->memcg.
  * @isolate: callback function that is resposible for deciding what to do with
  *  the item currently being scanned
  * @cb_arg: opaque type that will be passed to @isolate
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c0f24a9..4c88d72 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -29,6 +29,7 @@ struct page_cgroup;
 struct page;
 struct mm_struct;
 struct kmem_cache;
+struct list_lru;
 
 /*
  * The corresponding mem_cgroup_stat_names is defined in mm/memcontrol.c,
@@ -523,6 +524,9 @@ __memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);
 void mem_cgroup_destroy_cache(struct kmem_cache *cachep);
 void kmem_cache_destroy_memcg_children(struct kmem_cache *s);
 
+int memcg_list_lru_init(struct list_lru *lru);
+void memcg_list_lru_destroy(struct list_lru *lru);
+
 /**
  * memcg_kmem_newpage_charge: verify if a new kmem allocation is allowed.
  * @gfp: the gfp allocation flags.
@@ -687,6 +691,15 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
 static inline void kmem_cache_destroy_memcg_children(struct kmem_cache *s)
 {
 }
+
+static inline int memcg_list_lru_init(struct list_lru *lru)
+{
+	return 0;
+}
+
+static inline void memcg_list_lru_destroy(struct list_lru *lru)
+{
+}
 #endif /* CONFIG_MEMCG_KMEM */
 #endif /* _LINUX_MEMCONTROL_H */
 
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 7d4a9c2..1c0f39a 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -9,19 +9,70 @@
 #include <linux/mm.h>
 #include <linux/slab.h>
 #include <linux/shrinker.h>
+#include <linux/memcontrol.h>
+#include <linux/page_cgroup.h>
 #include <linux/list_lru.h>
 
+#ifdef CONFIG_MEMCG_KMEM
+static struct list_lru_one *lru_of_index(struct list_lru *lru,
+					 int memcg_id)
+{
+	struct list_lru_one **memcg_lrus;
+	struct list_lru_one *olru;
+
+	if (memcg_id < 0)
+		return &lru->global;
+
+	rcu_read_lock();
+	memcg_lrus = rcu_dereference(lru->memcg);
+	olru = memcg_lrus[memcg_id];
+	rcu_read_unlock();
+
+	return olru;
+}
+
+static struct list_lru_one *lru_of_page(struct list_lru *lru,
+					struct page *page)
+{
+	struct mem_cgroup *memcg = NULL;
+	struct page_cgroup *pc;
+
+	pc = lookup_page_cgroup(compound_head(page));
+	if (PageCgroupUsed(pc)) {
+		lock_page_cgroup(pc);
+		if (PageCgroupUsed(pc))
+			memcg = pc->mem_cgroup;
+		unlock_page_cgroup(pc);
+	}
+	return lru_of_index(lru, memcg_cache_id(memcg));
+}
+#else
+static inline struct list_lru_one *lru_of_index(struct list_lru *lru,
+						int memcg_id)
+{
+	return &lru->global;
+}
+
+static inline struct list_lru_one *lru_of_page(struct list_lru *lru,
+					       struct page *page)
+{
+	return &lru->global;
+}
+#endif /* CONFIG_MEMCG_KMEM */
+
 bool list_lru_add(struct list_lru *lru, struct list_head *item)
 {
-	int nid = page_to_nid(virt_to_page(item));
-	struct list_lru_node *nlru = &lru->node[nid];
+	struct page *page = virt_to_page(item);
+	int nid = page_to_nid(page);
+	struct list_lru_one *olru = lru_of_page(lru, page);
+	struct list_lru_node *nlru = &olru->node[nid];
 
 	spin_lock(&nlru->lock);
 	WARN_ON_ONCE(nlru->nr_items < 0);
 	if (list_empty(item)) {
 		list_add_tail(item, &nlru->list);
 		if (nlru->nr_items++ == 0)
-			node_set(nid, lru->active_nodes);
+			node_set(nid, olru->active_nodes);
 		spin_unlock(&nlru->lock);
 		return true;
 	}
@@ -32,14 +83,16 @@ EXPORT_SYMBOL_GPL(list_lru_add);
 
 bool list_lru_del(struct list_lru *lru, struct list_head *item)
 {
-	int nid = page_to_nid(virt_to_page(item));
-	struct list_lru_node *nlru = &lru->node[nid];
+	struct page *page = virt_to_page(item);
+	int nid = page_to_nid(page);
+	struct list_lru_one *olru = lru_of_page(lru, page);
+	struct list_lru_node *nlru = &olru->node[nid];
 
 	spin_lock(&nlru->lock);
 	if (!list_empty(item)) {
 		list_del_init(item);
 		if (--nlru->nr_items == 0)
-			node_clear(nid, lru->active_nodes);
+			node_clear(nid, olru->active_nodes);
 		WARN_ON_ONCE(nlru->nr_items < 0);
 		spin_unlock(&nlru->lock);
 		return true;
@@ -50,10 +103,10 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item)
 EXPORT_SYMBOL_GPL(list_lru_del);
 
 static unsigned long
-list_lru_count_node(struct list_lru *lru, int nid)
+list_lru_count_node(struct list_lru_one *olru, int nid)
 {
 	unsigned long count = 0;
-	struct list_lru_node *nlru = &lru->node[nid];
+	struct list_lru_node *nlru = &olru->node[nid];
 
 	spin_lock(&nlru->lock);
 	WARN_ON_ONCE(nlru->nr_items < 0);
@@ -67,23 +120,36 @@ unsigned long list_lru_count(struct list_lru *lru, struct shrink_control *sc)
 {
 	long count = 0;
 	int nid;
+	struct list_lru_one *olru;
+	struct mem_cgroup *memcg;
 
-	if (sc)
-		return list_lru_count_node(lru, sc->nid);
+	if (sc) {
+		olru = lru_of_index(lru, memcg_cache_id(sc->memcg));
+		return list_lru_count_node(olru, sc->nid);
+	}
+
+	memcg = NULL;
+	do {
+		if (memcg && !memcg_kmem_is_active(memcg))
+			goto next;
 
-	for_each_node_mask(nid, lru->active_nodes)
-		count += list_lru_count_node(lru, nid);
+		olru = lru_of_index(lru, memcg_cache_id(memcg));
+		for_each_node_mask(nid, olru->active_nodes)
+			count += list_lru_count_node(olru, nid);
+next:
+		memcg = mem_cgroup_iter(NULL, memcg, NULL);
+	} while (memcg);
 
 	return count;
 }
 EXPORT_SYMBOL_GPL(list_lru_count);
 
 static unsigned long
-list_lru_walk_node(struct list_lru *lru, int nid, list_lru_walk_cb isolate,
+list_lru_walk_node(struct list_lru_one *olru, int nid, list_lru_walk_cb isolate,
 		   void *cb_arg, unsigned long *nr_to_walk)
 {
 
-	struct list_lru_node	*nlru = &lru->node[nid];
+	struct list_lru_node *nlru = &olru->node[nid];
 	struct list_head *item, *n;
 	unsigned long isolated = 0;
 
@@ -104,7 +170,7 @@ restart:
 		switch (ret) {
 		case LRU_REMOVED:
 			if (--nlru->nr_items == 0)
-				node_clear(nid, lru->active_nodes);
+				node_clear(nid, olru->active_nodes);
 			WARN_ON_ONCE(nlru->nr_items < 0);
 			isolated++;
 			break;
@@ -134,42 +200,154 @@ unsigned long list_lru_walk(struct list_lru *lru, struct shrink_control *sc,
 {
 	long isolated = 0;
 	int nid;
+	struct list_lru_one *olru;
+	struct mem_cgroup *memcg;
 
-	if (sc)
-		return list_lru_walk_node(lru, sc->nid, isolate,
+	if (sc) {
+		olru = lru_of_index(lru, memcg_cache_id(sc->memcg));
+		return list_lru_walk_node(olru, sc->nid, isolate,
 					  cb_arg, nr_to_walk);
-
-	for_each_node_mask(nid, lru->active_nodes) {
-		isolated += list_lru_walk_node(lru, nid, isolate,
-					       cb_arg, nr_to_walk);
-		if (*nr_to_walk <= 0)
-			break;
 	}
+
+	memcg = NULL;
+	do {
+		if (memcg && !memcg_kmem_is_active(memcg))
+			goto next;
+
+		olru = lru_of_index(lru, memcg_cache_id(memcg));
+		for_each_node_mask(nid, olru->active_nodes) {
+			isolated += list_lru_walk_node(olru, nid, isolate,
+						       cb_arg, nr_to_walk);
+			if (*nr_to_walk <= 0) {
+				mem_cgroup_iter_break(NULL, memcg);
+				goto out;
+			}
+		}
+next:
+		memcg = mem_cgroup_iter(NULL, memcg, NULL);
+	} while (memcg);
+out:
 	return isolated;
 }
 EXPORT_SYMBOL_GPL(list_lru_walk);
 
-int list_lru_init(struct list_lru *lru)
+static int list_lru_init_one(struct list_lru_one *olru)
 {
 	int i;
-	size_t size = sizeof(*lru->node) * nr_node_ids;
 
-	lru->node = kzalloc(size, GFP_KERNEL);
-	if (!lru->node)
+	olru->node = kcalloc(nr_node_ids, sizeof(*olru->node), GFP_KERNEL);
+	if (!olru->node)
 		return -ENOMEM;
 
-	nodes_clear(lru->active_nodes);
+	nodes_clear(olru->active_nodes);
 	for (i = 0; i < nr_node_ids; i++) {
-		spin_lock_init(&lru->node[i].lock);
-		INIT_LIST_HEAD(&lru->node[i].list);
-		lru->node[i].nr_items = 0;
+		struct list_lru_node *nlru = &olru->node[i];
+
+		spin_lock_init(&nlru->lock);
+		INIT_LIST_HEAD(&nlru->list);
+		nlru->nr_items = 0;
 	}
 	return 0;
 }
+
+static void list_lru_destroy_one(struct list_lru_one *olru)
+{
+	kfree(olru->node);
+}
+
+int list_lru_init(struct list_lru *lru)
+{
+	int err;
+
+	err = list_lru_init_one(&lru->global);
+	if (err)
+		goto fail;
+
+	err = memcg_list_lru_init(lru);
+	if (err)
+		goto fail;
+
+	return 0;
+fail:
+	list_lru_destroy_one(&lru->global);
+	lru->global.node = NULL; /* see list_lru_destroy() */
+	return err;
+}
 EXPORT_SYMBOL_GPL(list_lru_init);
 
 void list_lru_destroy(struct list_lru *lru)
 {
-	kfree(lru->node);
+	/*
+	 * It is common throughout the kernel source tree to call the
+	 * destructor on a zeroed out object that has not been initialized or
+	 * whose initialization failed, because it greatly simplifies fail
+	 * paths. When the list_lru structure was first implemented, its
+	 * destructor consisted of a single call to kfree() and thus
+	 * conformed to the rule, but as it grew, it became complex enough
+	 * that calling the destructor on an uninitialized object would be a
+	 * bug. To preserve backward compatibility, we explicitly exit the
+	 * destructor if the object seems to be uninitialized.
+	 */
+	if (!lru->global.node)
+		return;
+
+	list_lru_destroy_one(&lru->global);
+	memcg_list_lru_destroy(lru);
 }
 EXPORT_SYMBOL_GPL(list_lru_destroy);
+
+#ifdef CONFIG_MEMCG_KMEM
+int list_lru_memcg_alloc(struct list_lru *lru, int memcg_id)
+{
+	int err;
+	struct list_lru_one *olru;
+
+	olru = kmalloc(sizeof(*olru), GFP_KERNEL);
+	if (!olru)
+		return -ENOMEM;
+
+	err = list_lru_init_one(olru);
+	if (err) {
+		kfree(olru);
+		return err;
+	}
+
+	VM_BUG_ON(lru->memcg[memcg_id]);
+	lru->memcg[memcg_id] = olru;
+	return 0;
+}
+
+void list_lru_memcg_free(struct list_lru *lru, int memcg_id)
+{
+	struct list_lru_one *olru;
+
+	olru = lru->memcg[memcg_id];
+	if (olru) {
+		list_lru_destroy_one(olru);
+		kfree(olru);
+		lru->memcg[memcg_id] = NULL;
+	}
+}
+
+int list_lru_grow_memcg(struct list_lru *lru, size_t new_array_size)
+{
+	int i;
+	struct list_lru_one **memcg_lrus;
+
+	memcg_lrus = kcalloc(new_array_size, sizeof(*memcg_lrus), GFP_KERNEL);
+	if (!memcg_lrus)
+		return -ENOMEM;
+
+	if (lru->memcg) {
+		for_each_memcg_cache_index(i) {
+			if (lru->memcg[i])
+				memcg_lrus[i] = lru->memcg[i];
+		}
+	}
+
+	VM_BUG_ON(lru->memcg_old);
+	lru->memcg_old = lru->memcg;
+	rcu_assign_pointer(lru->memcg, memcg_lrus);
+	return 0;
+}
+#endif /* CONFIG_MEMCG_KMEM */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a3f479b..b15219e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -55,6 +55,7 @@
 #include <linux/cpu.h>
 #include <linux/oom.h>
 #include <linux/lockdep.h>
+#include <linux/list_lru.h>
 #include "internal.h"
 #include <net/sock.h>
 #include <net/ip.h>
@@ -3034,6 +3035,137 @@ static size_t memcg_caches_array_size(int num_groups)
 }
 
 /*
+ * The list of all list_lru structures. Protected by memcg_create_mutex.
+ * Needed for updating all per-memcg LRUs whenever a kmem-enabled memcg is
+ * created or destroyed.
+ */
+static LIST_HEAD(all_memcg_lrus);
+
+static void __memcg_destroy_all_lrus(int memcg_id)
+{
+	struct list_lru *lru;
+
+	list_for_each_entry(lru, &all_memcg_lrus, list)
+		list_lru_memcg_free(lru, memcg_id);
+}
+
+/*
+ * This function is called when a kmem-active memcg is destroyed in order to
+ * free LRUs corresponding to the memcg in all list_lru structures.
+ */
+static void memcg_destroy_all_lrus(struct mem_cgroup *memcg)
+{
+	int memcg_id;
+
+	memcg_id = memcg_cache_id(memcg);
+	if (memcg_id >= 0) {
+		mutex_lock(&memcg_create_mutex);
+		__memcg_destroy_all_lrus(memcg_id);
+		mutex_unlock(&memcg_create_mutex);
+	}
+}
+
+/*
+ * This function allocates LRUs for a memcg in all list_lru structures. It is
+ * called under memcg_create_mutex when a new kmem-active memcg is added.
+ */
+static int memcg_init_all_lrus(int new_memcg_id)
+{
+	int err = 0;
+	int num_memcgs = new_memcg_id + 1;
+	int grow = (num_memcgs > memcg_limited_groups_array_size);
+	size_t new_array_size = memcg_caches_array_size(num_memcgs);
+	struct list_lru *lru;
+
+	if (grow) {
+		list_for_each_entry(lru, &all_memcg_lrus, list) {
+			err = list_lru_grow_memcg(lru, new_array_size);
+			if (err)
+				goto out;
+		}
+	}
+
+	list_for_each_entry(lru, &all_memcg_lrus, list) {
+		err = list_lru_memcg_alloc(lru, new_memcg_id);
+		if (err) {
+			__memcg_destroy_all_lrus(new_memcg_id);
+			break;
+		}
+	}
+out:
+	if (grow) {
+		synchronize_rcu();
+		list_for_each_entry(lru, &all_memcg_lrus, list) {
+			kfree(lru->memcg_old);
+			lru->memcg_old = NULL;
+		}
+	}
+	return err;
+}
+
+int memcg_list_lru_init(struct list_lru *lru)
+{
+	int err = 0;
+	int i;
+	struct mem_cgroup *memcg;
+
+	lru->memcg = NULL;
+	lru->memcg_old = NULL;
+
+	mutex_lock(&memcg_create_mutex);
+	if (!memcg_kmem_enabled())
+		goto out_list_add;
+
+	lru->memcg = kcalloc(memcg_limited_groups_array_size,
+			     sizeof(*lru->memcg), GFP_KERNEL);
+	if (!lru->memcg) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	for_each_mem_cgroup(memcg) {
+		int memcg_id;
+
+		memcg_id = memcg_cache_id(memcg);
+		if (memcg_id < 0)
+			continue;
+
+		err = list_lru_memcg_alloc(lru, memcg_id);
+		if (err) {
+			mem_cgroup_iter_break(NULL, memcg);
+			goto out_free_lru_memcg;
+		}
+	}
+out_list_add:
+	list_add(&lru->list, &all_memcg_lrus);
+out:
+	mutex_unlock(&memcg_create_mutex);
+	return err;
+
+out_free_lru_memcg:
+	for (i = 0; i < memcg_limited_groups_array_size; i++)
+		list_lru_memcg_free(lru, i);
+	kfree(lru->memcg);
+	goto out;
+}
+
+void memcg_list_lru_destroy(struct list_lru *lru)
+{
+	int i, array_size;
+
+	mutex_lock(&memcg_create_mutex);
+	list_del(&lru->list);
+	array_size = memcg_limited_groups_array_size;
+	mutex_unlock(&memcg_create_mutex);
+
+	if (lru->memcg) {
+		for (i = 0; i < array_size; i++)
+			list_lru_memcg_free(lru, i);
+		kfree(lru->memcg);
+	}
+}
+
+/*
  * This is a bit cumbersome, but it is rarely used and avoids a backpointer
  * in the memcg_cache_params struct.
  */
@@ -3164,15 +3296,30 @@ int memcg_update_cache_sizes(struct mem_cgroup *memcg)
 	 */
 	memcg_kmem_set_activated(memcg);
 
+	/*
+	 * We need to init the memcg lru lists before we update the caches.
+	 * Once the caches are updated, they will be able to start hosting
+	 * objects. If a cache is created very quickly and an element is used
+	 * and disposed to the lru quickly as well, we can end up with a NULL
+	 * pointer dereference while trying to add a new element to a memcg
+	 * lru.
+	 */
+	ret = memcg_init_all_lrus(num);
+	if (ret)
+		goto out;
+
 	ret = memcg_update_all_caches(num+1);
-	if (ret) {
-		ida_simple_remove(&kmem_limited_groups, num);
-		memcg_kmem_clear_activated(memcg);
-		return ret;
-	}
+	if (ret)
+		goto out_destroy_all_lrus;
 
 	memcg->kmemcg_id = num;
 	return 0;
+out_destroy_all_lrus:
+	__memcg_destroy_all_lrus(num);
+out:
+	ida_simple_remove(&kmem_limited_groups, num);
+	memcg_kmem_clear_activated(memcg);
+	return ret;
 }
 
 /*
@@ -5955,6 +6102,7 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 static void memcg_destroy_kmem(struct mem_cgroup *memcg)
 {
 	mem_cgroup_sockets_destroy(memcg);
+	memcg_destroy_all_lrus(memcg);
 }
 
 static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH v13 12/16] fs: mark list_lru based shrinkers memcg aware
  2013-12-09  8:05 ` Vladimir Davydov
@ 2013-12-09  8:05   ` Vladimir Davydov
  -1 siblings, 0 replies; 81+ messages in thread
From: Vladimir Davydov @ 2013-12-09  8:05 UTC (permalink / raw)
  To: dchinner, hannes, mhocko, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, glommer,
	vdavydov, Al Viro

Now that list_lru automatically distributes objects among per-memcg
lists, and list_lru_{count,walk} use the information passed in the
shrink_control argument to scan the appropriate list, all shrinkers that
keep their objects in a list_lru structure can already work as
memcg-aware. Let us mark them so.
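
For illustration only (not part of the patch), this is roughly what a
memcg-aware registration looks like. The names example_shrinker,
example_count and example_scan are hypothetical; the flags and
register_shrinker() are the interface extended by this series:

	static struct shrinker example_shrinker = {
		.count_objects	= example_count,
		.scan_objects	= example_scan,
		.seeks		= DEFAULT_SEEKS,
		.flags		= SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE,
	};

	/* somewhere on the cache initialization path */
	register_shrinker(&example_shrinker);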

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/gfs2/quota.c  |    2 +-
 fs/super.c       |    2 +-
 fs/xfs/xfs_buf.c |    2 +-
 fs/xfs/xfs_qm.c  |    2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index f0435da..6cf6114 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -150,7 +150,7 @@ struct shrinker gfs2_qd_shrinker = {
 	.count_objects = gfs2_qd_shrink_count,
 	.scan_objects = gfs2_qd_shrink_scan,
 	.seeks = DEFAULT_SEEKS,
-	.flags = SHRINKER_NUMA_AWARE,
+	.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE,
 };
 
 
diff --git a/fs/super.c b/fs/super.c
index 8f9a81b..05bead8 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -219,7 +219,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 	s->s_shrink.scan_objects = super_cache_scan;
 	s->s_shrink.count_objects = super_cache_count;
 	s->s_shrink.batch = 1024;
-	s->s_shrink.flags = SHRINKER_NUMA_AWARE;
+	s->s_shrink.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE;
 	return s;
 
 fail:
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 5b2a49c..d8326b6 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1679,7 +1679,7 @@ xfs_alloc_buftarg(
 	btp->bt_shrinker.count_objects = xfs_buftarg_shrink_count;
 	btp->bt_shrinker.scan_objects = xfs_buftarg_shrink_scan;
 	btp->bt_shrinker.seeks = DEFAULT_SEEKS;
-	btp->bt_shrinker.flags = SHRINKER_NUMA_AWARE;
+	btp->bt_shrinker.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE;
 	register_shrinker(&btp->bt_shrinker);
 	return btp;
 
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index aaacf8f..1f9bbb5 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -903,7 +903,7 @@ xfs_qm_init_quotainfo(
 	qinf->qi_shrinker.count_objects = xfs_qm_shrink_count;
 	qinf->qi_shrinker.scan_objects = xfs_qm_shrink_scan;
 	qinf->qi_shrinker.seeks = DEFAULT_SEEKS;
-	qinf->qi_shrinker.flags = SHRINKER_NUMA_AWARE;
+	qinf->qi_shrinker.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE;
 	register_shrinker(&qinf->qi_shrinker);
 	return 0;
 }
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH v13 13/16] vmscan: take at least one pass with shrinkers
  2013-12-09  8:05 ` Vladimir Davydov
@ 2013-12-09  8:05   ` Vladimir Davydov
  -1 siblings, 0 replies; 81+ messages in thread
From: Vladimir Davydov @ 2013-12-09  8:05 UTC (permalink / raw)
  To: dchinner, hannes, mhocko, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, glommer,
	vdavydov, Glauber Costa, Mel Gorman, Rik van Riel

From: Glauber Costa <glommer@openvz.org>

In very low free kernel memory situations, we may have fewer objects to
free than our initial batch size. If that is the case, it is better to
shrink those and open up space for the new workload than to keep them
and fail the new allocations.

In particular, we are concerned with the direct reclaim case for memcg.
Although the same technique could be applied to other situations just as
well, we start conservatively and apply it only to that case, which is
the one that matters most.

Signed-off-by: Glauber Costa <glommer@openvz.org>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c |   13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1997813..b2a5be9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -281,17 +281,22 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
 				nr_pages_scanned, lru_pages,
 				max_pass, delta, total_scan);
 
-	while (total_scan >= batch_size) {
+	while (total_scan > 0) {
 		unsigned long ret;
+		unsigned long nr_to_scan = min(batch_size, total_scan);
 
-		shrinkctl->nr_to_scan = batch_size;
+		if (!shrinkctl->target_mem_cgroup &&
+		    total_scan < batch_size)
+			break;
+
+		shrinkctl->nr_to_scan = nr_to_scan;
 		ret = shrinker->scan_objects(shrinker, shrinkctl);
 		if (ret == SHRINK_STOP)
 			break;
 		freed += ret;
 
-		count_vm_events(SLABS_SCANNED, batch_size);
-		total_scan -= batch_size;
+		count_vm_events(SLABS_SCANNED, nr_to_scan);
+		total_scan -= nr_to_scan;
 
 		cond_resched();
 	}
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH v13 14/16] vmpressure: in-kernel notifications
  2013-12-09  8:05 ` Vladimir Davydov
@ 2013-12-09  8:05   ` Vladimir Davydov
  -1 siblings, 0 replies; 81+ messages in thread
From: Vladimir Davydov @ 2013-12-09  8:05 UTC (permalink / raw)
  To: dchinner, hannes, mhocko, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, glommer,
	vdavydov, John Stultz, Joonsoo Kim, Kamezawa Hiroyuki

From: Glauber Costa <glommer@openvz.org>

During the past weeks, it became clear to us that the shrinker interface
we have right now works very well for some particular types of users,
but not so well for others. The latter are usually users interested in
one-shot notifications, who were forced to adapt themselves to the
count+scan behavior of shrinkers. To do so, they had no choice but to
greatly abuse the shrinker interface, producing little monsters all over.

During LSF/MM, one of the proposals that popped out during our session
was to reuse Anton Vorontsov's vmpressure for this. vmpressure events are
designed for userspace consumption, but they also provide a
well-established, cgroup-aware entry point for notifications.

This patch extends that to also support in-kernel users. Events that
should be generated for in-kernel consumption are marked as such, and for
those we call a registered function instead of triggering an eventfd
notification.

Please note that, due to my lack of understanding of each shrinker user,
I will stay away from converting the actual users; you are all welcome
to do so.
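
For illustration only, an in-kernel user of the new hook would look
roughly like this (the callback name and call site are hypothetical and
not part of this patch):

	/* Called when the watched cgroup comes under memory pressure. */
	static void my_pressure_callback(void)
	{
		/* one-shot reaction, e.g. flush a private cache */
	}

	/* at initialization time, for the cgroup_subsys_state of interest */
	int err = vmpressure_register_kernel_event(css, my_pressure_callback);
	if (err)
		return err;	/* can only fail with -ENOMEM */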

Signed-off-by: Glauber Costa <glommer@openvz.org>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Anton Vorontsov <anton@enomsg.org>
Acked-by: Pekka Enberg <penberg@kernel.org>
Reviewed-by: Greg Thelen <gthelen@google.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/vmpressure.h |    5 +++++
 mm/vmpressure.c            |   53 +++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 55 insertions(+), 3 deletions(-)

diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
index 3f3788d..9102e53 100644
--- a/include/linux/vmpressure.h
+++ b/include/linux/vmpressure.h
@@ -19,6 +19,9 @@ struct vmpressure {
 	/* Have to grab the lock on events traversal or modifications. */
 	struct mutex events_lock;
 
+	/* False if only kernel users want to be notified, true otherwise. */
+	bool notify_userspace;
+
 	struct work_struct work;
 };
 
@@ -38,6 +41,8 @@ extern int vmpressure_register_event(struct cgroup_subsys_state *css,
 				     struct cftype *cft,
 				     struct eventfd_ctx *eventfd,
 				     const char *args);
+extern int vmpressure_register_kernel_event(struct cgroup_subsys_state *css,
+					    void (*fn)(void));
 extern void vmpressure_unregister_event(struct cgroup_subsys_state *css,
 					struct cftype *cft,
 					struct eventfd_ctx *eventfd);
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index e0f6283..730e7c1 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -130,8 +130,12 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
 }
 
 struct vmpressure_event {
-	struct eventfd_ctx *efd;
+	union {
+		struct eventfd_ctx *efd;
+		void (*fn)(void);
+	};
 	enum vmpressure_levels level;
+	bool kernel_event;
 	struct list_head node;
 };
 
@@ -147,12 +151,15 @@ static bool vmpressure_event(struct vmpressure *vmpr,
 	mutex_lock(&vmpr->events_lock);
 
 	list_for_each_entry(ev, &vmpr->events, node) {
-		if (level >= ev->level) {
+		if (ev->kernel_event) {
+			ev->fn();
+		} else if (vmpr->notify_userspace && level >= ev->level) {
 			eventfd_signal(ev->efd, 1);
 			signalled = true;
 		}
 	}
 
+	vmpr->notify_userspace = false;
 	mutex_unlock(&vmpr->events_lock);
 
 	return signalled;
@@ -222,7 +229,7 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
 	 * we account it too.
 	 */
 	if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
-		return;
+		goto schedule;
 
 	/*
 	 * If we got here with no pages scanned, then that is an indicator
@@ -239,8 +246,15 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
 	vmpr->scanned += scanned;
 	vmpr->reclaimed += reclaimed;
 	scanned = vmpr->scanned;
+	/*
+	 * If we didn't reach this point, only kernel events will be triggered.
+	 * It is the job of the worker thread to clean this up once the
+	 * notifications are all delivered.
+	 */
+	vmpr->notify_userspace = true;
 	spin_unlock(&vmpr->sr_lock);
 
+schedule:
 	if (scanned < vmpressure_win)
 		return;
 	schedule_work(&vmpr->work);
@@ -324,6 +338,39 @@ int vmpressure_register_event(struct cgroup_subsys_state *css,
 }
 
 /**
+ * vmpressure_register_kernel_event() - Register kernel-side notification
+ * @css:	css that is interested in vmpressure notifications
+ * @fn:		function to be called when pressure happens
+ *
+ * This function registers in-kernel users interested in receiving
+ * notifications about pressure conditions. Pressure notifications will be
+ * triggered at the same time as userspace notifications (with no particular
+ * ordering relative to them).
+ *
+ * Pressure notifications are an alternative to shrinkers and serve well
+ * those users that are interested in a one-shot notification, with a
+ * well-defined, cgroup-aware interface.
+ */
+int vmpressure_register_kernel_event(struct cgroup_subsys_state *css,
+				      void (*fn)(void))
+{
+	struct vmpressure *vmpr = css_to_vmpressure(css);
+	struct vmpressure_event *ev;
+
+	ev = kzalloc(sizeof(*ev), GFP_KERNEL);
+	if (!ev)
+		return -ENOMEM;
+
+	ev->kernel_event = true;
+	ev->fn = fn;
+
+	mutex_lock(&vmpr->events_lock);
+	list_add(&ev->node, &vmpr->events);
+	mutex_unlock(&vmpr->events_lock);
+	return 0;
+}
+
+/**
  * vmpressure_unregister_event() - Unbind eventfd from vmpressure
  * @css:	css handle
  * @cft:	cgroup control files handle
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH v13 15/16] memcg: reap dead memcgs upon global memory pressure
  2013-12-09  8:05 ` Vladimir Davydov
@ 2013-12-09  8:05   ` Vladimir Davydov
  -1 siblings, 0 replies; 81+ messages in thread
From: Vladimir Davydov @ 2013-12-09  8:05 UTC (permalink / raw)
  To: dchinner, hannes, mhocko, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, glommer,
	vdavydov, Anton Vorontsov, John Stultz, Kamezawa Hiroyuki

From: Glauber Costa <glommer@openvz.org>

When we delete kmem-enabled memcgs, they can still hang around as
zombies for a while. The reason is that their objects may still be
alive, so we are not able to delete them at destruction time.

The only entry point for reclaiming those objects, though, is the
shrinkers. The shrinker interface, however, is not exactly tailored to
our needs. It could be made a little better by using the API Dave
Chinner proposed, but it is still not ideal, since what we want is not
really a count-and-scan operation but a one-off flush-all-you-can
operation that would have to abuse that interface somehow.

Signed-off-by: Glauber Costa <glommer@openvz.org>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c |   80 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 77 insertions(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b15219e..182199f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -287,8 +287,16 @@ struct mem_cgroup {
 	/* thresholds for mem+swap usage. RCU-protected */
 	struct mem_cgroup_thresholds memsw_thresholds;
 
-	/* For oom notifier event fd */
-	struct list_head oom_notify;
+	union {
+		/* For oom notifier event fd */
+		struct list_head oom_notify;
+		/*
+		 * we can only trigger an oom event if the memcg is alive.
+		 * so we will reuse this field to hook the memcg in the list
+		 * of dead memcgs.
+		 */
+		struct list_head dead;
+	};
 
 	/*
 	 * Should we move charges of a task when a task is moved into this
@@ -336,6 +344,29 @@ struct mem_cgroup {
 	/* WARNING: nodeinfo must be the last member here */
 };
 
+#if defined(CONFIG_MEMCG_KMEM) || defined(CONFIG_MEMCG_SWAP)
+static LIST_HEAD(dangling_memcgs);
+static DEFINE_MUTEX(dangling_memcgs_mutex);
+
+static inline void memcg_dangling_del(struct mem_cgroup *memcg)
+{
+	mutex_lock(&dangling_memcgs_mutex);
+	list_del(&memcg->dead);
+	mutex_unlock(&dangling_memcgs_mutex);
+}
+
+static inline void memcg_dangling_add(struct mem_cgroup *memcg)
+{
+	INIT_LIST_HEAD(&memcg->dead);
+	mutex_lock(&dangling_memcgs_mutex);
+	list_add(&memcg->dead, &dangling_memcgs);
+	mutex_unlock(&dangling_memcgs_mutex);
+}
+#else
+static inline void memcg_dangling_del(struct mem_cgroup *memcg) {}
+static inline void memcg_dangling_add(struct mem_cgroup *memcg) {}
+#endif
+
 static size_t memcg_size(void)
 {
 	return sizeof(struct mem_cgroup) +
@@ -6085,6 +6116,41 @@ static int mem_cgroup_oom_control_write(struct cgroup_subsys_state *css,
 }
 
 #ifdef CONFIG_MEMCG_KMEM
+static void memcg_vmpressure_shrink_dead(void)
+{
+	struct memcg_cache_params *params, *tmp;
+	struct kmem_cache *cachep;
+	struct mem_cgroup *memcg;
+
+	mutex_lock(&dangling_memcgs_mutex);
+	list_for_each_entry(memcg, &dangling_memcgs, dead) {
+		mutex_lock(&memcg->slab_caches_mutex);
+		/* The element may go away as an indirect result of shrink */
+		list_for_each_entry_safe(params, tmp,
+					 &memcg->memcg_slab_caches, list) {
+			cachep = memcg_params_to_cache(params);
+			/*
+			 * The cpu_hotplug lock is taken in kmem_cache_create
+			 * outside the slab_caches_mutex manipulation. It will
+			 * also be taken by kmem_cache_shrink to flush the
+			 * cache, so we need to drop the mutex here. That is
+			 * all right, because the mutex only protects elements
+			 * moving in and out of the list.
+			 */
+			mutex_unlock(&memcg->slab_caches_mutex);
+			kmem_cache_shrink(cachep);
+			mutex_lock(&memcg->slab_caches_mutex);
+		}
+		mutex_unlock(&memcg->slab_caches_mutex);
+	}
+	mutex_unlock(&dangling_memcgs_mutex);
+}
+
+static void memcg_register_kmem_events(struct cgroup_subsys_state *css)
+{
+	vmpressure_register_kernel_event(css, memcg_vmpressure_shrink_dead);
+}
+
 static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 {
 	int ret;
@@ -6139,6 +6205,10 @@ static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
 		css_put(&memcg->css);
 }
 #else
+static inline void memcg_register_kmem_events(struct cgroup *cont)
+{
+}
+
 static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 {
 	return 0;
@@ -6477,8 +6547,10 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
 	if (css->cgroup->id > MEM_CGROUP_ID_MAX)
 		return -ENOSPC;
 
-	if (!parent)
+	if (!parent) {
+		memcg_register_kmem_events(css);
 		return 0;
+	}
 
 	mutex_lock(&memcg_create_mutex);
 
@@ -6540,6 +6612,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	mem_cgroup_invalidate_reclaim_iterators(memcg);
 	mem_cgroup_reparent_charges(memcg);
 	mem_cgroup_destroy_all_caches(memcg);
+	memcg_dangling_add(memcg);
 	vmpressure_cleanup(&memcg->vmpressure);
 }
 
@@ -6548,6 +6621,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 
 	memcg_destroy_kmem(memcg);
+	memcg_dangling_del(memcg);
 	__mem_cgroup_free(memcg);
 }
 
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* [PATCH v13 16/16] memcg: flush memcg items upon memcg destruction
  2013-12-09  8:05 ` Vladimir Davydov
@ 2013-12-09  8:05   ` Vladimir Davydov
  -1 siblings, 0 replies; 81+ messages in thread
From: Vladimir Davydov @ 2013-12-09  8:05 UTC (permalink / raw)
  To: dchinner, hannes, mhocko, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, glommer,
	vdavydov, Balbir Singh, KAMEZAWA Hiroyuki

From: Glauber Costa <glommer@openvz.org>

When a memcg is destroyed, it is not released immediately: it lingers
until all of its objects are gone. This means that if a memcg is
restarted with the very same workload (a very common case), the objects
already cached won't be billed to the new memcg. This is mostly
undesirable, since a container can exploit this by restarting itself
every time it reaches its limit and thereby coming up again with a
fresh new limit.

Now that we have targeted reclaim, I maintain that a memcg that is
destroyed should be flushed away. This makes perfect sense if we assume
that a memcg that goes away most likely indicates an isolated workload
that has terminated.

Signed-off-by: Glauber Costa <glommer@openvz.org>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 182199f..65ef284 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6171,12 +6171,40 @@ static void memcg_destroy_kmem(struct mem_cgroup *memcg)
 	memcg_destroy_all_lrus(memcg);
 }
 
+static void memcg_drop_slab(struct mem_cgroup *memcg)
+{
+	struct shrink_control shrink = {
+		.gfp_mask = GFP_KERNEL,
+		.target_mem_cgroup = memcg,
+	};
+	unsigned long nr_objects;
+
+	nodes_setall(shrink.nodes_to_scan);
+	do {
+		nr_objects = shrink_slab(&shrink, 1000, 1000);
+	} while (nr_objects > 0);
+}
+
 static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
 {
 	if (!memcg_kmem_is_active(memcg))
 		return;
 
 	/*
+	 * When a memcg is destroyed, it is not released immediately: it
+	 * lingers until all its objects are gone. This means that if a memcg
+	 * is restarted with the very same workload (a very common case), the
+	 * objects already cached won't be billed to the new memcg. This is
+	 * mostly undesirable, since a container can exploit it by restarting
+	 * itself every time it reaches its limit, coming up with a fresh one.
+	 *
+	 * Therefore a memcg that is destroyed should be flushed away. This
+	 * makes perfect sense if we assume that a memcg that goes away
+	 * indicates an isolated workload that has terminated.
+	 */
+	memcg_drop_slab(memcg);
+
+	/*
 	 * kmem charges can outlive the cgroup. In the case of slab
 	 * pages, for instance, a page contain objects from various
 	 * processes. As we prevent from taking a reference for every
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 81+ messages in thread

* Re: [PATCH v13 08/16] mm: list_lru: require shrink_control in count, walk functions
  2013-12-09  8:05   ` Vladimir Davydov
@ 2013-12-10  1:36     ` Dave Chinner
  -1 siblings, 0 replies; 81+ messages in thread
From: Dave Chinner @ 2013-12-10  1:36 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: dchinner, hannes, mhocko, akpm, linux-kernel, linux-mm, cgroups,
	devel, glommer, glommer, Al Viro

On Mon, Dec 09, 2013 at 12:05:49PM +0400, Vladimir Davydov wrote:
> To enable targeted reclaim, the list_lru structure distributes its
> elements among several LRU lists. Currently, there is one LRU per NUMA
> node, and the elements from different nodes are placed to different
> LRUs. As a result there are two versions of count and walk functions:
> 
>  - list_lru_count, list_lru_walk - count, walk items from all nodes;
>  - list_lru_count_node, list_lru_walk_node - count, walk items from a
>    particular node specified in an additional argument.
> 
> We are going to make the list_lru structure per-memcg in addition to
> being per-node. This would allow us to reclaim slab not only on global
> memory shortage, but also on memcg pressure. If we followed the current
> list_lru interface notation, we would have to add a bunch of new
> functions taking a memcg and a node in additional arguments, which would
> look cumbersome.
> 
> To avoid this, we remove the *_node functions and make list_lru_count
> and list_lru_walk require a shrink_control argument so that they will

I don't think that's a good idea. You've had to leave the nr_to_scan
parameter in the API because there are now callers of
list_lru_walk() that don't pass a shrink control structure. IOWs,
you've tried to handle two different caller contexts with the one
function API, when they really should remain separate and not
require internal branching based on what parameters were set in the
API.

i.e. list_lru_walk() is for callers that don't have a shrink control
structure and want to walk every entry in the LRU, regardless of the
internal structure.

list_lru_walk_node() is for callers that don't have a shrink control
structure and just want to walk items on a single node. This is the
interface NUMA aware callers are using.

list_lru_shrink_walk() is for callers that pass all walk control
parameters via a struct shrink_control. It's not supposed to be used
for interfaces that don't have a shrink_control context....

Same goes for list_lru_count()....

IOWs, you should not be modifying a single user of list_lru_walk()
or list_lru_count() - only the callers that use the _node() variants
and are going to be marked as memcg aware need to be changed at this
point. i.e. keep the number of subsystems you actually need to
modify down to a minimum to keep the test matrix reasonable.
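
(In prototype form, the split being described is roughly the following;
the signatures are illustrative, pieced together from the existing
list_lru code and the _shrink_ sketches at the end of this mail.)

	/* existing, unchanged per-node interfaces */
	unsigned long list_lru_count_node(struct list_lru *lru, int nid);
	unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
					 list_lru_walk_cb isolate, void *cb_arg,
					 unsigned long *nr_to_walk);

	/* new shrink_control-based wrappers for memcg-aware shrinkers */
	unsigned long list_lru_shrink_count(struct list_lru *lru,
					    struct shrink_control *sc);
	unsigned long list_lru_shrink_walk(struct list_lru *lru,
					   struct shrink_control *sc,
					   list_lru_walk_cb isolate,
					   void *cb_arg);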

> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -972,8 +972,8 @@ dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg)
>  /**
>   * prune_dcache_sb - shrink the dcache
>   * @sb: superblock
> - * @nr_to_scan : number of entries to try to free
> - * @nid: which node to scan for freeable entities
> + * @sc: shrink control, passed to list_lru_walk()
> + * @nr_to_scan: number of entries to try to free
>   *
>   * Attempt to shrink the superblock dcache LRU by @nr_to_scan entries. This is
>   * done when we need more memory an called from the superblock shrinker
> @@ -982,14 +982,14 @@ dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg)
>   * This function may fail to free any resources if all the dentries are in
>   * use.
>   */
> -long prune_dcache_sb(struct super_block *sb, unsigned long nr_to_scan,
> -		     int nid)
> +long prune_dcache_sb(struct super_block *sb, struct shrink_control *sc,
> +		     unsigned long nr_to_scan)
>  {
>  	LIST_HEAD(dispose);
>  	long freed;
>  
> -	freed = list_lru_walk_node(&sb->s_dentry_lru, nid, dentry_lru_isolate,
> -				       &dispose, &nr_to_scan);
> +	freed = list_lru_walk(&sb->s_dentry_lru, sc, dentry_lru_isolate,
> +			      &dispose, &nr_to_scan);

Two things here - nr_to_scan should be passed to prune_dcache_sb()
inside the shrink_control. i.e. as sc->nr_to_scan.

Secondly, why is &nr_to_scan being passed as a pointer to
list_lru_walk()? It's not a variable that has any value being
returned to the caller - how list_lru_walk() uses it is entirely
opaque to the caller, and the only return value that matters is the
number of objects freed. i.e. the number moved to the dispose
list.

list_lru_walk_node() is a different matter - a scan might involve
walking multiple nodes (e.g. the internal list_lru_walk()
implementation) and so the nr_to_scan context can span multiple
list_lru_walk_node() calls....

In any case, it should be a call like this here:

	freed = list_lru_shrink_walk(&sb->s_dentry_lru, sc, dentry_lru_isolate,
				     &dispose);

> diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
> index 98236d0..f0435da 100644
> --- a/fs/gfs2/quota.c
> +++ b/fs/gfs2/quota.c
> @@ -132,8 +132,8 @@ static unsigned long gfs2_qd_shrink_scan(struct shrinker *shrink,
>  	if (!(sc->gfp_mask & __GFP_FS))
>  		return SHRINK_STOP;
>  
> -	freed = list_lru_walk_node(&gfs2_qd_lru, sc->nid, gfs2_qd_isolate,
> -				   &dispose, &sc->nr_to_scan);
> +	freed = list_lru_walk(&gfs2_qd_lru, sc, gfs2_qd_isolate,
> +			      &dispose, &sc->nr_to_scan);

And this kind of points out the strangeness of this API. You're
passing the sc to the function, then passing the nr_to_scan out of
the sc structure....

As it is, this shrinker only needs to be node aware - it does not
ever need to be memcg aware because it is dealing with filesystem
internal structures that are of global scope....

i.e. these should all remain untouched as list_lru_*_node() calls.

> @@ -78,8 +78,8 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
>  	if (sb->s_op->nr_cached_objects)
>  		fs_objects = sb->s_op->nr_cached_objects(sb, sc->nid);

sc needs to be propagated down into .nr_cached_objects. That's not
an optional extra - it needs to have the same information as the
dentry and inode cache pruners.
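
Something along these lines for the super_operations methods (an
illustrative sketch only; patch 09/16 in this series makes the
equivalent consolidation):

	long (*nr_cached_objects)(struct super_block *sb,
				  struct shrink_control *sc);
	long (*free_cached_objects)(struct super_block *sb,
				    struct shrink_control *sc);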

> -	inodes = list_lru_count_node(&sb->s_inode_lru, sc->nid);
> -	dentries = list_lru_count_node(&sb->s_dentry_lru, sc->nid);
> +	inodes = list_lru_count(&sb->s_inode_lru, sc);
> +	dentries = list_lru_count(&sb->s_dentry_lru, sc);
>  	total_objects = dentries + inodes + fs_objects + 1;

list_lru_shrink_count()

> @@ -90,8 +90,8 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
>  	 * prune the dcache first as the icache is pinned by it, then
>  	 * prune the icache, followed by the filesystem specific caches
>  	 */
> -	freed = prune_dcache_sb(sb, dentries, sc->nid);
> -	freed += prune_icache_sb(sb, inodes, sc->nid);
> +	freed = prune_dcache_sb(sb, sc, dentries);
> +	freed += prune_icache_sb(sb, sc, inodes);

As I commented last time:

	sc->nr_to_scan = dentries;
	freed = prune_dcache_sb(sb, sc);
	sc->nr_to_scan = inodes;
	freed += prune_icache_sb(sb, sc);
	if (fs_objects) {
		sc->nr_to_scan = mult_frac(sc->nr_to_scan, fs_objects,
					   total_objects);
		freed += sb->s_op->free_cached_objects(sb, sc);
	}

> -	total_objects += list_lru_count_node(&sb->s_dentry_lru,
> -						 sc->nid);
> -	total_objects += list_lru_count_node(&sb->s_inode_lru,
> -						 sc->nid);
> +	total_objects += list_lru_count(&sb->s_dentry_lru, sc);
> +	total_objects += list_lru_count(&sb->s_inode_lru, sc);

list_lru_shrink_count()

> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index c7f0b77..5b2a49c 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1508,9 +1508,11 @@ xfs_wait_buftarg(
>  	int loop = 0;
>  
>  	/* loop until there is nothing left on the lru list. */
> -	while (list_lru_count(&btp->bt_lru)) {
> -		list_lru_walk(&btp->bt_lru, xfs_buftarg_wait_rele,
> -			      &dispose, LONG_MAX);
> +	while (list_lru_count(&btp->bt_lru, NULL)) {
> +		unsigned long nr_to_scan = ULONG_MAX;
> +
> +		list_lru_walk(&btp->bt_lru, NULL, xfs_buftarg_wait_rele,
> +			      &dispose, &nr_to_scan);
>  
>  		while (!list_empty(&dispose)) {
>  			struct xfs_buf *bp;
> @@ -1565,8 +1567,8 @@ xfs_buftarg_shrink_scan(
>  	unsigned long		freed;
>  	unsigned long		nr_to_scan = sc->nr_to_scan;
>  
> -	freed = list_lru_walk_node(&btp->bt_lru, sc->nid, xfs_buftarg_isolate,
> -				       &dispose, &nr_to_scan);
> +	freed = list_lru_walk(&btp->bt_lru, sc, xfs_buftarg_isolate,
> +			      &dispose, &nr_to_scan);

No, this will never be made memcg aware because it's a global
filesystem metadata cache, so it should remain using
list_lru_walk_node().

i.e. don't touch stuff you don't need to touch.

>  	while (!list_empty(&dispose)) {
>  		struct xfs_buf *bp;
> @@ -1585,7 +1587,7 @@ xfs_buftarg_shrink_count(
>  {
>  	struct xfs_buftarg	*btp = container_of(shrink,
>  					struct xfs_buftarg, bt_shrinker);
> -	return list_lru_count_node(&btp->bt_lru, sc->nid);
> +	return list_lru_count(&btp->bt_lru, sc);
>  }
>  
>  void
> diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> index 14a4996..aaacf8f 100644
> --- a/fs/xfs/xfs_qm.c
> +++ b/fs/xfs/xfs_qm.c
> @@ -769,8 +769,8 @@ xfs_qm_shrink_scan(
>  	INIT_LIST_HEAD(&isol.buffers);
>  	INIT_LIST_HEAD(&isol.dispose);
>  
> -	freed = list_lru_walk_node(&qi->qi_lru, sc->nid, xfs_qm_dquot_isolate, &isol,
> -					&nr_to_scan);
> +	freed = list_lru_walk(&qi->qi_lru, sc, xfs_qm_dquot_isolate, &isol,
> +			      &nr_to_scan);

Same here.

>  
>  	error = xfs_buf_delwri_submit(&isol.buffers);
>  	if (error)
> @@ -795,7 +795,7 @@ xfs_qm_shrink_count(
>  	struct xfs_quotainfo	*qi = container_of(shrink,
>  					struct xfs_quotainfo, qi_shrinker);
>  
> -	return list_lru_count_node(&qi->qi_lru, sc->nid);
> +	return list_lru_count(&qi->qi_lru, sc);
>  }
>  
>  /*
> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
> index 3ce5417..34e57af 100644
> --- a/include/linux/list_lru.h
> +++ b/include/linux/list_lru.h
> @@ -10,6 +10,8 @@
....

None of this should change - there should just be new prototypes for
list_lru_shrink_walk() and list_lru_shrink_count().

>  
> -unsigned long
> +unsigned long list_lru_count(struct list_lru *lru, struct shrink_control *sc)
> +{
> +	long count = 0;
> +	int nid;
> +
> +	if (sc)
> +		return list_lru_count_node(lru, sc->nid);
> +
> +	for_each_node_mask(nid, lru->active_nodes)
> +		count += list_lru_count_node(lru, nid);
> +
> +	return count;
> +}
> +EXPORT_SYMBOL_GPL(list_lru_count);

In fact:

long
list_lru_shrink_count(struct list_lru *lru, struct shrink_control *sc)
{
	return list_lru_count_node(lru, sc->nid);
}

> +
> +static unsigned long
>  list_lru_walk_node(struct list_lru *lru, int nid, list_lru_walk_cb isolate,
>  		   void *cb_arg, unsigned long *nr_to_walk)
>  {
> @@ -112,7 +127,27 @@ restart:
>  	spin_unlock(&nlru->lock);
>  	return isolated;
>  }
> -EXPORT_SYMBOL_GPL(list_lru_walk_node);
> +
> +unsigned long list_lru_walk(struct list_lru *lru, struct shrink_control *sc,
> +			    list_lru_walk_cb isolate, void *cb_arg,
> +			    unsigned long *nr_to_walk)
> +{
> +	long isolated = 0;
> +	int nid;
> +
> +	if (sc)
> +		return list_lru_walk_node(lru, sc->nid, isolate,
> +					  cb_arg, nr_to_walk);
> +
> +	for_each_node_mask(nid, lru->active_nodes) {
> +		isolated += list_lru_walk_node(lru, nid, isolate,
> +					       cb_arg, nr_to_walk);
> +		if (*nr_to_walk <= 0)
> +			break;
> +	}
> +	return isolated;
> +}
> +EXPORT_SYMBOL_GPL(list_lru_walk);

long
list_lru_shrink_walk(struct list_lru *lru, struct shrink_control *sc,
		     list_lru_walk_cb isolate, void *cb_arg)
{
	return list_lru_walk_node(lru, sc->nid, isolate, cb_arg,
				  &sc->nr_to_scan);
}

i.e. adding shrink_control interfaces to the list_lru code only
requires the addition of two simple functions, not a major
API and implementation rework....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v13 08/16] mm: list_lru: require shrink_control in count, walk functions
@ 2013-12-10  1:36     ` Dave Chinner
  0 siblings, 0 replies; 81+ messages in thread
From: Dave Chinner @ 2013-12-10  1:36 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: dchinner, hannes, mhocko, akpm, linux-kernel, linux-mm, cgroups,
	devel, glommer, glommer, Al Viro

On Mon, Dec 09, 2013 at 12:05:49PM +0400, Vladimir Davydov wrote:
> To enable targeted reclaim, the list_lru structure distributes its
> elements among several LRU lists. Currently, there is one LRU per NUMA
> node, and the elements from different nodes are placed to different
> LRUs. As a result there are two versions of count and walk functions:
> 
>  - list_lru_count, list_lru_walk - count, walk items from all nodes;
>  - list_lru_count_node, list_lru_walk_node - count, walk items from a
>    particular node specified in an additional argument.
> 
> We are going to make the list_lru structure per-memcg in addition to
> being per-node. This would allow us to reclaim slab not only on global
> memory shortage, but also on memcg pressure. If we followed the current
> list_lru interface notation, we would have to add a bunch of new
> functions taking a memcg and a node in additional arguments, which would
> look cumbersome.
> 
> To avoid this, we remove the *_node functions and make list_lru_count
> and list_lru_walk require a shrink_control argument so that they will

I don't think that's a good idea. You've had to leave the nr_to_scan
parameter in the API because there are now callers of
list_lru_walk() that don't pass a shrink control structure. IOWs,
you've tried to handle two different caller contexts with the one
function API, when they really should remain separate and not
require internal branching base don what parameters were set in the
API.

i.e. list_lru_walk() is for callers that don't have a shrink control
structure and want to walk ever entry in the LRU, regardless of the
internal structure.

list_lru_walk_node() is for callers that don't have a shrink control
structure and just want to walk items on a single node. This is the
interface NUMA aware callers are using.

list_lru_shrink_walk() is for callers that pass all walk control
parameters via a struct shrink_control. It's not supposed to be used
for interfaces that don't have a shrink_control context....

Same goes for list_lru_count()....

IOWs, you should not be modifying a single user of list_lru_walk()
or list_lru_count() - only those that use the _node() variants and
are going to be marked as memcg aware that need to be changed at
this point. i.e. keep the number of subsystems you actually need to
modify down to a minimum to keep the test matrix reasonable.

> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -972,8 +972,8 @@ dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg)
>  /**
>   * prune_dcache_sb - shrink the dcache
>   * @sb: superblock
> - * @nr_to_scan : number of entries to try to free
> - * @nid: which node to scan for freeable entities
> + * @sc: shrink control, passed to list_lru_walk()
> + * @nr_to_scan: number of entries to try to free
>   *
>   * Attempt to shrink the superblock dcache LRU by @nr_to_scan entries. This is
>   * done when we need more memory an called from the superblock shrinker
> @@ -982,14 +982,14 @@ dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg)
>   * This function may fail to free any resources if all the dentries are in
>   * use.
>   */
> -long prune_dcache_sb(struct super_block *sb, unsigned long nr_to_scan,
> -		     int nid)
> +long prune_dcache_sb(struct super_block *sb, struct shrink_control *sc,
> +		     unsigned long nr_to_scan)
>  {
>  	LIST_HEAD(dispose);
>  	long freed;
>  
> -	freed = list_lru_walk_node(&sb->s_dentry_lru, nid, dentry_lru_isolate,
> -				       &dispose, &nr_to_scan);
> +	freed = list_lru_walk(&sb->s_dentry_lru, sc, dentry_lru_isolate,
> +			      &dispose, &nr_to_scan);

Two things here - nr_to_scan should be passed to prune_dcache_sb()
inside the shrink_control. i.e. as sc->nr_to_scan.

Secondly, why is &nr_to_scan being passed as a pointer to
list_lru_walk()? It's not a variable that has any value being
returned to the caller - how list_lru_walk() uses it is entirely
opaque to the caller, and the only return value that matters if the
number of objects freed. i.e. the number moved to the dispose
list.

list_lru_walk_node() is a different matter - a scan might involve
walking multiple nodes (e.g. the internal list_lru_walk()
implementation) and so the nr_to_scan context can span multiple
list_lru_walk_node() calls....

In any case, it should be a call like this here:

	freed = list_lru_shrink_walk(&sb->s_dentry_lru, sc, dentry_lru_isolate,
				     &dispose);

> diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
> index 98236d0..f0435da 100644
> --- a/fs/gfs2/quota.c
> +++ b/fs/gfs2/quota.c
> @@ -132,8 +132,8 @@ static unsigned long gfs2_qd_shrink_scan(struct shrinker *shrink,
>  	if (!(sc->gfp_mask & __GFP_FS))
>  		return SHRINK_STOP;
>  
> -	freed = list_lru_walk_node(&gfs2_qd_lru, sc->nid, gfs2_qd_isolate,
> -				   &dispose, &sc->nr_to_scan);
> +	freed = list_lru_walk(&gfs2_qd_lru, sc, gfs2_qd_isolate,
> +			      &dispose, &sc->nr_to_scan);

And this kind of points out the strangeness of this API. You're
passing the sc to the function, then passing the nr_to_scan out of
the sc structure....

As it is, this shrinker only needs to be node aware - it does not
ever need to be memcg aware because it is dealing with filesystem
internal structures that are of global scope....

i.e. these should all remain untouched as list_lru_*_node() calls.

> @@ -78,8 +78,8 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
>  	if (sb->s_op->nr_cached_objects)
>  		fs_objects = sb->s_op->nr_cached_objects(sb, sc->nid);

sc needs to be propagated down into .nr_cached_objects. That's not
an optional extra - it needs to have the same information as the
dentry and inode cache pruners.

> -	inodes = list_lru_count_node(&sb->s_inode_lru, sc->nid);
> -	dentries = list_lru_count_node(&sb->s_dentry_lru, sc->nid);
> +	inodes = list_lru_count(&sb->s_inode_lru, sc);
> +	dentries = list_lru_count(&sb->s_dentry_lru, sc);
>  	total_objects = dentries + inodes + fs_objects + 1;

list_lru_shrink_count()

> @@ -90,8 +90,8 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
>  	 * prune the dcache first as the icache is pinned by it, then
>  	 * prune the icache, followed by the filesystem specific caches
>  	 */
> -	freed = prune_dcache_sb(sb, dentries, sc->nid);
> -	freed += prune_icache_sb(sb, inodes, sc->nid);
> +	freed = prune_dcache_sb(sb, sc, dentries);
> +	freed += prune_icache_sb(sb, sc, inodes);

As I commented last time:

	sc->nr_to_scan = dentries;
	freed = prune_dcache_sb(sb, sc);
	sc->nr_to_scan = inodes;
	freed += prune_icache_sb(sb, sc);
	if (fs_objects) {
		sc->nr_to_scan = mult_frac(sc->nr_to_scan, fs_objects,
					   total_objects);
		freed += sb->s_op->free_cached_objects(sb, sc);
	}

> -	total_objects += list_lru_count_node(&sb->s_dentry_lru,
> -						 sc->nid);
> -	total_objects += list_lru_count_node(&sb->s_inode_lru,
> -						 sc->nid);
> +	total_objects += list_lru_count(&sb->s_dentry_lru, sc);
> +	total_objects += list_lru_count(&sb->s_inode_lru, sc);

list_lru_shrink_count()

> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index c7f0b77..5b2a49c 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1508,9 +1508,11 @@ xfs_wait_buftarg(
>  	int loop = 0;
>  
>  	/* loop until there is nothing left on the lru list. */
> -	while (list_lru_count(&btp->bt_lru)) {
> -		list_lru_walk(&btp->bt_lru, xfs_buftarg_wait_rele,
> -			      &dispose, LONG_MAX);
> +	while (list_lru_count(&btp->bt_lru, NULL)) {
> +		unsigned long nr_to_scan = ULONG_MAX;
> +
> +		list_lru_walk(&btp->bt_lru, NULL, xfs_buftarg_wait_rele,
> +			      &dispose, &nr_to_scan);
>  
>  		while (!list_empty(&dispose)) {
>  			struct xfs_buf *bp;
> @@ -1565,8 +1567,8 @@ xfs_buftarg_shrink_scan(
>  	unsigned long		freed;
>  	unsigned long		nr_to_scan = sc->nr_to_scan;
>  
> -	freed = list_lru_walk_node(&btp->bt_lru, sc->nid, xfs_buftarg_isolate,
> -				       &dispose, &nr_to_scan);
> +	freed = list_lru_walk(&btp->bt_lru, sc, xfs_buftarg_isolate,
> +			      &dispose, &nr_to_scan);

No, this will never be made memcg aware because it's a global
filesystem metadata cache, so it should remain using
list_lru_walk_node().

i.e. don't touch stuff you don't need to touch.

>  	while (!list_empty(&dispose)) {
>  		struct xfs_buf *bp;
> @@ -1585,7 +1587,7 @@ xfs_buftarg_shrink_count(
>  {
>  	struct xfs_buftarg	*btp = container_of(shrink,
>  					struct xfs_buftarg, bt_shrinker);
> -	return list_lru_count_node(&btp->bt_lru, sc->nid);
> +	return list_lru_count(&btp->bt_lru, sc);
>  }
>  
>  void
> diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> index 14a4996..aaacf8f 100644
> --- a/fs/xfs/xfs_qm.c
> +++ b/fs/xfs/xfs_qm.c
> @@ -769,8 +769,8 @@ xfs_qm_shrink_scan(
>  	INIT_LIST_HEAD(&isol.buffers);
>  	INIT_LIST_HEAD(&isol.dispose);
>  
> -	freed = list_lru_walk_node(&qi->qi_lru, sc->nid, xfs_qm_dquot_isolate, &isol,
> -					&nr_to_scan);
> +	freed = list_lru_walk(&qi->qi_lru, sc, xfs_qm_dquot_isolate, &isol,
> +			      &nr_to_scan);

Same here.

>  
>  	error = xfs_buf_delwri_submit(&isol.buffers);
>  	if (error)
> @@ -795,7 +795,7 @@ xfs_qm_shrink_count(
>  	struct xfs_quotainfo	*qi = container_of(shrink,
>  					struct xfs_quotainfo, qi_shrinker);
>  
> -	return list_lru_count_node(&qi->qi_lru, sc->nid);
> +	return list_lru_count(&qi->qi_lru, sc);
>  }
>  
>  /*
> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
> index 3ce5417..34e57af 100644
> --- a/include/linux/list_lru.h
> +++ b/include/linux/list_lru.h
> @@ -10,6 +10,8 @@
....

None of this should change - there should just be new prototypes for
list_lru_shrink_walk() and list_lru_shrink_count().

>  
> -unsigned long
> +unsigned long list_lru_count(struct list_lru *lru, struct shrink_control *sc)
> +{
> +	long count = 0;
> +	int nid;
> +
> +	if (sc)
> +		return list_lru_count_node(lru, sc->nid);
> +
> +	for_each_node_mask(nid, lru->active_nodes)
> +		count += list_lru_count_node(lru, nid);
> +
> +	return count;
> +}
> +EXPORT_SYMBOL_GPL(list_lru_count);

In fact:

long
list_lru_shrink_count(struct list_lru *lru, struct shrink_control *sc)
{
	return list_lru_count_node(lru, sc->nid);
}

> +
> +static unsigned long
>  list_lru_walk_node(struct list_lru *lru, int nid, list_lru_walk_cb isolate,
>  		   void *cb_arg, unsigned long *nr_to_walk)
>  {
> @@ -112,7 +127,27 @@ restart:
>  	spin_unlock(&nlru->lock);
>  	return isolated;
>  }
> -EXPORT_SYMBOL_GPL(list_lru_walk_node);
> +
> +unsigned long list_lru_walk(struct list_lru *lru, struct shrink_control *sc,
> +			    list_lru_walk_cb isolate, void *cb_arg,
> +			    unsigned long *nr_to_walk)
> +{
> +	long isolated = 0;
> +	int nid;
> +
> +	if (sc)
> +		return list_lru_walk_node(lru, sc->nid, isolate,
> +					  cb_arg, nr_to_walk);
> +
> +	for_each_node_mask(nid, lru->active_nodes) {
> +		isolated += list_lru_walk_node(lru, nid, isolate,
> +					       cb_arg, nr_to_walk);
> +		if (*nr_to_walk <= 0)
> +			break;
> +	}
> +	return isolated;
> +}
> +EXPORT_SYMBOL_GPL(list_lru_walk);

long
list_lru_shrink_walk(struct list_lru *lru, struct shrink_control *sc,
		     list_lru_walk_cb isolate, void *cb_arg)
{
	return list_lru_walk_node(lru, sc->nid, isolate, cb_arg,
				  &sc->nr_to_scan);
}

i.e. adding shrink_control interfaces to the list_lru code only
requires the addition of two simple functions, not a major
API and implementation rework....
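
e.g. the only header change needed is a pair of prototypes matching the
two functions above (sketch only):

long list_lru_shrink_count(struct list_lru *lru, struct shrink_control *sc);
long list_lru_shrink_walk(struct list_lru *lru, struct shrink_control *sc,
			  list_lru_walk_cb isolate, void *cb_arg);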

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v13 09/16] fs: consolidate {nr,free}_cached_objects args in shrink_control
  2013-12-09  8:05   ` Vladimir Davydov
@ 2013-12-10  1:38     ` Dave Chinner
  -1 siblings, 0 replies; 81+ messages in thread
From: Dave Chinner @ 2013-12-10  1:38 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: dchinner, hannes, mhocko, akpm, linux-kernel, linux-mm, cgroups,
	devel, glommer, glommer, Al Viro

On Mon, Dec 09, 2013 at 12:05:50PM +0400, Vladimir Davydov wrote:
> We are going to make the FS shrinker memcg-aware. To achieve that, we
> will have to pass the memcg to scan to the nr_cached_objects and
> free_cached_objects VFS methods, which currently take only the NUMA node
> to scan. Since the shrink_control structure already holds the node, and
> the memcg to scan will be added to it as we introduce memcg-aware
> vmscan, let us consolidate the methods' arguments in this structure to
> keep things clean.
> 
> Thanks to David Chinner for the tip.

Ok, you dealt with this as a separate patch...

> 
> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
> Cc: Glauber Costa <glommer@openvz.org>
> Cc: Dave Chinner <dchinner@redhat.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  fs/super.c         |    8 +++-----
>  fs/xfs/xfs_super.c |    6 +++---
>  include/linux/fs.h |    6 ++++--
>  3 files changed, 10 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index a039dba..8f9a81b 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -76,7 +76,7 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
>  		return SHRINK_STOP;
>  
>  	if (sb->s_op->nr_cached_objects)
> -		fs_objects = sb->s_op->nr_cached_objects(sb, sc->nid);
> +		fs_objects = sb->s_op->nr_cached_objects(sb, sc);
>  
>  	inodes = list_lru_count(&sb->s_inode_lru, sc);
>  	dentries = list_lru_count(&sb->s_dentry_lru, sc);
> @@ -96,8 +96,7 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
>  	if (fs_objects) {
>  		fs_objects = mult_frac(sc->nr_to_scan, fs_objects,
>  								total_objects);
> -		freed += sb->s_op->free_cached_objects(sb, fs_objects,
> -						       sc->nid);
> +		freed += sb->s_op->free_cached_objects(sb, sc, fs_objects);
>  	}

Again, pass the number to scan in sc->nr_to_scan, please.
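
i.e. with the count carried in the shrink_control, the two super_operations
methods reduce to something like this (sketch only - parameter names are
just for illustration):

	long (*nr_cached_objects)(struct super_block *sb,
				  struct shrink_control *sc);
	long (*free_cached_objects)(struct super_block *sb,
				    struct shrink_control *sc);

and the caller simply sets sc->nr_to_scan before calling
->free_cached_objects(), as in the super_cache_scan() snippet earlier in
the thread.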

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v13 10/16] vmscan: shrink slab on memcg pressure
  2013-12-09  8:05   ` Vladimir Davydov
@ 2013-12-10  2:11     ` Dave Chinner
  -1 siblings, 0 replies; 81+ messages in thread
From: Dave Chinner @ 2013-12-10  2:11 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: dchinner, hannes, mhocko, akpm, linux-kernel, linux-mm, cgroups,
	devel, glommer, glommer, Mel Gorman, Rik van Riel, Al Viro,
	Balbir Singh, KAMEZAWA Hiroyuki

On Mon, Dec 09, 2013 at 12:05:51PM +0400, Vladimir Davydov wrote:
> This patch makes direct reclaim path shrink slab not only on global
> memory pressure, but also when we reach the user memory limit of a
> memcg. To achieve that, it makes shrink_slab() walk over the memcg
> hierarchy and run shrinkers marked as memcg-aware on the target memcg
> and all its descendants. The memcg to scan is passed in a shrink_control
> structure; memcg-unaware shrinkers are still called only on global
> memory pressure with memcg=NULL. It is up to the shrinker how to
> organize the objects it is responsible for to achieve per-memcg reclaim.
> 
> The idea lying behind the patch as well as the initial implementation
> belong to Glauber Costa.
...
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -311,6 +311,58 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
>  	return freed;
>  }
>  
> +static unsigned long
> +run_shrinker(struct shrink_control *shrinkctl, struct shrinker *shrinker,
> +	     unsigned long nr_pages_scanned, unsigned long lru_pages)
> +{
> +	unsigned long freed = 0;
> +
> +	/*
> +	 * If we don't have a target mem cgroup, we scan them all. Otherwise
> +	 * we will limit our scan to shrinkers marked as memcg aware.
> +	 */
> +	if (!(shrinker->flags & SHRINKER_MEMCG_AWARE) &&
> +	    shrinkctl->target_mem_cgroup != NULL)
> +		return 0;
> +	/*
> +	 * In a hierarchical chain, it might be that not all memcgs are kmem
> +	 * active. kmemcg design mandates that when one memcg is active, its
> +	 * children will be active as well. But it is perfectly possible that
> +	 * its parent is not.
> +	 *
> +	 * We also need to make sure we scan at least once, for the global
> +	 * case. So if we don't have a target memcg, we proceed normally and
> +	 * expect to break in the next round.
> +	 */
> +	shrinkctl->memcg = shrinkctl->target_mem_cgroup;
> +	do {
> +		if (shrinkctl->memcg && !memcg_kmem_is_active(shrinkctl->memcg))
> +			goto next;
> +
> +		if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) {
> +			shrinkctl->nid = 0;
> +			freed += shrink_slab_node(shrinkctl, shrinker,
> +					nr_pages_scanned, lru_pages);
> +			goto next;
> +		}
> +
> +		for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
> +			if (node_online(shrinkctl->nid))
> +				freed += shrink_slab_node(shrinkctl, shrinker,
> +						nr_pages_scanned, lru_pages);
> +
> +		}
> +next:
> +		if (!(shrinker->flags & SHRINKER_MEMCG_AWARE))
> +			break;
> +		shrinkctl->memcg = mem_cgroup_iter(shrinkctl->target_mem_cgroup,
> +						   shrinkctl->memcg, NULL);
> +	} while (shrinkctl->memcg);
> +
> +	return freed;
> +}

Ok, I think we need to improve the abstraction here, because I find
this quite messy and it is hard to follow the code flow differences
between memcg and non-memcg shrinker invocations...

> +
>  /*
>   * Call the shrink functions to age shrinkable caches
>   *
> @@ -352,20 +404,10 @@ unsigned long shrink_slab(struct shrink_control *shrinkctl,
>  	}
>  
>  	list_for_each_entry(shrinker, &shrinker_list, list) {
> -		if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) {
> -			shrinkctl->nid = 0;
> -			freed += shrink_slab_node(shrinkctl, shrinker,
> -					nr_pages_scanned, lru_pages);
> -			continue;
> -		}
> -
> -		for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
> -			if (node_online(shrinkctl->nid))
> -				freed += shrink_slab_node(shrinkctl, shrinker,
> -						nr_pages_scanned, lru_pages);
> -
> -		}

This code is the "run_shrinker()" helper function, not the entire
memcg loop.

> +		freed += run_shrinker(shrinkctl, shrinker,
> +				      nr_pages_scanned, lru_pages);
>  	}

i.e. the shrinker execution control loop becomes much clearer if
we separate the memcg and non-memcg shrinker execution from the
node awareness of the shrinker like so:

	list_for_each_entry(shrinker, &shrinker_list, list) {

		/*
		 * If we aren't doing targeted memcg shrinking, then run
		 * the shrinker with a global context and move on.
		 */
		if (!shrinkctl->target_mem_cgroup) {
			freed += run_shrinker(shrinkctl, shrinker,
					      nr_pages_scanned, lru_pages);
			continue;
		}

		if (!(shrinker->flags & SHRINKER_MEMCG_AWARE))
			continue;

		/*
		 * memcg shrinking: Iterate the target memcg hierarchy
		 * and run the shrinker on each memcg context that
		 * is found in the hierarchy.
		 */
		shrinkctl->memcg = shrinkctl->target_mem_cgroup;
		do {
			if (!memcg_kmem_is_active(shrinkctl->memcg))
				continue;

			freed += run_shrinker(shrinkctl, shrinker,
					      nr_pages_scanned, lru_pages);
		} while ((shrinkctl->memcg =
				mem_cgroup_iter(shrinkctl->target_mem_cgroup,
						shrinkctl->memcg, NULL)));
	}

That makes the code much easier to read and clearly demonstrates the
differences between non-memcg and memcg shrinking contexts, and
separates them cleanly from the shrinker implementation.  IMO,
that's much nicer than trying to handle all contexts in the one
do-while loop.
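
With that loop in place, run_shrinker() itself collapses to just the
NUMA-awareness handling - i.e. something like this (sketch only, untested,
reusing the code this patch lifts out of shrink_slab()):

static unsigned long
run_shrinker(struct shrink_control *shrinkctl, struct shrinker *shrinker,
	     unsigned long nr_pages_scanned, unsigned long lru_pages)
{
	unsigned long freed = 0;

	if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) {
		/* non-NUMA-aware shrinkers always scan as node 0 */
		shrinkctl->nid = 0;
		return shrink_slab_node(shrinkctl, shrinker,
					nr_pages_scanned, lru_pages);
	}

	for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
		if (node_online(shrinkctl->nid))
			freed += shrink_slab_node(shrinkctl, shrinker,
					nr_pages_scanned, lru_pages);
	}
	return freed;
}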

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v13 12/16] fs: mark list_lru based shrinkers memcg aware
  2013-12-09  8:05   ` Vladimir Davydov
@ 2013-12-10  4:17     ` Dave Chinner
  -1 siblings, 0 replies; 81+ messages in thread
From: Dave Chinner @ 2013-12-10  4:17 UTC (permalink / raw)
  To: Vladimir Davydov, Steven Whitehouse
  Cc: dchinner, hannes, mhocko, akpm, linux-kernel, linux-mm, cgroups,
	devel, glommer, glommer, Al Viro

On Mon, Dec 09, 2013 at 12:05:53PM +0400, Vladimir Davydov wrote:
> Since now list_lru automatically distributes objects among per-memcg
> lists and list_lru_{count,walk} employ information passed in the
> shrink_control argument to scan appropriate list, all shrinkers that
> keep objects in the list_lru structure can already work as memcg-aware.
> Let us mark them so.
> 
> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
> Cc: Glauber Costa <glommer@openvz.org>
> Cc: Dave Chinner <dchinner@redhat.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> ---
>  fs/gfs2/quota.c  |    2 +-
>  fs/super.c       |    2 +-
>  fs/xfs/xfs_buf.c |    2 +-
>  fs/xfs/xfs_qm.c  |    2 +-
>  4 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
> index f0435da..6cf6114 100644
> --- a/fs/gfs2/quota.c
> +++ b/fs/gfs2/quota.c
> @@ -150,7 +150,7 @@ struct shrinker gfs2_qd_shrinker = {
>  	.count_objects = gfs2_qd_shrink_count,
>  	.scan_objects = gfs2_qd_shrink_scan,
>  	.seeks = DEFAULT_SEEKS,
> -	.flags = SHRINKER_NUMA_AWARE,
> +	.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE,
>  };

I'll leave it for Steve to have the final say, but this cache tracks
objects that have contexts that span multiple memcgs (i.e. global
scope) and so is not a candidate for memcg based shrinking.

e.g. a single user can have processes running in multiple concurrent
memcgs, and so the user quota dquot needs to be accessed from all
those memcg contexts. Same for group quota objects - they can span
multiple memcgs that different users have instantiated, simply
because they all belong to the same group and hence are subject to
the group quota accounting.

And for XFS, there's also project quotas, which means you can have
files that are unique to both users and groups, but share the same
project quota and hence span memcgs that way....

> diff --git a/fs/super.c b/fs/super.c
> index 8f9a81b..05bead8 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -219,7 +219,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
>  	s->s_shrink.scan_objects = super_cache_scan;
>  	s->s_shrink.count_objects = super_cache_count;
>  	s->s_shrink.batch = 1024;
> -	s->s_shrink.flags = SHRINKER_NUMA_AWARE;
> +	s->s_shrink.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE;
>  	return s;

OK.

> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 5b2a49c..d8326b6 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1679,7 +1679,7 @@ xfs_alloc_buftarg(
>  	btp->bt_shrinker.count_objects = xfs_buftarg_shrink_count;
>  	btp->bt_shrinker.scan_objects = xfs_buftarg_shrink_scan;
>  	btp->bt_shrinker.seeks = DEFAULT_SEEKS;
> -	btp->bt_shrinker.flags = SHRINKER_NUMA_AWARE;
> +	btp->bt_shrinker.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE;
>  	register_shrinker(&btp->bt_shrinker);
>  	return btp;

This is a cache for XFS metadata buffers, and so is way below the
scope of memcg control. e.g. an inode buffer can hold 32 inodes,
each of which may belong to a different memcg at the VFS inode cache level.
Even if a memcg removes an inode from the VFS cache level, that
buffer is still relevant to 31 other memcg contexts. A similar case
occurs for dquot buffers, and then there's filesystem internal
metadata like AG headers that no memcg has any right to claim
ownership of - they are owned and used solely by the filesystem, and
can be required by *any* memcg in the system to make progress.

i.e. these low level filesystem metadata caches are owned by the
filesystem and are global resources - they will never come under
control of memcg, and none of the memory associated with this cache
should be accounted to a memcg context because of that....

> diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> index aaacf8f..1f9bbb5 100644
> --- a/fs/xfs/xfs_qm.c
> +++ b/fs/xfs/xfs_qm.c
> @@ -903,7 +903,7 @@ xfs_qm_init_quotainfo(
>  	qinf->qi_shrinker.count_objects = xfs_qm_shrink_count;
>  	qinf->qi_shrinker.scan_objects = xfs_qm_shrink_scan;
>  	qinf->qi_shrinker.seeks = DEFAULT_SEEKS;
> -	qinf->qi_shrinker.flags = SHRINKER_NUMA_AWARE;
> +	qinf->qi_shrinker.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE;
>  	register_shrinker(&qinf->qi_shrinker);
>  	return 0;

That's the XFS dquot cache, analogous to the GFS2 dquot
cache I commented on above. Hence, not a candidate for memcg
shrinking.

Remember - caches use list_lru for scalability reasons right now,
but that doesn't automatically mean memcg based shrinking makes
sense for them.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v13 13/16] vmscan: take at least one pass with shrinkers
  2013-12-09  8:05   ` Vladimir Davydov
@ 2013-12-10  4:18     ` Dave Chinner
  -1 siblings, 0 replies; 81+ messages in thread
From: Dave Chinner @ 2013-12-10  4:18 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: dchinner, hannes, mhocko, akpm, linux-kernel, linux-mm, cgroups,
	devel, glommer, glommer, Glauber Costa, Mel Gorman, Rik van Riel

On Mon, Dec 09, 2013 at 12:05:54PM +0400, Vladimir Davydov wrote:
> From: Glauber Costa <glommer@openvz.org>
> 
> In very low free kernel memory situations, it may be the case that we
> have less objects to free than our initial batch size. If this is the
> case, it is better to shrink those, and open space for the new workload
> then to keep them and fail the new allocations.
> 
> In particular, we are concerned with the direct reclaim case for memcg.
> Although this same technique can be applied to other situations just as
> well, we will start conservative and apply it for that case, which is
> the one that matters the most.

This should be at the start of the series.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v13 11/16] mm: list_lru: add per-memcg lists
  2013-12-09  8:05   ` Vladimir Davydov
@ 2013-12-10  5:00     ` Dave Chinner
  -1 siblings, 0 replies; 81+ messages in thread
From: Dave Chinner @ 2013-12-10  5:00 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: dchinner, hannes, mhocko, akpm, linux-kernel, linux-mm, cgroups,
	devel, glommer, glommer, Al Viro, Balbir Singh,
	KAMEZAWA Hiroyuki

On Mon, Dec 09, 2013 at 12:05:52PM +0400, Vladimir Davydov wrote:
> There are several FS shrinkers, including super_block::s_shrink, that
> keep reclaimable objects in the list_lru structure. That said, to turn
> them to memcg-aware shrinkers, it is enough to make list_lru per-memcg.
> 
> This patch does the trick. It adds an array of LRU lists to the list_lru
> structure, one for each kmem-active memcg, and dispatches every item
> addition or removal operation to the list corresponding to the memcg the
> item is accounted to.
> 
> Since we already pass a shrink_control object to count and walk list_lru
> functions to specify the NUMA node to scan, and the target memcg is held
> in this structure, there is no need in changing the list_lru interface.
> 
> The idea lying behind the patch as well as the initial implementation
> belong to Glauber Costa.
> 
> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
> Cc: Glauber Costa <glommer@openvz.org>
> Cc: Dave Chinner <dchinner@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Balbir Singh <bsingharora@gmail.com>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/list_lru.h   |   44 +++++++-
>  include/linux/memcontrol.h |   13 +++
>  mm/list_lru.c              |  242 ++++++++++++++++++++++++++++++++++++++------
>  mm/memcontrol.c            |  158 ++++++++++++++++++++++++++++-
>  4 files changed, 416 insertions(+), 41 deletions(-)
> 
> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
> index 34e57af..e8add3d 100644
> --- a/include/linux/list_lru.h
> +++ b/include/linux/list_lru.h
> @@ -28,11 +28,47 @@ struct list_lru_node {
>  	long			nr_items;
>  } ____cacheline_aligned_in_smp;
>  
> +struct list_lru_one {
> +	struct list_lru_node *node;
> +	nodemask_t active_nodes;
> +};
> +
>  struct list_lru {
> -	struct list_lru_node	*node;
> -	nodemask_t		active_nodes;
> +	struct list_lru_one	global;
> +#ifdef CONFIG_MEMCG_KMEM
> +	/*
> +	 * In order to provide ability of scanning objects from different
> +	 * memory cgroups independently, we keep a separate LRU list for each
> +	 * kmem-active memcg in this array. The array is RCU-protected and
> +	 * indexed by memcg_cache_id().
> +	 */
> +	struct list_lru_one	**memcg;

OK, as far as I can tell, this is introducing per-node, per-memcg
LRU lists. Is that correct?

If so, then that is not what Glauber and I originally intended for
memcg LRUs. per-node LRUs are expensive in terms of memory and cross
multiplying them by the number of memcgs in a system was not a good
use of memory.

According to Glauber, most memcgs are small and typically confined
to a single node or two by external means and therefore don't need the
scalability numa aware LRUs provide. Hence the idea was that the
memcg LRUs would just be a single LRU list, just like a non-numa
aware list_lru instantiation. IOWs, this is the structure that we
had decided on as the best compromise between memory usage,
complexity and memcg awareness:

	global list --- node 0 lru
			node 1 lru
			.....
			node nr_nodes lru
	memcg lists	memcg 0 lru
			memcg 1 lru
			.....
			memcg nr_memcgs lru

and the LRU code internally would select either a node or memcg LRU
to iterate based on the scan information coming in from the
shrinker. i.e.:


struct list_lru {
	struct list_lru_node	*node;
	nodemask_t		active_nodes;
#ifdef MEMCG
	struct list_lru_node	**memcg;
	....
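
so that selecting the list to operate on inside list_lru collapses to a
single helper - something like this (sketch only; lru_node_of() is just a
name for illustration, sc->memcg and memcg_cache_id() are as used elsewhere
in this series):

static struct list_lru_node *
lru_node_of(struct list_lru *lru, struct shrink_control *sc)
{
#ifdef CONFIG_MEMCG_KMEM
	/* memcg LRUs are single lists in this layout, not per-node */
	if (sc->memcg)
		return lru->memcg[memcg_cache_id(sc->memcg)];
#endif
	return &lru->node[sc->nid];
}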


>  bool list_lru_add(struct list_lru *lru, struct list_head *item)
>  {
> -	int nid = page_to_nid(virt_to_page(item));
> -	struct list_lru_node *nlru = &lru->node[nid];
> +	struct page *page = virt_to_page(item);
> +	int nid = page_to_nid(page);
> +	struct list_lru_one *olru = lru_of_page(lru, page);
> +	struct list_lru_node *nlru = &olru->node[nid];

Yeah, that's per-memcg, per-node dereferencing. And, FWIW, olru/nlru
are bad names - that's the convention typically used for "old <foo>"
and "new <foo>" pointers....

As it is, it shouldn't be necessary - lru_of_page() should just
return a struct list_lru_node....
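
i.e. fold the node dereference into the lookup itself - something like this
(sketch only; lru_node_of_page() is a made-up name):

static inline struct list_lru_node *
lru_node_of_page(struct list_lru *lru, struct page *page)
{
	/*
	 * Same memcg lookup lru_of_page() does in this patch, but hand
	 * back the per-node list directly.
	 */
	return &lru_of_page(lru, page)->node[page_to_nid(page)];
}

and then list_lru_add()/list_lru_del() just do
nlru = lru_node_of_page(lru, virt_to_page(item)) and carry on as before.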

> +int list_lru_init(struct list_lru *lru)
> +{
> +	int err;
> +
> +	err = list_lru_init_one(&lru->global);
> +	if (err)
> +		goto fail;
> +
> +	err = memcg_list_lru_init(lru);
> +	if (err)
> +		goto fail;
> +
> +	return 0;
> +fail:
> +	list_lru_destroy_one(&lru->global);
> +	lru->global.node = NULL; /* see list_lru_destroy() */
> +	return err;
> +}

I suspect we have users of list_lru that don't want memcg bits added
to them. Hence I think we want to leave list_lru_init() alone and
add a list_lru_init_memcg() variant that makes the LRU memcg aware.
i.e. if the shrinker is not going to be memcg aware, then we don't
want the LRU to be memcg aware, either....
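
i.e. keep list_lru_init() as it is today and add an opt-in variant, reusing
the helpers from this patch - something like (sketch only, untested):

int list_lru_init(struct list_lru *lru)
{
	/* no per-memcg lists - plain per-node LRU as before */
	return list_lru_init_one(&lru->global);
}
EXPORT_SYMBOL_GPL(list_lru_init);

int list_lru_init_memcg(struct list_lru *lru)
{
	int err;

	err = list_lru_init(lru);
	if (err)
		return err;

	err = memcg_list_lru_init(lru);
	if (err) {
		list_lru_destroy_one(&lru->global);
		lru->global.node = NULL;	/* see list_lru_destroy() */
	}
	return err;
}
EXPORT_SYMBOL_GPL(list_lru_init_memcg);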

>  EXPORT_SYMBOL_GPL(list_lru_init);
>  
>  void list_lru_destroy(struct list_lru *lru)
>  {
> -	kfree(lru->node);
> +	/*
> +	 * It is common throughout the kernel source tree to call the
> +	 * destructor on a zeroed out object that has not been initialized or
> +	 * whose initialization failed, because it greatly simplifies fail
> +	 * paths. Once the list_lru structure was implemented, its destructor
> +	 * consisted of the only call to kfree() and thus conformed to the
> +	 * rule, but as it growed, it became more complex so that calling
> +	 * destructor on an uninitialized object would be a bug. To preserve
> +	 * backward compatibility, we explicitly exit the destructor if the
> +	 * object seems to be uninitialized.
> +	 */

We don't need an essay here. Something as simple as:

	/*
	 * We might be called after partial initialisation (e.g. due to
	 * ENOMEM error) so handle that appropriately.
	 */
> +	if (!lru->global.node)
> +		return;
> +
> +	list_lru_destroy_one(&lru->global);
> +	memcg_list_lru_destroy(lru);
>  }
>  EXPORT_SYMBOL_GPL(list_lru_destroy);
> +
> +#ifdef CONFIG_MEMCG_KMEM
> +int list_lru_memcg_alloc(struct list_lru *lru, int memcg_id)
> +{
> +	int err;
> +	struct list_lru_one *olru;
> +
> +	olru = kmalloc(sizeof(*olru), GFP_KERNEL);
> +	if (!olru)
> +		return -ENOMEM;
> +
> +	err = list_lru_init_one(olru);
> +	if (err) {
> +		kfree(olru);
> +		return err;
> +	}
> +
> +	VM_BUG_ON(lru->memcg[memcg_id]);
> +	lru->memcg[memcg_id] = olru;
> +	return 0;
> +}
> +
> +void list_lru_memcg_free(struct list_lru *lru, int memcg_id)
> +{
> +	struct list_lru_one *olru;
> +
> +	olru = lru->memcg[memcg_id];
> +	if (olru) {
> +		list_lru_destroy_one(olru);
> +		kfree(olru);
> +		lru->memcg[memcg_id] = NULL;
> +	}
> +}
> +
> +int list_lru_grow_memcg(struct list_lru *lru, size_t new_array_size)
> +{
> +	int i;
> +	struct list_lru_one **memcg_lrus;
> +
> +	memcg_lrus = kcalloc(new_array_size, sizeof(*memcg_lrus), GFP_KERNEL);
> +	if (!memcg_lrus)
> +		return -ENOMEM;
> +
> +	if (lru->memcg) {
> +		for_each_memcg_cache_index(i) {
> +			if (lru->memcg[i])
> +				memcg_lrus[i] = lru->memcg[i];
> +		}
> +	}

Um, krealloc()?


> +/*
> + * This function allocates LRUs for a memcg in all list_lru structures. It is
> + * called under memcg_create_mutex when a new kmem-active memcg is added.
> + */
> +static int memcg_init_all_lrus(int new_memcg_id)
> +{
> +	int err = 0;
> +	int num_memcgs = new_memcg_id + 1;
> +	int grow = (num_memcgs > memcg_limited_groups_array_size);
> +	size_t new_array_size = memcg_caches_array_size(num_memcgs);
> +	struct list_lru *lru;
> +
> +	if (grow) {
> +		list_for_each_entry(lru, &all_memcg_lrus, list) {
> +			err = list_lru_grow_memcg(lru, new_array_size);
> +			if (err)
> +				goto out;
> +		}
> +	}
> +
> +	list_for_each_entry(lru, &all_memcg_lrus, list) {
> +		err = list_lru_memcg_alloc(lru, new_memcg_id);
> +		if (err) {
> +			__memcg_destroy_all_lrus(new_memcg_id);
> +			break;
> +		}
> +	}
> +out:
> +	if (grow) {
> +		synchronize_rcu();
> +		list_for_each_entry(lru, &all_memcg_lrus, list) {
> +			kfree(lru->memcg_old);
> +			lru->memcg_old = NULL;
> +		}
> +	}
> +	return err;
> +}

Urk. That won't scale very well.

> +
> +int memcg_list_lru_init(struct list_lru *lru)
> +{
> +	int err = 0;
> +	int i;
> +	struct mem_cgroup *memcg;
> +
> +	lru->memcg = NULL;
> +	lru->memcg_old = NULL;
> +
> +	mutex_lock(&memcg_create_mutex);
> +	if (!memcg_kmem_enabled())
> +		goto out_list_add;
> +
> +	lru->memcg = kcalloc(memcg_limited_groups_array_size,
> +			     sizeof(*lru->memcg), GFP_KERNEL);
> +	if (!lru->memcg) {
> +		err = -ENOMEM;
> +		goto out;
> +	}
> +
> +	for_each_mem_cgroup(memcg) {
> +		int memcg_id;
> +
> +		memcg_id = memcg_cache_id(memcg);
> +		if (memcg_id < 0)
> +			continue;
> +
> +		err = list_lru_memcg_alloc(lru, memcg_id);
> +		if (err) {
> +			mem_cgroup_iter_break(NULL, memcg);
> +			goto out_free_lru_memcg;
> +		}
> +	}
> +out_list_add:
> +	list_add(&lru->list, &all_memcg_lrus);
> +out:
> +	mutex_unlock(&memcg_create_mutex);
> +	return err;
> +
> +out_free_lru_memcg:
> +	for (i = 0; i < memcg_limited_groups_array_size; i++)
> +		list_lru_memcg_free(lru, i);
> +	kfree(lru->memcg);
> +	goto out;
> +}

That will probably scale even worse. Think about what happens when we
try to mount a bunch of filesystems in parallel - they will now
serialise completely on this memcg_create_mutex instantiating memcg
lists inside list_lru_init().

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v13 11/16] mm: list_lru: add per-memcg lists
@ 2013-12-10  5:00     ` Dave Chinner
  0 siblings, 0 replies; 81+ messages in thread
From: Dave Chinner @ 2013-12-10  5:00 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: dchinner, hannes, mhocko, akpm, linux-kernel, linux-mm, cgroups,
	devel, glommer, glommer, Al Viro, Balbir Singh,
	KAMEZAWA Hiroyuki

On Mon, Dec 09, 2013 at 12:05:52PM +0400, Vladimir Davydov wrote:
> There are several FS shrinkers, including super_block::s_shrink, that
> keep reclaimable objects in the list_lru structure. That said, to turn
> them to memcg-aware shrinkers, it is enough to make list_lru per-memcg.
> 
> This patch does the trick. It adds an array of LRU lists to the list_lru
> structure, one for each kmem-active memcg, and dispatches every item
> addition or removal operation to the list corresponding to the memcg the
> item is accounted to.
> 
> Since we already pass a shrink_control object to count and walk list_lru
> functions to specify the NUMA node to scan, and the target memcg is held
> in this structure, there is no need in changing the list_lru interface.
> 
> The idea lying behind the patch as well as the initial implementation
> belong to Glauber Costa.
> 
> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
> Cc: Glauber Costa <glommer@openvz.org>
> Cc: Dave Chinner <dchinner@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Balbir Singh <bsingharora@gmail.com>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> ---
>  include/linux/list_lru.h   |   44 +++++++-
>  include/linux/memcontrol.h |   13 +++
>  mm/list_lru.c              |  242 ++++++++++++++++++++++++++++++++++++++------
>  mm/memcontrol.c            |  158 ++++++++++++++++++++++++++++-
>  4 files changed, 416 insertions(+), 41 deletions(-)
> 
> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
> index 34e57af..e8add3d 100644
> --- a/include/linux/list_lru.h
> +++ b/include/linux/list_lru.h
> @@ -28,11 +28,47 @@ struct list_lru_node {
>  	long			nr_items;
>  } ____cacheline_aligned_in_smp;
>  
> +struct list_lru_one {
> +	struct list_lru_node *node;
> +	nodemask_t active_nodes;
> +};
> +
>  struct list_lru {
> -	struct list_lru_node	*node;
> -	nodemask_t		active_nodes;
> +	struct list_lru_one	global;
> +#ifdef CONFIG_MEMCG_KMEM
> +	/*
> +	 * In order to provide ability of scanning objects from different
> +	 * memory cgroups independently, we keep a separate LRU list for each
> +	 * kmem-active memcg in this array. The array is RCU-protected and
> +	 * indexed by memcg_cache_id().
> +	 */
> +	struct list_lru_one	**memcg;

OK, as far as I can tell, this is introducing a per-node, per-memcg
LRU lists. Is that correct?

If so, then that is not what Glauber and I originally intended for
memcg LRUs. per-node LRUs are expensive in terms of memory and cross
multiplying them by the number of memcgs in a system was not a good
use of memory.

According to Glauber, most memcgs are small and typically confined
to a single node or two by external means and therefore don't need the
scalability numa aware LRUs provide. Hence the idea was that the
memcg LRUs would just be a single LRU list, just like a non-numa
aware list_lru instantiation. IOWs, this is the structure that we
had decided on as the best compromise between memory usage,
complexity and memcg awareness:

	global list --- node 0 lru
			node 1 lru
			.....
			node nr_nodes lru
	memcg lists	memcg 0 lru
			memcg 1 lru
			.....
			memcg nr_memcgs lru

and the LRU code internally would select either a node or memcg LRU
to iterated based on the scan information coming in from the
shrinker. i.e.:


struct list_lru {
	struct list_lru_node	*node;
	nodemask_t		active_nodes;
#ifdef MEMCG
	struct list_lru_node	**memcg;
	....


>  bool list_lru_add(struct list_lru *lru, struct list_head *item)
>  {
> -	int nid = page_to_nid(virt_to_page(item));
> -	struct list_lru_node *nlru = &lru->node[nid];
> +	struct page *page = virt_to_page(item);
> +	int nid = page_to_nid(page);
> +	struct list_lru_one *olru = lru_of_page(lru, page);
> +	struct list_lru_node *nlru = &olru->node[nid];

Yeah, that's per-memcg, per-node dereferencing. And, FWIW, olru/nlru
are bad names - that's the convention typically used for "old <foo>"
and "new <foo>" pointers....

As it is, it shouldn't be necessary - lru_of_page() should just
return a struct list_lru_node....

> +int list_lru_init(struct list_lru *lru)
> +{
> +	int err;
> +
> +	err = list_lru_init_one(&lru->global);
> +	if (err)
> +		goto fail;
> +
> +	err = memcg_list_lru_init(lru);
> +	if (err)
> +		goto fail;
> +
> +	return 0;
> +fail:
> +	list_lru_destroy_one(&lru->global);
> +	lru->global.node = NULL; /* see list_lru_destroy() */
> +	return err;
> +}

I suspect we have users of list_lru that don't want memcg bits added
to them. Hence I think we want to leave list_lru_init() alone and
add a list_lru_init_memcg() variant that makes the LRU memcg aware.
i.e. if the shrinker is not going to be memcg aware, then we don't
want the LRU to be memcg aware, either....

>  EXPORT_SYMBOL_GPL(list_lru_init);
>  
>  void list_lru_destroy(struct list_lru *lru)
>  {
> -	kfree(lru->node);
> +	/*
> +	 * It is common throughout the kernel source tree to call the
> +	 * destructor on a zeroed out object that has not been initialized or
> +	 * whose initialization failed, because it greatly simplifies fail
> +	 * paths. Once the list_lru structure was implemented, its destructor
> +	 * consisted of the only call to kfree() and thus conformed to the
> +	 * rule, but as it growed, it became more complex so that calling
> +	 * destructor on an uninitialized object would be a bug. To preserve
> +	 * backward compatibility, we explicitly exit the destructor if the
> +	 * object seems to be uninitialized.
> +	 */

We don't need an essay here. somethign a simple as:

	/*
	 * We might be called after partial initialisation (e.g. due to
	 * ENOMEM error) so handle that appropriately.
	 */
> +	if (!lru->global.node)
> +		return;
> +
> +	list_lru_destroy_one(&lru->global);
> +	memcg_list_lru_destroy(lru);
>  }
>  EXPORT_SYMBOL_GPL(list_lru_destroy);
> +
> +#ifdef CONFIG_MEMCG_KMEM
> +int list_lru_memcg_alloc(struct list_lru *lru, int memcg_id)
> +{
> +	int err;
> +	struct list_lru_one *olru;
> +
> +	olru = kmalloc(sizeof(*olru), GFP_KERNEL);
> +	if (!olru)
> +		return -ENOMEM;
> +
> +	err = list_lru_init_one(olru);
> +	if (err) {
> +		kfree(olru);
> +		return err;
> +	}
> +
> +	VM_BUG_ON(lru->memcg[memcg_id]);
> +	lru->memcg[memcg_id] = olru;
> +	return 0;
> +}
> +
> +void list_lru_memcg_free(struct list_lru *lru, int memcg_id)
> +{
> +	struct list_lru_one *olru;
> +
> +	olru = lru->memcg[memcg_id];
> +	if (olru) {
> +		list_lru_destroy_one(olru);
> +		kfree(olru);
> +		lru->memcg[memcg_id] = NULL;
> +	}
> +}
> +
> +int list_lru_grow_memcg(struct list_lru *lru, size_t new_array_size)
> +{
> +	int i;
> +	struct list_lru_one **memcg_lrus;
> +
> +	memcg_lrus = kcalloc(new_array_size, sizeof(*memcg_lrus), GFP_KERNEL);
> +	if (!memcg_lrus)
> +		return -ENOMEM;
> +
> +	if (lru->memcg) {
> +		for_each_memcg_cache_index(i) {
> +			if (lru->memcg[i])
> +				memcg_lrus[i] = lru->memcg[i];
> +		}
> +	}

Um, krealloc()?


> +/*
> + * This function allocates LRUs for a memcg in all list_lru structures. It is
> + * called under memcg_create_mutex when a new kmem-active memcg is added.
> + */
> +static int memcg_init_all_lrus(int new_memcg_id)
> +{
> +	int err = 0;
> +	int num_memcgs = new_memcg_id + 1;
> +	int grow = (num_memcgs > memcg_limited_groups_array_size);
> +	size_t new_array_size = memcg_caches_array_size(num_memcgs);
> +	struct list_lru *lru;
> +
> +	if (grow) {
> +		list_for_each_entry(lru, &all_memcg_lrus, list) {
> +			err = list_lru_grow_memcg(lru, new_array_size);
> +			if (err)
> +				goto out;
> +		}
> +	}
> +
> +	list_for_each_entry(lru, &all_memcg_lrus, list) {
> +		err = list_lru_memcg_alloc(lru, new_memcg_id);
> +		if (err) {
> +			__memcg_destroy_all_lrus(new_memcg_id);
> +			break;
> +		}
> +	}
> +out:
> +	if (grow) {
> +		synchronize_rcu();
> +		list_for_each_entry(lru, &all_memcg_lrus, list) {
> +			kfree(lru->memcg_old);
> +			lru->memcg_old = NULL;
> +		}
> +	}
> +	return err;
> +}

Urk. That won't scale very well.

> +
> +int memcg_list_lru_init(struct list_lru *lru)
> +{
> +	int err = 0;
> +	int i;
> +	struct mem_cgroup *memcg;
> +
> +	lru->memcg = NULL;
> +	lru->memcg_old = NULL;
> +
> +	mutex_lock(&memcg_create_mutex);
> +	if (!memcg_kmem_enabled())
> +		goto out_list_add;
> +
> +	lru->memcg = kcalloc(memcg_limited_groups_array_size,
> +			     sizeof(*lru->memcg), GFP_KERNEL);
> +	if (!lru->memcg) {
> +		err = -ENOMEM;
> +		goto out;
> +	}
> +
> +	for_each_mem_cgroup(memcg) {
> +		int memcg_id;
> +
> +		memcg_id = memcg_cache_id(memcg);
> +		if (memcg_id < 0)
> +			continue;
> +
> +		err = list_lru_memcg_alloc(lru, memcg_id);
> +		if (err) {
> +			mem_cgroup_iter_break(NULL, memcg);
> +			goto out_free_lru_memcg;
> +		}
> +	}
> +out_list_add:
> +	list_add(&lru->list, &all_memcg_lrus);
> +out:
> +	mutex_unlock(&memcg_create_mutex);
> +	return err;
> +
> +out_free_lru_memcg:
> +	for (i = 0; i < memcg_limited_groups_array_size; i++)
> +		list_lru_memcg_free(lru, i);
> +	kfree(lru->memcg);
> +	goto out;
> +}

That will probably scale even worse. Think about what happens when we
try to mount a bunch of filesystems in parallel - they will now
serialise completely on this memcg_create_mutex instantiating memcg
lists inside list_lru_init().

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v13 00/16] kmemcg shrinkers
  2013-12-09  8:05 ` Vladimir Davydov
@ 2013-12-10  8:02   ` Glauber Costa
  -1 siblings, 0 replies; 81+ messages in thread
From: Glauber Costa @ 2013-12-10  8:02 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: dchinner, Johannes Weiner, Michal Hocko, Andrew Morton, LKML,
	linux-mm, cgroups, devel, Glauber Costa

> Please note that in contrast to previous versions this patch-set implements
> slab shrinking only when we hit the user memory limit so that kmem allocations
> will still fail if we are below the user memory limit, but close to the kmem
> limit. This is, because the implementation of kmem-only reclaim was rather
> incomplete - we had to fail GFP_NOFS allocations since everything we could
> reclaim was only FS data. I will try to improve this and send in a separate
> patch-set, but currently it is only worthwhile setting the kmem limit to be
> greater than the user mem limit just to enable per-memcg slab accounting and
> reclaim.

That is unfortunate, but it makes sense as a first step.



-- 
E Mare, Libertas

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v13 04/16] memcg: move memcg_caches_array_size() function
  2013-12-09  8:05   ` Vladimir Davydov
@ 2013-12-10  8:04     ` Glauber Costa
  -1 siblings, 0 replies; 81+ messages in thread
From: Glauber Costa @ 2013-12-10  8:04 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: dchinner, Johannes Weiner, Michal Hocko, Andrew Morton, LKML,
	linux-mm, cgroups, devel, Glauber Costa, Balbir Singh,
	KAMEZAWA Hiroyuki

On Mon, Dec 9, 2013 at 12:05 PM, Vladimir Davydov
<vdavydov@parallels.com> wrote:
> I need to move this up a bit, and I am doing in a separate patch just to
> reduce churn in the patch that needs it.
>
> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Reviewed-by: Glauber Costa <glommer@openvz.org>

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v13 05/16] vmscan: move call to shrink_slab() to shrink_zones()
  2013-12-09  8:05   ` Vladimir Davydov
@ 2013-12-10  8:10     ` Glauber Costa
  -1 siblings, 0 replies; 81+ messages in thread
From: Glauber Costa @ 2013-12-10  8:10 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: dchinner, Johannes Weiner, Michal Hocko, Andrew Morton, LKML,
	linux-mm, cgroups, devel, Glauber Costa, Mel Gorman,
	Rik van Riel

On Mon, Dec 9, 2013 at 12:05 PM, Vladimir Davydov
<vdavydov@parallels.com> wrote:
> This reduces the indentation level of do_try_to_free_pages() and removes
> extra loop over all eligible zones counting the number of on-LRU pages.
>

Looks correct to me.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v13 14/16] vmpressure: in-kernel notifications
  2013-12-09  8:05   ` Vladimir Davydov
@ 2013-12-10  8:12     ` Glauber Costa
  -1 siblings, 0 replies; 81+ messages in thread
From: Glauber Costa @ 2013-12-10  8:12 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: dchinner, Johannes Weiner, Michal Hocko, Andrew Morton, LKML,
	linux-mm, cgroups, devel, Glauber Costa, John Stultz,
	Joonsoo Kim, Kamezawa Hiroyuki

On Mon, Dec 9, 2013 at 12:05 PM, Vladimir Davydov
<vdavydov@parallels.com> wrote:
> From: Glauber Costa <glommer@openvz.org>
>
> During the past weeks, it became clear to us that the shrinker interface

It has been more than a few weeks by now =)

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v13 11/16] mm: list_lru: add per-memcg lists
  2013-12-10  5:00     ` Dave Chinner
@ 2013-12-10 10:05       ` Vladimir Davydov
  -1 siblings, 0 replies; 81+ messages in thread
From: Vladimir Davydov @ 2013-12-10 10:05 UTC (permalink / raw)
  To: Dave Chinner
  Cc: dchinner, hannes, mhocko, akpm, linux-kernel, linux-mm, cgroups,
	devel, glommer, glommer, Al Viro, Balbir Singh,
	KAMEZAWA Hiroyuki

Hi, David

First of all, let me thank you for such a thorough review. It is really
helpful. As usual, I can't help agreeing with most of your comments, but
there are a couple of things I'd like to clarify. Please see my comments
inline.

On 12/10/2013 09:00 AM, Dave Chinner wrote:
> On Mon, Dec 09, 2013 at 12:05:52PM +0400, Vladimir Davydov wrote:
>> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
>> index 34e57af..e8add3d 100644
>> --- a/include/linux/list_lru.h
>> +++ b/include/linux/list_lru.h
>> @@ -28,11 +28,47 @@ struct list_lru_node {
>>  	long			nr_items;
>>  } ____cacheline_aligned_in_smp;
>>  
>> +struct list_lru_one {
>> +	struct list_lru_node *node;
>> +	nodemask_t active_nodes;
>> +};
>> +
>>  struct list_lru {
>> -	struct list_lru_node	*node;
>> -	nodemask_t		active_nodes;
>> +	struct list_lru_one	global;
>> +#ifdef CONFIG_MEMCG_KMEM
>> +	/*
>> +	 * In order to provide ability of scanning objects from different
>> +	 * memory cgroups independently, we keep a separate LRU list for each
>> +	 * kmem-active memcg in this array. The array is RCU-protected and
>> +	 * indexed by memcg_cache_id().
>> +	 */
>> +	struct list_lru_one	**memcg;
> OK, as far as I can tell, this is introducing a per-node, per-memcg
> LRU lists. Is that correct?

Yes, it is.

> If so, then that is not what Glauber and I originally intended for
> memcg LRUs. per-node LRUs are expensive in terms of memory and cross
> multiplying them by the number of memcgs in a system was not a good
> use of memory.

Unfortunately, I did not speak to Glauber about this. I only saw the
last version he tried to submit and the code from his tree. There,
list_lru is implemented as a per-memcg, per-node matrix.

> According to Glauber, most memcgs are small and typically confined
> to a single node or two by external means and therefore don't need the
> scalability numa aware LRUs provide. Hence the idea was that the
> memcg LRUs would just be a single LRU list, just like a non-numa
> aware list_lru instantiation. IOWs, this is the structure that we
> had decided on as the best compromise between memory usage,
> complexity and memcg awareness:
>
> 	global list --- node 0 lru
> 			node 1 lru
> 			.....
> 			node nr_nodes lru
> 	memcg lists	memcg 0 lru
> 			memcg 1 lru
> 			.....
> 			memcg nr_memcgs lru
>
> and the LRU code internally would select either a node or memcg LRU
> to iterated based on the scan information coming in from the
> shrinker. i.e.:
>
>
> struct list_lru {
> 	struct list_lru_node	*node;
> 	nodemask_t		active_nodes;
> #ifdef MEMCG
> 	struct list_lru_node	**memcg;
> 	....

I agree that such a setup would not only reduce memory consumption, but
also make the code look much clearer, removing the ugly "list_lru_one"
and "olru" names I had to introduce. However, it would also make us scan
memcg LRUs more aggressively than the usual NUMA-aware LRUs under global
pressure (I mean the kswapds would scan them on each node). I don't
think that is much of a concern, though, because this is what we had for
all shrinkers before NUMA awareness was introduced. Besides,
prioritizing memcg LRU reclaim over global LRUs sounds sane. That said,
I like this idea. Thanks.
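
To make that concrete, here is a rough sketch of how the selection could
work with the layout quoted above. The shrink_control field names and
the count signature are assumptions based on the rest of the series, not
settled code:

/* Sketch only: pick either a node LRU or a flat per-memcg LRU. */
static struct list_lru_node *lru_node_of_sc(struct list_lru *lru,
                                            struct shrink_control *sc)
{
#ifdef CONFIG_MEMCG_KMEM
        if (sc->memcg)
                return lru->memcg[memcg_cache_id(sc->memcg)];
#endif
        return &lru->node[sc->nid];
}

unsigned long list_lru_count(struct list_lru *lru, struct shrink_control *sc)
{
        struct list_lru_node *nlru = lru_node_of_sc(lru, sc);
        unsigned long count;

        spin_lock(&nlru->lock);
        count = nlru->nr_items;
        spin_unlock(&nlru->lock);
        return count;
}

This is also where the trade-off mentioned above shows up: kswapd on
every node ends up scanning the same flat per-memcg list.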

>>  bool list_lru_add(struct list_lru *lru, struct list_head *item)
>>  {
>> -	int nid = page_to_nid(virt_to_page(item));
>> -	struct list_lru_node *nlru = &lru->node[nid];
>> +	struct page *page = virt_to_page(item);
>> +	int nid = page_to_nid(page);
>> +	struct list_lru_one *olru = lru_of_page(lru, page);
>> +	struct list_lru_node *nlru = &olru->node[nid];
> Yeah, that's per-memcg, per-node dereferencing. And, FWIW, olru/nlru
> are bad names - that's the convention typically used for "old <foo>"
> and "new <foo>" pointers....
>
> As it is, it shouldn't be necessary - lru_of_page() should just
> return a struct list_lru_node....
>
>> +int list_lru_init(struct list_lru *lru)
>> +{
>> +	int err;
>> +
>> +	err = list_lru_init_one(&lru->global);
>> +	if (err)
>> +		goto fail;
>> +
>> +	err = memcg_list_lru_init(lru);
>> +	if (err)
>> +		goto fail;
>> +
>> +	return 0;
>> +fail:
>> +	list_lru_destroy_one(&lru->global);
>> +	lru->global.node = NULL; /* see list_lru_destroy() */
>> +	return err;
>> +}
> I suspect we have users of list_lru that don't want memcg bits added
> to them. Hence I think we want to leave list_lru_init() alone and
> add a list_lru_init_memcg() variant that makes the LRU memcg aware.
> i.e. if the shrinker is not going to be memcg aware, then we don't
> want the LRU to be memcg aware, either....

I thought that we wanted to make all LRUs per-memcg automatically, just
like it was with NUMA awareness. After your explanation about some
FS-specific caches (gfs2/xfs dquot), I admit I was wrong: not all caches
require per-memcg shrinking. I'll add a flag to list_lru_init()
specifying whether we want memcg awareness.
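
As a sketch of what that flag could look like (the flag and wrapper
names below are just placeholders, not a settled API):

/* Sketch only: one init routine, memcg awareness selected by a flag. */
#define LIST_LRU_MEMCG_AWARE    (1U << 0)       /* placeholder flag name */

int __list_lru_init(struct list_lru *lru, unsigned int flags)
{
        int err;

        err = list_lru_init_one(&lru->global);
        if (err)
                return err;

        if (!(flags & LIST_LRU_MEMCG_AWARE))
                return 0;

        err = memcg_list_lru_init(lru);
        if (err) {
                list_lru_destroy_one(&lru->global);
                lru->global.node = NULL;        /* see list_lru_destroy() */
        }
        return err;
}

#define list_lru_init(lru)       __list_lru_init((lru), 0)
#define list_lru_init_memcg(lru) __list_lru_init((lru), LIST_LRU_MEMCG_AWARE)

That way shrinkers that don't care about memcgs, like the gfs2/xfs quota
ones, can keep calling list_lru_init() and stay memcg-unaware.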

>> +int list_lru_grow_memcg(struct list_lru *lru, size_t new_array_size)
>> +{
>> +	int i;
>> +	struct list_lru_one **memcg_lrus;
>> +
>> +	memcg_lrus = kcalloc(new_array_size, sizeof(*memcg_lrus), GFP_KERNEL);
>> +	if (!memcg_lrus)
>> +		return -ENOMEM;
>> +
>> +	if (lru->memcg) {
>> +		for_each_memcg_cache_index(i) {
>> +			if (lru->memcg[i])
>> +				memcg_lrus[i] = lru->memcg[i];
>> +		}
>> +	}
> Um, krealloc()?

Not exactly: we have to keep the old array around until
synchronize_rcu() has been called.
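
To spell the constraint out: readers may still be walking the old array
under RCU, so it can only be freed after a grace period, which
krealloc() cannot provide. A simplified sketch of the pattern follows;
the rcu_assign_pointer() publication step is my assumption, since that
part of the patch is not quoted here:

/* Sketch only: grow the per-memcg array without disturbing RCU readers. */
int list_lru_grow_memcg(struct list_lru *lru, size_t new_array_size)
{
        struct list_lru_one **memcg_lrus, **old_lrus = lru->memcg;
        int i;

        memcg_lrus = kcalloc(new_array_size, sizeof(*memcg_lrus), GFP_KERNEL);
        if (!memcg_lrus)
                return -ENOMEM;

        if (old_lrus) {
                for_each_memcg_cache_index(i)
                        memcg_lrus[i] = old_lrus[i];
        }

        lru->memcg_old = old_lrus;              /* freed after synchronize_rcu() */
        rcu_assign_pointer(lru->memcg, memcg_lrus); /* publish the new array */
        return 0;
}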

>> +/*
>> + * This function allocates LRUs for a memcg in all list_lru structures. It is
>> + * called under memcg_create_mutex when a new kmem-active memcg is added.
>> + */
>> +static int memcg_init_all_lrus(int new_memcg_id)
>> +{
>> +	int err = 0;
>> +	int num_memcgs = new_memcg_id + 1;
>> +	int grow = (num_memcgs > memcg_limited_groups_array_size);
>> +	size_t new_array_size = memcg_caches_array_size(num_memcgs);
>> +	struct list_lru *lru;
>> +
>> +	if (grow) {
>> +		list_for_each_entry(lru, &all_memcg_lrus, list) {
>> +			err = list_lru_grow_memcg(lru, new_array_size);
>> +			if (err)
>> +				goto out;
>> +		}
>> +	}
>> +
>> +	list_for_each_entry(lru, &all_memcg_lrus, list) {
>> +		err = list_lru_memcg_alloc(lru, new_memcg_id);
>> +		if (err) {
>> +			__memcg_destroy_all_lrus(new_memcg_id);
>> +			break;
>> +		}
>> +	}
>> +out:
>> +	if (grow) {
>> +		synchronize_rcu();
>> +		list_for_each_entry(lru, &all_memcg_lrus, list) {
>> +			kfree(lru->memcg_old);
>> +			lru->memcg_old = NULL;
>> +		}
>> +	}
>> +	return err;
>> +}
> Urk. That won't scale very well.
>
>> +
>> +int memcg_list_lru_init(struct list_lru *lru)
>> +{
>> +	int err = 0;
>> +	int i;
>> +	struct mem_cgroup *memcg;
>> +
>> +	lru->memcg = NULL;
>> +	lru->memcg_old = NULL;
>> +
>> +	mutex_lock(&memcg_create_mutex);
>> +	if (!memcg_kmem_enabled())
>> +		goto out_list_add;
>> +
>> +	lru->memcg = kcalloc(memcg_limited_groups_array_size,
>> +			     sizeof(*lru->memcg), GFP_KERNEL);
>> +	if (!lru->memcg) {
>> +		err = -ENOMEM;
>> +		goto out;
>> +	}
>> +
>> +	for_each_mem_cgroup(memcg) {
>> +		int memcg_id;
>> +
>> +		memcg_id = memcg_cache_id(memcg);
>> +		if (memcg_id < 0)
>> +			continue;
>> +
>> +		err = list_lru_memcg_alloc(lru, memcg_id);
>> +		if (err) {
>> +			mem_cgroup_iter_break(NULL, memcg);
>> +			goto out_free_lru_memcg;
>> +		}
>> +	}
>> +out_list_add:
>> +	list_add(&lru->list, &all_memcg_lrus);
>> +out:
>> +	mutex_unlock(&memcg_create_mutex);
>> +	return err;
>> +
>> +out_free_lru_memcg:
>> +	for (i = 0; i < memcg_limited_groups_array_size; i++)
>> +		list_lru_memcg_free(lru, i);
>> +	kfree(lru->memcg);
>> +	goto out;
>> +}
> That will probably scale even worse. Think about what happens when we
> try to mount a bunch of filesystems in parallel - they will now
> serialise completely on this memcg_create_mutex instantiating memcg
> lists inside list_lru_init().

Yes, scalability seems to be the main problem here. I have a few
thoughts on how it could be improved. Here they are:

1) We can turn memcg_create_mutex into an rw-semaphore (or introduce a
separate rw-semaphore that we would take for modifying list_lrus) and
take it for reading in memcg_list_lru_init() and for writing when we
create a new memcg (memcg_init_all_lrus()).
This would remove the bottleneck from the mount path, but every memcg
creation would still iterate over all LRUs under a memcg mutex. So I
guess it is not an option, is it?

2) We could use cmpxchg() instead of a mutex in list_lru_init_memcg()
and memcg_init_all_lrus() to ensure a memcg LRU is initialized only
once. But again, this would not remove the iteration over all LRUs from
memcg_init_all_lrus().

3) We can try to initialize per-memcg LRUs lazily, only when we
actually need them, similar to how we now handle per-memcg kmem cache
creation. If list_lru_add() cannot find the appropriate LRU, it will
schedule a background worker to initialize it.
The benefits of this approach are clear: we do not introduce any
bottlenecks, and we lower memory consumption in the case where different
memcgs use different mounts exclusively.
However, there is one thing that bothers me: until the per-memcg LRU is
created, objects accounted to that memcg will go onto the global LRU,
which will postpone the actual memcg destruction until global reclaim.
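
For what it's worth, a very rough sketch of how the lazy path of option
3 might look (every name here is hypothetical, this is not from the
patch set):

/* Sketch only: fall back to the global LRU until the per-memcg one exists. */
static struct list_lru_one *lru_of_memcg_lazy(struct list_lru *lru,
                                              struct mem_cgroup *memcg)
{
        struct list_lru_one **memcg_lrus;
        struct list_lru_one *memcg_lru = NULL;
        int id = memcg_cache_id(memcg);

        if (id < 0)
                return &lru->global;

        rcu_read_lock();
        memcg_lrus = rcu_dereference(lru->memcg);
        if (memcg_lrus)
                memcg_lru = memcg_lrus[id];
        rcu_read_unlock();

        if (!memcg_lru) {
                /* hypothetical worker that allocates lru->memcg[id] later */
                schedule_lru_memcg_alloc(lru, id);
                return &lru->global;
        }
        return memcg_lru;
}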

Thanks.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v13 13/16] vmscan: take at least one pass with shrinkers
  2013-12-10  4:18     ` Dave Chinner
@ 2013-12-10 11:50       ` Vladimir Davydov
  -1 siblings, 0 replies; 81+ messages in thread
From: Vladimir Davydov @ 2013-12-10 11:50 UTC (permalink / raw)
  To: Dave Chinner
  Cc: dchinner, hannes, mhocko, akpm, linux-kernel, linux-mm, cgroups,
	devel, glommer, glommer, Glauber Costa, Mel Gorman, Rik van Riel

On 12/10/2013 08:18 AM, Dave Chinner wrote:
> On Mon, Dec 09, 2013 at 12:05:54PM +0400, Vladimir Davydov wrote:
>> From: Glauber Costa <glommer@openvz.org>
>>
>> In very low free kernel memory situations, it may be the case that we
>> have less objects to free than our initial batch size. If this is the
>> case, it is better to shrink those, and open space for the new workload
>> then to keep them and fail the new allocations.
>>
>> In particular, we are concerned with the direct reclaim case for memcg.
>> Although this same technique can be applied to other situations just as
>> well, we will start conservative and apply it for that case, which is
>> the one that matters the most.
> This should be at the start of the series.

Since Glauber wanted to introduce this only for memcg reclaim first, it
can't be at the start of the series, but I'll move it to go immediately
after the per-memcg shrinking core in the next iteration.

Thanks.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v13 13/16] vmscan: take at least one pass with shrinkers
  2013-12-10 11:50       ` Vladimir Davydov
@ 2013-12-10 12:38         ` Glauber Costa
  -1 siblings, 0 replies; 81+ messages in thread
From: Glauber Costa @ 2013-12-10 12:38 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Dave Chinner, dchinner, Johannes Weiner, Michal Hocko,
	Andrew Morton, LKML, linux-mm, cgroups, devel, Glauber Costa,
	Glauber Costa, Mel Gorman, Rik van Riel

On Tue, Dec 10, 2013 at 3:50 PM, Vladimir Davydov
<vdavydov@parallels.com> wrote:
> On 12/10/2013 08:18 AM, Dave Chinner wrote:
>> On Mon, Dec 09, 2013 at 12:05:54PM +0400, Vladimir Davydov wrote:
>>> From: Glauber Costa <glommer@openvz.org>
>>>
>>> In very low free kernel memory situations, it may be the case that we
>>> have less objects to free than our initial batch size. If this is the
>>> case, it is better to shrink those, and open space for the new workload
>>> then to keep them and fail the new allocations.
>>>
>>> In particular, we are concerned with the direct reclaim case for memcg.
>>> Although this same technique can be applied to other situations just as
>>> well, we will start conservative and apply it for that case, which is
>>> the one that matters the most.
>> This should be at the start of the series.
>
> Since Glauber wanted to introduce this only for memcg-reclaim first,
> this can't be at the start of the series, but I'll move it to go
> immediately after per-memcg shrinking core in the next iteration.
>
> Thanks.

So, the reason for this being memcg-only is that reclaiming small
numbers of objects triggered a bunch of XFS regressions (I am sure the
regressions are general, but I've tested them using ZFS).

In theory they shouldn't, so we can try to make it global again, as long
as it comes together with benchmarks demonstrating that it is a safe
change.

I am not sure the filesystem people would benefit from that directly,
though, so it may not be worth the hassle...

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v13 12/16] fs: mark list_lru based shrinkers memcg aware
  2013-12-10  4:17     ` Dave Chinner
@ 2013-12-11 11:08       ` Steven Whitehouse
  -1 siblings, 0 replies; 81+ messages in thread
From: Steven Whitehouse @ 2013-12-11 11:08 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Vladimir Davydov, dchinner, hannes, mhocko, akpm, linux-kernel,
	linux-mm, cgroups, devel, glommer, glommer, Al Viro

Hi,

On Tue, 2013-12-10 at 15:17 +1100, Dave Chinner wrote:
> On Mon, Dec 09, 2013 at 12:05:53PM +0400, Vladimir Davydov wrote:
> > Since now list_lru automatically distributes objects among per-memcg
> > lists and list_lru_{count,walk} employ information passed in the
> > shrink_control argument to scan appropriate list, all shrinkers that
> > keep objects in the list_lru structure can already work as memcg-aware.
> > Let us mark them so.
> > 
> > Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
> > Cc: Glauber Costa <glommer@openvz.org>
> > Cc: Dave Chinner <dchinner@redhat.com>
> > Cc: Al Viro <viro@zeniv.linux.org.uk>
> > ---
> >  fs/gfs2/quota.c  |    2 +-
> >  fs/super.c       |    2 +-
> >  fs/xfs/xfs_buf.c |    2 +-
> >  fs/xfs/xfs_qm.c  |    2 +-
> >  4 files changed, 4 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
> > index f0435da..6cf6114 100644
> > --- a/fs/gfs2/quota.c
> > +++ b/fs/gfs2/quota.c
> > @@ -150,7 +150,7 @@ struct shrinker gfs2_qd_shrinker = {
> >  	.count_objects = gfs2_qd_shrink_count,
> >  	.scan_objects = gfs2_qd_shrink_scan,
> >  	.seeks = DEFAULT_SEEKS,
> > -	.flags = SHRINKER_NUMA_AWARE,
> > +	.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE,
> >  };
> 
> I'll leave it for Steve to have the final say, but this cache tracks
> objects that have contexts that span multiple memcgs (i.e. global
> scope) and so is not a candidate for memcg based shrinking.
> 
> e.g. a single user can have processes running in multiple concurrent
> memcgs, and so the user quota dquot needs to be accessed from all
> those memcg contexts. Same for group quota objects - they can span
> multiple memcgs that different users have instantiated, simply
> because they all belong to the same group and hence are subject to
> the group quota accounting.
> 
> And for XFS, there's also project quotas, which means you can have
> files that are unique to both users and groups, but share the same
> project quota and hence span memcgs that way....
> 

Well, that seems to make sense to me. I'm not that familiar with memcg;
my main interest was to use the provided lru code, unless there was a
good reason why we should roll our own, and also to take advantage of
the NUMA friendliness of the new code. Although my main target in GFS2
was the glock lru, I've not got that far yet, as it is a rather more
complicated thing to do compared with the quota code.

Steve.



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v13 11/16] mm: list_lru: add per-memcg lists
  2013-12-10 10:05       ` Vladimir Davydov
@ 2013-12-12  1:40         ` Dave Chinner
  -1 siblings, 0 replies; 81+ messages in thread
From: Dave Chinner @ 2013-12-12  1:40 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: dchinner, hannes, mhocko, akpm, linux-kernel, linux-mm, cgroups,
	devel, glommer, glommer, Al Viro, Balbir Singh,
	KAMEZAWA Hiroyuki

On Tue, Dec 10, 2013 at 02:05:47PM +0400, Vladimir Davydov wrote:
> Hi, David
> 
> First of all, let me thank you for such a thorough review. It is really
> helpful. As usual, I can't help agreeing with most of your comments, but
> there are a couple of things I'd like to clarify. Please, see comments
> inline.

No worries - I just want to make sure that we integrate this as
cleanly as possible ;)

> I agree that such a setup would not only reduce memory consumption, but
> also make the code look much clearer by removing these ugly "list_lru_one"
> and "olru" I had to introduce. However, it would also make us scan memcg
> LRUs more aggressively than usual NUMA-aware LRUs on global pressure (I
> mean kswapds would scan them on each node). I don't think it's much of a
> concern though, because this is what we had for all shrinkers before
> NUMA-awareness was introduced. Besides, prioritizing memcg LRU reclaim
> over global LRUs sounds sane. That said, I like this idea. Thanks.

Right, and given that most of the cases where these memcg LRUs are
going to be important are containerised systems where they are typically
small, the scalability of per-node LRU lists is not really needed.
And, as such, having a single LRU for them means that reclaim will
be slightly more predictable within the memcg....
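
For readers not familiar with the structures involved, a rough sketch
of the layout being discussed; the struct and field names here are
illustrative, not taken from the posted patch:

	struct list_lru_node {
		spinlock_t		lock;
		struct list_head	list;
		long			nr_items;
	};

	/* per-node lists for globally charged objects, one single list per memcg */
	struct list_lru_sketch {
		struct list_lru_node	*node;		/* one per NUMA node */
		struct list_lru_node	**memcg;	/* indexed by memcg_cache_id() */
	};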

> >> +int list_lru_init(struct list_lru *lru)
> >> +{
> >> +	int err;
> >> +
> >> +	err = list_lru_init_one(&lru->global);
> >> +	if (err)
> >> +		goto fail;
> >> +
> >> +	err = memcg_list_lru_init(lru);
> >> +	if (err)
> >> +		goto fail;
> >> +
> >> +	return 0;
> >> +fail:
> >> +	list_lru_destroy_one(&lru->global);
> >> +	lru->global.node = NULL; /* see list_lru_destroy() */
> >> +	return err;
> >> +}
> > I suspect we have users of list_lru that don't want memcg bits added
> > to them. Hence I think we want to leave list_lru_init() alone and
> > add a list_lru_init_memcg() variant that makes the LRU memcg aware.
> > i.e. if the shrinker is not going to be memcg aware, then we don't
> > want the LRU to be memcg aware, either....
> 
> I thought that we wanted to make all LRUs per-memcg automatically, just
> like it was with NUMA awareness. After your explanation about some
> FS-specific caches (gfs2/xfs dquot), I admit I was wrong: not all
> caches require per-memcg shrinking. I'll add a flag to list_lru_init()
> specifying whether we want memcg awareness.

Keep in mind that this may extend to the declaration of slab caches.
For example, XFS has a huge number of internal caches (see
xfs_init_zones()) and, in reality, no allocation from any of these
other than the xfs inode slab should be accounted to a memcg, i.e.
the objects allocated out of them are filesystem objects that have
global scope and so shouldn't be owned by or accounted to a memcg as
such...
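
A sketch of what the opt-in initialisation mentioned above could look
like; the flag and wrapper names are hypothetical, list_lru_init() is
assumed to set up only the plain per-node lists, and
memcg_list_lru_init() is the helper from the patch under review:

	#define LRU_MEMCG_AWARE	(1U << 0)	/* hypothetical flag */

	int list_lru_init_flags(struct list_lru *lru, unsigned int flags)
	{
		int err;

		err = list_lru_init(lru);		/* plain per-node lists only */
		if (err)
			return err;

		if (flags & LRU_MEMCG_AWARE) {
			/* add per-memcg lists and register the lru */
			err = memcg_list_lru_init(lru);
			if (err)
				list_lru_destroy(lru);
		}
		return err;
	}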


> >> +int list_lru_grow_memcg(struct list_lru *lru, size_t new_array_size)
> >> +{
> >> +	int i;
> >> +	struct list_lru_one **memcg_lrus;
> >> +
> >> +	memcg_lrus = kcalloc(new_array_size, sizeof(*memcg_lrus), GFP_KERNEL);
> >> +	if (!memcg_lrus)
> >> +		return -ENOMEM;
> >> +
> >> +	if (lru->memcg) {
> >> +		for_each_memcg_cache_index(i) {
> >> +			if (lru->memcg[i])
> >> +				memcg_lrus[i] = lru->memcg[i];
> >> +		}
> >> +	}
> > Um, krealloc()?
> 
> Not exactly. We have to keep the old version until we call sync_rcu.

Ah, of course. Could you add a big comment explaining this so that
the next reader doesn't suggest replacing it with krealloc(), too?
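
Something along the lines of the comment being asked for, sketched
around the quoted grow function; this sketch frees the old array right
after synchronize_rcu(), while where exactly the posted patch drops
->memcg_old may differ:

	static int list_lru_grow_memcg_sketch(struct list_lru *lru,
					      size_t old_size, size_t new_size)
	{
		struct list_lru_one **new_array;
		size_t i;

		/*
		 * Deliberately NOT krealloc(): lockless readers may still be
		 * walking the old lru->memcg array under RCU, so build a new
		 * array, publish it, and free the old one only after a grace
		 * period.  krealloc() could free or reuse the old memory
		 * while it is still being read.
		 */
		new_array = kcalloc(new_size, sizeof(*new_array), GFP_KERNEL);
		if (!new_array)
			return -ENOMEM;

		for (i = 0; i < old_size; i++)
			new_array[i] = lru->memcg[i];

		lru->memcg_old = lru->memcg;
		rcu_assign_pointer(lru->memcg, new_array);

		synchronize_rcu();
		kfree(lru->memcg_old);
		lru->memcg_old = NULL;
		return 0;
	}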

> >> +int memcg_list_lru_init(struct list_lru *lru)
> >> +{
> >> +	int err = 0;
> >> +	int i;
> >> +	struct mem_cgroup *memcg;
> >> +
> >> +	lru->memcg = NULL;
> >> +	lru->memcg_old = NULL;
> >> +
> >> +	mutex_lock(&memcg_create_mutex);
> >> +	if (!memcg_kmem_enabled())
> >> +		goto out_list_add;
> >> +
> >> +	lru->memcg = kcalloc(memcg_limited_groups_array_size,
> >> +			     sizeof(*lru->memcg), GFP_KERNEL);
> >> +	if (!lru->memcg) {
> >> +		err = -ENOMEM;
> >> +		goto out;
> >> +	}
> >> +
> >> +	for_each_mem_cgroup(memcg) {
> >> +		int memcg_id;
> >> +
> >> +		memcg_id = memcg_cache_id(memcg);
> >> +		if (memcg_id < 0)
> >> +			continue;
> >> +
> >> +		err = list_lru_memcg_alloc(lru, memcg_id);
> >> +		if (err) {
> >> +			mem_cgroup_iter_break(NULL, memcg);
> >> +			goto out_free_lru_memcg;
> >> +		}
> >> +	}
> >> +out_list_add:
> >> +	list_add(&lru->list, &all_memcg_lrus);
> >> +out:
> >> +	mutex_unlock(&memcg_create_mutex);
> >> +	return err;
> >> +
> >> +out_free_lru_memcg:
> >> +	for (i = 0; i < memcg_limited_groups_array_size; i++)
> >> +		list_lru_memcg_free(lru, i);
> >> +	kfree(lru->memcg);
> >> +	goto out;
> >> +}
> > That will probably scale even worse. Think about what happens when we
> > try to mount a bunch of filesystems in parallel - they will now
> > serialise completely on this memcg_create_mutex instantiating memcg
> > lists inside list_lru_init().
> 
> Yes, the scalability seems to be the main problem here. I have a couple
> of thoughts on how it could be improved. Here they go:
> 
> 1) We can turn memcg_create_mutex to rw-semaphore (or introduce an
> rw-semaphore, which we would take for modifying list_lru's) and take it
> for reading in memcg_list_lru_init() and for writing when we create a
> new memcg (memcg_init_all_lrus()).
> This would remove the bottleneck from the mount path, but every memcg
> creation would still iterate over all LRUs under a memcg mutex. So I
> guess it is not an option, isn't it?

Right - it's not so much that there is a mutex to protect the init,
it's how long it's held that will be the issue. I mean, we don't
need to hold the memcg_create_mutex until we've completely
initialised the lru structure and are ready to add it to the
all_memcg_lrus list, right?

i.e. restructuring it so that you don't need to hold the mutex until
you make the LRU list globally visible would solve the problem just
as well. If we can iterate the memcg lists without holding a lock,
then we can init the per-memcg lru lists without holding a lock,
because nobody will access them through the list_lru structure
since it hasn't been published yet.

That keeps the locking simple, and we get scalability because we've
reduced the lock's scope to just a few instructions instead of a
memcg iteration and a heap of memory allocation....
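
Roughly, the ordering being suggested, as a sketch only; the follow-ups
below discuss a race this still leaves against memcgs created while the
lru is being set up:

	int memcg_list_lru_init_sketch(struct list_lru *lru)
	{
		struct mem_cgroup *memcg;
		int err = 0;

		lru->memcg = kcalloc(memcg_limited_groups_array_size,
				     sizeof(*lru->memcg), GFP_KERNEL);
		if (!lru->memcg)
			return -ENOMEM;

		/* nobody can see this lru yet, so no lock is needed here */
		for_each_mem_cgroup(memcg) {
			int id = memcg_cache_id(memcg);

			if (id < 0)
				continue;
			err = list_lru_memcg_alloc(lru, id);
			if (err) {
				mem_cgroup_iter_break(NULL, memcg);
				goto out_free;
			}
		}

		/* only publication is serialised */
		mutex_lock(&memcg_create_mutex);
		list_add(&lru->list, &all_memcg_lrus);
		mutex_unlock(&memcg_create_mutex);
		return 0;

	out_free:
		kfree(lru->memcg);	/* per-id frees omitted in this sketch */
		return err;
	}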

> 2) We could use cmpxchg() instead of a mutex in list_lru_init_memcg()
> and memcg_init_all_lrus() to assure a memcg LRU is initialized only
> once. But again, this would not remove iteration over all LRUs from
> memcg_init_all_lrus().
> 
> 3) We can try to initialize per-memcg LRUs lazily only when we actually
> need them, similar to how we now handle per-memcg kmem caches creation.
> If list_lru_add() cannot find appropriate LRU, it will schedule a
> background worker for its initialization.

I'd prefer not to add complexity to the list_lru_add() path here.
It's frequently called, so it's a hot code path, and we should
keep it as simple as possible.

> The benefits of this approach are clear: we do not introduce any
> bottlenecks, and we lower memory consumption in case different memcgs
> use different mounts exclusively.
> However, there is one thing that bothers me. Some objects accounted to a
> memcg will go into the global LRU, which will postpone actual memcg
> destruction until global reclaim.

Yeah, that's messy. best to avoid it by doing the work at list init
time, IMO.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v13 11/16] mm: list_lru: add per-memcg lists
  2013-12-12  1:40         ` Dave Chinner
@ 2013-12-12  9:50           ` Vladimir Davydov
  -1 siblings, 0 replies; 81+ messages in thread
From: Vladimir Davydov @ 2013-12-12  9:50 UTC (permalink / raw)
  To: Dave Chinner
  Cc: dchinner, hannes, mhocko, akpm, linux-kernel, linux-mm, cgroups,
	devel, glommer, glommer, Al Viro, Balbir Singh,
	KAMEZAWA Hiroyuki

On 12/12/2013 05:40 AM, Dave Chinner wrote:
>>>> +int list_lru_grow_memcg(struct list_lru *lru, size_t new_array_size)
>>>> +{
>>>> +	int i;
>>>> +	struct list_lru_one **memcg_lrus;
>>>> +
>>>> +	memcg_lrus = kcalloc(new_array_size, sizeof(*memcg_lrus), GFP_KERNEL);
>>>> +	if (!memcg_lrus)
>>>> +		return -ENOMEM;
>>>> +
>>>> +	if (lru->memcg) {
>>>> +		for_each_memcg_cache_index(i) {
>>>> +			if (lru->memcg[i])
>>>> +				memcg_lrus[i] = lru->memcg[i];
>>>> +		}
>>>> +	}
>>> Um, krealloc()?
>> Not exactly. We have to keep the old version until we call sync_rcu.
> Ah, of course. Could you add a big comment explaining this so that
> the next reader doesn't suggest replacing it with krealloc(), too?

Sure.

>>>> +int memcg_list_lru_init(struct list_lru *lru)
>>>> +{
>>>> +	int err = 0;
>>>> +	int i;
>>>> +	struct mem_cgroup *memcg;
>>>> +
>>>> +	lru->memcg = NULL;
>>>> +	lru->memcg_old = NULL;
>>>> +
>>>> +	mutex_lock(&memcg_create_mutex);
>>>> +	if (!memcg_kmem_enabled())
>>>> +		goto out_list_add;
>>>> +
>>>> +	lru->memcg = kcalloc(memcg_limited_groups_array_size,
>>>> +			     sizeof(*lru->memcg), GFP_KERNEL);
>>>> +	if (!lru->memcg) {
>>>> +		err = -ENOMEM;
>>>> +		goto out;
>>>> +	}
>>>> +
>>>> +	for_each_mem_cgroup(memcg) {
>>>> +		int memcg_id;
>>>> +
>>>> +		memcg_id = memcg_cache_id(memcg);
>>>> +		if (memcg_id < 0)
>>>> +			continue;
>>>> +
>>>> +		err = list_lru_memcg_alloc(lru, memcg_id);
>>>> +		if (err) {
>>>> +			mem_cgroup_iter_break(NULL, memcg);
>>>> +			goto out_free_lru_memcg;
>>>> +		}
>>>> +	}
>>>> +out_list_add:
>>>> +	list_add(&lru->list, &all_memcg_lrus);
>>>> +out:
>>>> +	mutex_unlock(&memcg_create_mutex);
>>>> +	return err;
>>>> +
>>>> +out_free_lru_memcg:
>>>> +	for (i = 0; i < memcg_limited_groups_array_size; i++)
>>>> +		list_lru_memcg_free(lru, i);
>>>> +	kfree(lru->memcg);
>>>> +	goto out;
>>>> +}
>>> That will probably scale even worse. Think about what happens when we
>>> try to mount a bunch of filesystems in parallel - they will now
>>> serialise completely on this memcg_create_mutex instantiating memcg
>>> lists inside list_lru_init().
>> Yes, the scalability seems to be the main problem here. I have a couple
>> of thoughts on how it could be improved. Here they go:
>>
>> 1) We can turn memcg_create_mutex to rw-semaphore (or introduce an
>> rw-semaphore, which we would take for modifying list_lru's) and take it
>> for reading in memcg_list_lru_init() and for writing when we create a
>> new memcg (memcg_init_all_lrus()).
>> This would remove the bottleneck from the mount path, but every memcg
>> creation would still iterate over all LRUs under a memcg mutex. So I
>> guess it is not an option, isn't it?
> Right - it's not so much that there is a mutex to protect the init,
> it's how long it's held that will be the issue. I mean, we don't
> need to hold the memcg_create_mutex until we've completely
> initialised the lru structure and are ready to add it to the
> all_memcg_lrus list, right?
>
> i.e. restructuring it so that you don't need to hold the mutex until
> you make the LRU list globally visible would solve the problem just
> as well. If we can iterate the memcg lists without holding a lock,
> then we can init the per-memcg lru lists without holding a lock,
> because nobody will access them through the list_lru structure
> since it hasn't been published yet.
>
> That keeps the locking simple, and we get scalability because we've
> reduced the lock's scope to just a few instructions instead of a
> memcg iteration and a heap of memory allocation....

Unfortunately that's not as easy as it seems :-(

Currently I hold the memcg_create_mutex while initializing per-memcg
LRUs in memcg_list_lru_init() in order to be sure that I won't miss a
memcg that is added during initialization.

I mean, let's try to move per-memcg LRUs allocation outside the lock and
only register the LRU there:

memcg_list_lru_init():
    1) allocate lru->memcg array
    2) for_each_kmem_active_memcg(m)
            allocate lru->memcg[m]
    3) lock memcg_create_mutex
       add lru to all_memcg_lrus_list
       unlock memcg_create_mutex

Then if a new kmem-active memcg is added after step 2 and before step 3,
it won't see the new lru, because it has not been registered yet, and
thus won't initialize its list in this lru. As a result, we will end up
with a partially initialized list_lru. Note that this will happen even
if the whole memcg initialization proceeds under the memcg_create_mutex.

Provided we could freeze memcg_limited_groups_array_size, it would be
possible to fix this problem by swapping steps 2 and 3 and making step 2
initialize lru->memcg[m] using cmpxchg() only if it was not initialized.
However we still have to hold the memcg_create_mutex during the whole
kmemcg activation path (memcg_init_all_lrus()).

Let's see if we can get rid of the lock in memcg_init_all_lrus() by
making the all_memcg_lrus RCU-protected so that we could iterate over
all list_lrus w/o holding any locks and turn memcg_init_all_lrus() to
something like this:

memcg_init_all_lrus():
    1) for_each_list_lru_rcu(lru)
           allocate lru->memcg[new_memcg_id]
    2) mark new_memcg as kmem-active

The problem is that if memcg_list_lru_init(new_lru) starts and completes
between steps 1 and 2, we will not initialize new_lru->memcg[new_memcg_id]
in either memcg_init_all_lrus() or memcg_list_lru_init().

The problem here is that on kmemcg creation (memcg_init_all_lrus()) we
have to iterate over all list_lrus while on list_lru creation
(memcg_list_lru_init()) we have to iterate over all memcgs. Currently I
can't figure out how we can do it w/o holding any mutexes at least while
calling one of these functions, but I'll keep thinking on it.

>
>> 2) We could use cmpxchg() instead of a mutex in list_lru_init_memcg()
>> and memcg_init_all_lrus() to assure a memcg LRU is initialized only
>> once. But again, this would not remove iteration over all LRUs from
>> memcg_init_all_lrus().
>>
>> 3) We can try to initialize per-memcg LRUs lazily only when we actually
>> need them, similar to how we now handle per-memcg kmem caches creation.
>> If list_lru_add() cannot find appropriate LRU, it will schedule a
>> background worker for its initialization.
> I'd prefer not to add complexity to the list_lru_add() path here.
> It's frequently called, so it's a hot code path, and we should
> keep it as simple as possible.
>
>> The benefits of this approach are clear: we do not introduce any
>> bottlenecks, and we lower memory consumption in case different memcgs
>> use different mounts exclusively.
>> However, there is one thing that bothers me. Some objects accounted to a
>> memcg will go into the global LRU, which will postpone actual memcg
>> destruction until global reclaim.
> Yeah, that's messy. best to avoid it by doing the work at list init
> time, IMO.

I also think so, because the benefits of this are rather doubtful:
1) We actually remove bottlenecks only from slow paths (memcg creation
and fs mount), which are executed relatively rarely.
2) In contrast to kmem_cache, list_lru_node is a very small structure,
so making per-memcg lists lazily initialized would not save us much
memory.
But currently I guess it would be the easiest way to get rid of the
memcg_create_mutex held in the initialization paths.

Thanks.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: [PATCH v13 11/16] mm: list_lru: add per-memcg lists
  2013-12-12  9:50           ` Vladimir Davydov
@ 2013-12-12 20:24             ` Vladimir Davydov
  -1 siblings, 0 replies; 81+ messages in thread
From: Vladimir Davydov @ 2013-12-12 20:24 UTC (permalink / raw)
  To: Dave Chinner
  Cc: glommer, Balbir Singh, dchinner, KAMEZAWA Hiroyuki, linux-kernel,
	glommer, mhocko, linux-mm, Al Viro, hannes, cgroups, akpm, devel

On 12/12/2013 01:50 PM, Vladimir Davydov wrote:
>>>>> +int memcg_list_lru_init(struct list_lru *lru)
>>>>> +{
>>>>> +	int err = 0;
>>>>> +	int i;
>>>>> +	struct mem_cgroup *memcg;
>>>>> +
>>>>> +	lru->memcg = NULL;
>>>>> +	lru->memcg_old = NULL;
>>>>> +
>>>>> +	mutex_lock(&memcg_create_mutex);
>>>>> +	if (!memcg_kmem_enabled())
>>>>> +		goto out_list_add;
>>>>> +
>>>>> +	lru->memcg = kcalloc(memcg_limited_groups_array_size,
>>>>> +			     sizeof(*lru->memcg), GFP_KERNEL);
>>>>> +	if (!lru->memcg) {
>>>>> +		err = -ENOMEM;
>>>>> +		goto out;
>>>>> +	}
>>>>> +
>>>>> +	for_each_mem_cgroup(memcg) {
>>>>> +		int memcg_id;
>>>>> +
>>>>> +		memcg_id = memcg_cache_id(memcg);
>>>>> +		if (memcg_id < 0)
>>>>> +			continue;
>>>>> +
>>>>> +		err = list_lru_memcg_alloc(lru, memcg_id);
>>>>> +		if (err) {
>>>>> +			mem_cgroup_iter_break(NULL, memcg);
>>>>> +			goto out_free_lru_memcg;
>>>>> +		}
>>>>> +	}
>>>>> +out_list_add:
>>>>> +	list_add(&lru->list, &all_memcg_lrus);
>>>>> +out:
>>>>> +	mutex_unlock(&memcg_create_mutex);
>>>>> +	return err;
>>>>> +
>>>>> +out_free_lru_memcg:
>>>>> +	for (i = 0; i < memcg_limited_groups_array_size; i++)
>>>>> +		list_lru_memcg_free(lru, i);
>>>>> +	kfree(lru->memcg);
>>>>> +	goto out;
>>>>> +}
>>>> That will probably scale even worse. Think about what happens when we
>>>> try to mount a bunch of filesystems in parallel - they will now
>>>> serialise completely on this memcg_create_mutex instantiating memcg
>>>> lists inside list_lru_init().
>>> Yes, the scalability seems to be the main problem here. I have a couple
>>> of thoughts on how it could be improved. Here they go:
>>>
>>> 1) We can turn memcg_create_mutex to rw-semaphore (or introduce an
>>> rw-semaphore, which we would take for modifying list_lru's) and take it
>>> for reading in memcg_list_lru_init() and for writing when we create a
>>> new memcg (memcg_init_all_lrus()).
>>> This would remove the bottleneck from the mount path, but every memcg
>>> creation would still iterate over all LRUs under a memcg mutex. So I
>>> guess it is not an option, isn't it?
>> Right - it's not so much that there is a mutex to protect the init,
>> it's how long it's held that will be the issue. I mean, we don't
>> need to hold the memcg_create_mutex until we've completely
>> initialised the lru structure and are ready to add it to the
>> all_memcg_lrus list, right?
>>
>> i.e. restructuring it so that you don't need to hold the mutex until
>> you make the LRU list globally visible would solve the problem just
>> as well. If we can iterate the memcg lists without holding a lock,
>> then we can init the per-memcg lru lists without holding a lock,
>> because nobody will access them through the list_lru structure
>> since it hasn't been published yet.
>>
>> That keeps the locking simple, and we get scalability because we've
>> reduced the lock's scope to just a few instructions instead of a
>> memcg iteration and a heap of memory allocation....
> Unfortunately that's not as easy as it seems :-(
>
> Currently I hold the memcg_create_mutex while initializing per-memcg
> LRUs in memcg_list_lru_init() in order to be sure that I won't miss a
> memcg that is added during initialization.
>
> I mean, let's try to move per-memcg LRUs allocation outside the lock and
> only register the LRU there:
>
> memcg_list_lru_init():
>     1) allocate lru->memcg array
>     2) for_each_kmem_active_memcg(m)
>             allocate lru->memcg[m]
>     3) lock memcg_create_mutex
>        add lru to all_memcg_lrus_list
>        unlock memcg_create_mutex
>
> Then if a new kmem-active memcg is added after step 2 and before step 3,
> it won't see the new lru, because it has not been registered yet, and
> thus won't initialize its list in this lru. As a result, we will end up
> with a partially initialized list_lru. Note that this will happen even
> if the whole memcg initialization proceeds under the memcg_create_mutex.
>
> Provided we could freeze memcg_limited_groups_array_size, it would be
> possible to fix this problem by swapping steps 2 and 3 and making step 2
> initialize lru->memcg[m] using cmpxchg() only if it was not initialized.
> However we still have to hold the memcg_create_mutex during the whole
> kmemcg activation path (memcg_init_all_lrus()).
>
> Let's see if we can get rid of the lock in memcg_init_all_lrus() by
> making the all_memcg_lrus RCU-protected so that we could iterate over
> all list_lrus w/o holding any locks and turn memcg_init_all_lrus() to
> something like this:
>
> memcg_init_all_lrus():
>     1) for_each_list_lru_rcu(lru)
>            allocate lru->memcg[new_memcg_id]
>     2) mark new_memcg as kmem-active
>
> The problem is that if memcg_list_lru_init(new_lru) starts and completes
> between steps 1 and 2, we will not initialize new_lru->memcg[new_memcg_id]
> in either memcg_init_all_lrus() or memcg_list_lru_init().
>
> The problem here is that on kmemcg creation (memcg_init_all_lrus()) we
> have to iterate over all list_lrus while on list_lru creation
> (memcg_list_lru_init()) we have to iterate over all memcgs. Currently I
> can't figure out how we can do it w/o holding any mutexes at least while
> calling one of these functions, but I'll keep thinking on it.
>

Seems I've got it. We could add a memcg state bit, say "activating",
meaning that a memcg is going to become kmem-active but is not yet, so
it should not be accounted to; keep all list_lrus on an RCU-protected
list; and implement memcg_init_all_lrus() and memcg_list_lru_init() as
follows:

memcg_init_all_lrus():
    set activating
    for_each_list_lru_rcu(lru)
        cmpxchg(&lru->memcg[new_memcg_id], NULL, new list_lru_node);
    set active

memcg_list_lru_init():
    add the new_lru to the all_memcg_lrus list
    for_each_memcg(memcg):
        if memcg is activating or active:
            cmpxchg(&new_lru->memcg[memcg_id], NULL, new list_lru_node)

At first glance, it looks correct, because:

If we skip an lru while iterating over all_memcg_lrus in
memcg_init_all_lrus(), it means it was created after the "activating"
bit had been set and thus the per-memcg lru will be initialized in
memcg_list_lru_init(). If we skip a memcg in memcg_list_lru_init(), this
memcg will "see" this lru while iterating over the all_memcg_lrus list,
because the lru must have been created before the "activating" bit was set.
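
A sketch of the cmpxchg() install step described above;
lru_memcg_install() is a hypothetical helper, list_lru_init_one() and
list_lru_destroy_one() are the helpers from the patch, and destruction
and array resizing are not handled here:

	static int lru_memcg_install(struct list_lru *lru, int memcg_id)
	{
		struct list_lru_one *new, *old;
		int err;

		new = kzalloc(sizeof(*new), GFP_KERNEL);
		if (!new)
			return -ENOMEM;

		err = list_lru_init_one(new);
		if (err) {
			kfree(new);
			return err;
		}

		old = cmpxchg(&lru->memcg[memcg_id], NULL, new);
		if (old) {
			/* lost the race: the other path already filled this slot */
			list_lru_destroy_one(new);
			kfree(new);
		}
		return 0;
	}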

Although this still needs to be elaborated, because I haven't examined the
destruction paths yet, and it doesn't take into account the fact that
memcg_limited_groups_array_size is not a constant (I guess I'll have to
introduce an rw-semaphore for it, but that should be OK, because its
updates are very rare), I think I am on the right track.

Thanks.


* Re: [PATCH v13 11/16] mm: list_lru: add per-memcg lists
  2013-12-10  5:00     ` Dave Chinner
@ 2013-12-12 20:48       ` Glauber Costa
  -1 siblings, 0 replies; 81+ messages in thread
From: Glauber Costa @ 2013-12-12 20:48 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Vladimir Davydov, dchinner, Johannes Weiner, Michal Hocko,
	Andrew Morton, LKML, linux-mm, cgroups, devel, Glauber Costa,
	Al Viro, Balbir Singh, KAMEZAWA Hiroyuki

> OK, as far as I can tell, this is introducing per-node, per-memcg
> LRU lists. Is that correct?
>
> If so, then that is not what Glauber and I originally intended for
> memcg LRUs. per-node LRUs are expensive in terms of memory and cross
> multiplying them by the number of memcgs in a system was not a good
> use of memory.
>
> According to Glauber, most memcgs are small and typically confined
> to a single node or two by external means and therefore don't need the
> scalability NUMA-aware LRUs provide. Hence the idea was that the
> memcg LRUs would just be a single LRU list, just like a non-numa
> aware list_lru instantiation. IOWs, this is the structure that we
> had decided on as the best compromise between memory usage,
> complexity and memcg awareness:
>
Sorry for jumping late into this particular e-mail.

I just wanted to point out that the reason I adopted such a matrix in
my design was that it actually uses less memory this way. My reasoning
was explained in the original patch that contained that implementation.

This is because whenever an object would go on a memcg list, it *would
not* go on the global list. Therefore, to keep node information for
global reclaim, you have to put the objects in node lists.

memcg reclaim, however, would reclaim regardless of node information.

In global reclaim, the memcg lists would be scanned obeying the node
structure in the lists.

Because that has a fixed cost, it ends up using less memory than having
a second list pointer in the objects, which is something that scales
with the number of objects. Not to mention that that cost would be
incurred even with memcg not being in use, which is something we would
like to avoid.
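
To make the memory argument concrete, here is a rough sketch of the two
layouts (purely illustrative, not the exact structures from any patch):

/*
 * Option A (the matrix): a per-node list head for every kmem-active
 * memcg.  An object accounted to a memcg sits only on that memcg's
 * per-node list, never on the global one, so global reclaim walks the
 * memcg lists node by node.  The overhead is roughly
 * nr_memcgs * nr_nodes * sizeof(struct list_lru_node) -- fixed,
 * independent of how many objects are cached.
 */
struct list_lru_node {
	spinlock_t		lock;
	struct list_head	list;
	long			nr_items;
};

struct list_lru {
	struct list_lru_node	*node;		/* global per-node lists */
	struct list_lru_node	**memcg;	/* per-memcg (per-node) lists */
};

/*
 * Option B (a second list pointer in each object): objects stay on the
 * global per-node lists and additionally carry a list_head for their
 * memcg list.  That overhead is an extra list_head (two pointers) per
 * object, so it scales with the number of cached objects and is paid
 * even when memcg is not in use.
 */
struct cached_object {
	struct list_head	global_lru;	/* global per-node list */
	struct list_head	memcg_lru;	/* extra 16 bytes per object */
	/* object payload ... */
};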


* Re: [PATCH v13 11/16] mm: list_lru: add per-memcg lists
  2013-12-12 20:24             ` Vladimir Davydov
@ 2013-12-14 20:03               ` Vladimir Davydov
  -1 siblings, 0 replies; 81+ messages in thread
From: Vladimir Davydov @ 2013-12-14 20:03 UTC (permalink / raw)
  To: Dave Chinner
  Cc: glommer, Balbir Singh, hannes, linux-kernel, glommer, mhocko,
	linux-mm, devel, Al Viro, dchinner, cgroups, akpm,
	KAMEZAWA Hiroyuki

On 12/13/2013 12:24 AM, Vladimir Davydov wrote:
> On 12/12/2013 01:50 PM, Vladimir Davydov wrote:
>>>>>> +int memcg_list_lru_init(struct list_lru *lru)
>>>>>> +{
>>>>>> +	int err = 0;
>>>>>> +	int i;
>>>>>> +	struct mem_cgroup *memcg;
>>>>>> +
>>>>>> +	lru->memcg = NULL;
>>>>>> +	lru->memcg_old = NULL;
>>>>>> +
>>>>>> +	mutex_lock(&memcg_create_mutex);
>>>>>> +	if (!memcg_kmem_enabled())
>>>>>> +		goto out_list_add;
>>>>>> +
>>>>>> +	lru->memcg = kcalloc(memcg_limited_groups_array_size,
>>>>>> +			     sizeof(*lru->memcg), GFP_KERNEL);
>>>>>> +	if (!lru->memcg) {
>>>>>> +		err = -ENOMEM;
>>>>>> +		goto out;
>>>>>> +	}
>>>>>> +
>>>>>> +	for_each_mem_cgroup(memcg) {
>>>>>> +		int memcg_id;
>>>>>> +
>>>>>> +		memcg_id = memcg_cache_id(memcg);
>>>>>> +		if (memcg_id < 0)
>>>>>> +			continue;
>>>>>> +
>>>>>> +		err = list_lru_memcg_alloc(lru, memcg_id);
>>>>>> +		if (err) {
>>>>>> +			mem_cgroup_iter_break(NULL, memcg);
>>>>>> +			goto out_free_lru_memcg;
>>>>>> +		}
>>>>>> +	}
>>>>>> +out_list_add:
>>>>>> +	list_add(&lru->list, &all_memcg_lrus);
>>>>>> +out:
>>>>>> +	mutex_unlock(&memcg_create_mutex);
>>>>>> +	return err;
>>>>>> +
>>>>>> +out_free_lru_memcg:
>>>>>> +	for (i = 0; i < memcg_limited_groups_array_size; i++)
>>>>>> +		list_lru_memcg_free(lru, i);
>>>>>> +	kfree(lru->memcg);
>>>>>> +	goto out;
>>>>>> +}
>>>>> That will probably scale even worse. Think about what happens when we
>>>>> try to mount a bunch of filesystems in parallel - they will now
>>>>> serialise completely on this memcg_create_mutex instantiating memcg
>>>>> lists inside list_lru_init().
>>>> Yes, the scalability seems to be the main problem here. I have a couple
>>>> of thoughts on how it could be improved. Here they go:
>>>>
>>>> 1) We can turn memcg_create_mutex into an rw-semaphore (or introduce an
>>>> rw-semaphore, which we would take for modifying list_lru's) and take it
>>>> for reading in memcg_list_lru_init() and for writing when we create a
>>>> new memcg (memcg_init_all_lrus()).
>>>> This would remove the bottleneck from the mount path, but every memcg
>>>> creation would still iterate over all LRUs under a memcg mutex. So I
>>>> guess it is not an option, is it?
>>> Right - it's not so much that there is a mutex to protect the init,
>>> it's how long it's held that will be the issue. I mean, we don't
>>> need to hold the memcg_create_mutex until we've completely
>>> initialised the lru structure and are ready to add it to the
>>> all_memcg_lrus list, right?
>>>
>>> i.e. restructuring it so that you don't need to hold the mutex until
>>> you make the LRU list globally visible would solve the problem just
>>> as well. If we can iterate the memcg lists without holding a lock,
>>> then we can init the per-memcg lru lists without holding a lock
>>> because nobody will access them through the list_lru structure
>>> because it's not yet been published.
>>>
>>> That keeps the locking simple, and we get scalability because we've
>>> reduced the lock's scope to just a few instructions instead of a
>>> memcg iteration and a heap of memory allocation....
>> Unfortunately that's not as easy as it seems to be :-(
>>
>> Currently I hold the memcg_create_mutex while initializing per-memcg
>> LRUs in memcg_list_lru_init() in order to be sure that I won't miss a
>> memcg that is added during initialization.
>>
>> I mean, let's try to move per-memcg LRU allocation outside the lock and
>> only register the LRU there:
>>
>> memcg_list_lru_init():
>>     1) allocate lru->memcg array
>>     2) for_each_kmem_active_memcg(m)
>>             allocate lru->memcg[m]
>>     3) lock memcg_create_mutex
>>        add lru to all_memcg_lrus_list
>>        unlock memcg_create_mutex
>>
>> Then if a new kmem-active memcg is added after step 2 and before step 3,
>> it won't see the new lru, because it has not been registered yet, and
>> thus won't initialize its list in this lru. As a result, we will end up
>> with a partially initialized list_lru. Note that this will happen even
>> if the whole memcg initialization proceeds under the memcg_create_mutex.
>>
>> Provided we could freeze memcg_limited_groups_array_size, it would be
>> possible to fix this problem by swapping steps 2 and 3 and making step 2
>> initialize lru->memcg[m] using cmpxchg() only if it was not initialized.
>> However we still have to hold the memcg_create_mutex during the whole
>> kmemcg activation path (memcg_init_all_lrus()).
>>
>> Let's see if we can get rid of the lock in memcg_init_all_lrus() by
>> making the all_memcg_lrus RCU-protected so that we could iterate over
>> all list_lrus w/o holding any locks and turn memcg_init_all_lrus() to
>> something like this:
>>
>> memcg_init_all_lrus():
>>     1) for_each_list_lru_rcu(lru)
>>            allocate lru->memcg[new_memcg_id]
>>     2) mark new_memcg as kmem-active
>>
>> The problem is that if memcg_list_lru_init(new_lru) starts and completes
>> between steps 1 and 2, we will not initialize
>> new_lru->memcg[new_memcg_id] either in memcg_init_all_lrus() or in
>> memcg_list_lru_init().
>>
>> The problem here is that on kmemcg creation (memcg_init_all_lrus()) we
>> have to iterate over all list_lrus while on list_lru creation
>> (memcg_list_lru_init()) we have to iterate over all memcgs. Currently I
>> can't figure out how we can do it w/o holding any mutexes at least while
>> calling one of these functions, but I keep thinking about it.
>>
> Seems I got it. We could add a memcg state bit, say "activating",
> meaning that a memcg is going to become kmem-active, but it is not yet,
> so it should not be accounted to; keep all list_lrus on an RCU-protected
> list; and implement memcg_init_all_lrus() and memcg_list_lru_init() as
> follows:
>
> memcg_init_all_lrus():
>     set activating
>     for_each_list_lru_rcu(lru)
>         cmpxchg(&lru->memcg[new_memcg_id], NULL, new list_lru_node);
>     set active
>
> memcg_list_lru_init():
>     add the new_lru to the all_memcg_lrus list
>     for_each_memcg(memcg):
>         if memcg is activating or active:
>             cmpxchg(&new_lru->memcg[memcg_id], NULL, new list_lru_node)

While trying to implement this I realized I was mistaken :-(
The point is that we can't iterate over the RCU-protected list of
list_lrus and allocate the per-memcg lists as we go.

So, the situation looks as follows. On kmemcg creation we need to
iterate over all list_lrus and allocate a list_lru_node for the new
kmemcg. If we keep all list_lrus on a linked list we could use RCU
iteration, but we still could not do something like this:

rcu_read_lock()
list_for_each_entry_rcu(lru)
    lru->memcg[id] = kmalloc()
rcu_read_unlock()

because the kmalloc() may sleep, and sleeping inside an RCU read-side
critical section is not allowed.

Since the list_lru structure cannot be made ref-counted (it is usually
embedded in another structure), we cannot leave the RCU critical section
to do the kmalloc() in the middle of the list_for_each loop.

Of course, we could preallocate all per-memcg LRUs before entering the
RCU critical section, i.e.

rcu_read_lock()
list_for_each_entry_rcu(lru) {
    if (have to allocate lru->memcg[id])
        nr_to_alloc++;
}
rcu_read_unlock()
// allocate nr_to_alloc list_lru_node objects
rcu_read_lock()
list_for_each_entry_rcu(lru) {
    if (had to allocate lru->memcg[id])
        init lru->memcg[id] with preallocated list_lru_node
}
rcu_read_unlock()

but to do this we would need to allocate a temporary buffer to hold the
list_lru_node references, which can be very big - not good. Plus, the
code would get complicated.

So it seems we have to iterate over the list of list_lrus under a mutex -
at least that is my current understanding :-(

So, what I am going to do for the next iteration of this patchset is to
introduce an rw-semaphore and take it for reading on list_lru creation
and for writing on kmemcg creation. This will make concurrent mounts
possible, but mounting an fs will serialize with kmemcg creation on the
semaphore. I don't think it is that bad though, because creation of
kmemcgs is rather a rare event (isn't it?)
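
Roughly what I have in mind (a sketch only: the semaphore and the
spinlock protecting the list_add() are assumptions, and error/resize
handling as well as the allocation of the lru->memcg array itself are
glossed over):

static DECLARE_RWSEM(all_memcg_lrus_sem);
static DEFINE_SPINLOCK(all_memcg_lrus_lock);	/* protects the list_add below */
static LIST_HEAD(all_memcg_lrus);

/* list_lru creation (e.g. mount): readers may run concurrently */
int memcg_list_lru_init(struct list_lru *lru)
{
	struct mem_cgroup *memcg;
	int err = 0;

	down_read(&all_memcg_lrus_sem);
	for_each_mem_cgroup(memcg) {
		int memcg_id = memcg_cache_id(memcg);

		if (memcg_id < 0)
			continue;
		/* may sleep: no RCU read lock is held here */
		err = list_lru_memcg_alloc(lru, memcg_id);
		if (err) {
			mem_cgroup_iter_break(NULL, memcg);
			goto out;
		}
	}
	spin_lock(&all_memcg_lrus_lock);
	list_add(&lru->list, &all_memcg_lrus);
	spin_unlock(&all_memcg_lrus_lock);
out:
	up_read(&all_memcg_lrus_sem);
	return err;
}

/* kmemcg activation: the writer excludes list_lru creation entirely */
int memcg_init_all_lrus(int new_memcg_id)
{
	struct list_lru *lru;
	int err = 0;

	down_write(&all_memcg_lrus_sem);
	list_for_each_entry(lru, &all_memcg_lrus, list) {
		err = list_lru_memcg_alloc(lru, new_memcg_id);
		if (err)
			break;
	}
	up_write(&all_memcg_lrus_sem);
	return err;
}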

If anybody has a better idea, please share.

Thanks.

>
> At first glance, it looks correct, because:
>
> If we skip an lru while iterating over all_memcg_lrus in
> memcg_init_all_lrus(), it means it was created after the "activating"
> bit had been set and thus the per-memcg lru will be initialized in
> memcg_list_lru_init(). If we skip a memcg in memcg_list_lru_init(), this
> memcg will "see" the lru when memcg_init_all_lrus() iterates over the
> all_memcg_lrus list for it, because the lru must have been added to the
> list before the "activating" bit was set.
>
> This still needs elaboration, because I haven't examined the destruction
> paths yet, and it doesn't take into account the fact that
> memcg_limited_groups_array_size is not a constant (I guess I'll have to
> introduce an rw-semaphore for it, which should be OK because its updates
> are very rare), but I think I am on the right track.


Thread overview: 81+ messages
2013-12-09  8:05 [PATCH v13 00/16] kmemcg shrinkers Vladimir Davydov
2013-12-09  8:05 ` Vladimir Davydov
2013-12-09  8:05 ` [PATCH v13 01/16] memcg: make cache index determination more robust Vladimir Davydov
2013-12-09  8:05   ` Vladimir Davydov
2013-12-09  8:05 ` [PATCH v13 02/16] memcg: consolidate callers of memcg_cache_id Vladimir Davydov
2013-12-09  8:05   ` Vladimir Davydov
2013-12-09  8:05 ` [PATCH v13 03/16] memcg: move initialization to memcg creation Vladimir Davydov
2013-12-09  8:05   ` Vladimir Davydov
2013-12-09  8:05 ` [PATCH v13 04/16] memcg: move memcg_caches_array_size() function Vladimir Davydov
2013-12-09  8:05   ` Vladimir Davydov
2013-12-10  8:04   ` Glauber Costa
2013-12-10  8:04     ` Glauber Costa
2013-12-09  8:05 ` [PATCH v13 05/16] vmscan: move call to shrink_slab() to shrink_zones() Vladimir Davydov
2013-12-09  8:05   ` Vladimir Davydov
2013-12-09  8:05   ` Vladimir Davydov
2013-12-10  8:10   ` Glauber Costa
2013-12-10  8:10     ` Glauber Costa
2013-12-09  8:05 ` [PATCH v13 06/16] vmscan: remove shrink_control arg from do_try_to_free_pages() Vladimir Davydov
2013-12-09  8:05   ` Vladimir Davydov
2013-12-09  8:05 ` [PATCH v13 07/16] vmscan: call NUMA-unaware shrinkers irrespective of nodemask Vladimir Davydov
2013-12-09  8:05   ` Vladimir Davydov
2013-12-09  8:05   ` Vladimir Davydov
2013-12-09  8:05 ` [PATCH v13 08/16] mm: list_lru: require shrink_control in count, walk functions Vladimir Davydov
2013-12-09  8:05   ` Vladimir Davydov
2013-12-10  1:36   ` Dave Chinner
2013-12-10  1:36     ` Dave Chinner
2013-12-09  8:05 ` [PATCH v13 09/16] fs: consolidate {nr,free}_cached_objects args in shrink_control Vladimir Davydov
2013-12-09  8:05   ` Vladimir Davydov
2013-12-10  1:38   ` Dave Chinner
2013-12-10  1:38     ` Dave Chinner
2013-12-10  1:38     ` Dave Chinner
2013-12-09  8:05 ` [PATCH v13 10/16] vmscan: shrink slab on memcg pressure Vladimir Davydov
2013-12-09  8:05   ` Vladimir Davydov
2013-12-10  2:11   ` Dave Chinner
2013-12-10  2:11     ` Dave Chinner
2013-12-10  2:11     ` Dave Chinner
2013-12-09  8:05 ` [PATCH v13 11/16] mm: list_lru: add per-memcg lists Vladimir Davydov
2013-12-09  8:05   ` Vladimir Davydov
2013-12-10  5:00   ` Dave Chinner
2013-12-10  5:00     ` Dave Chinner
2013-12-10  5:00     ` Dave Chinner
2013-12-10 10:05     ` Vladimir Davydov
2013-12-10 10:05       ` Vladimir Davydov
2013-12-10 10:05       ` Vladimir Davydov
2013-12-12  1:40       ` Dave Chinner
2013-12-12  1:40         ` Dave Chinner
2013-12-12  9:50         ` Vladimir Davydov
2013-12-12  9:50           ` Vladimir Davydov
2013-12-12  9:50           ` Vladimir Davydov
2013-12-12 20:24           ` Vladimir Davydov
2013-12-12 20:24             ` Vladimir Davydov
2013-12-14 20:03             ` Vladimir Davydov
2013-12-14 20:03               ` Vladimir Davydov
2013-12-12 20:48     ` Glauber Costa
2013-12-12 20:48       ` Glauber Costa
2013-12-09  8:05 ` [PATCH v13 12/16] fs: mark list_lru based shrinkers memcg aware Vladimir Davydov
2013-12-09  8:05   ` Vladimir Davydov
2013-12-10  4:17   ` Dave Chinner
2013-12-10  4:17     ` Dave Chinner
2013-12-11 11:08     ` Steven Whitehouse
2013-12-11 11:08       ` Steven Whitehouse
2013-12-09  8:05 ` [PATCH v13 13/16] vmscan: take at least one pass with shrinkers Vladimir Davydov
2013-12-09  8:05   ` Vladimir Davydov
2013-12-10  4:18   ` Dave Chinner
2013-12-10  4:18     ` Dave Chinner
2013-12-10 11:50     ` Vladimir Davydov
2013-12-10 11:50       ` Vladimir Davydov
2013-12-10 11:50       ` Vladimir Davydov
2013-12-10 12:38       ` Glauber Costa
2013-12-10 12:38         ` Glauber Costa
2013-12-10 12:38         ` Glauber Costa
2013-12-09  8:05 ` [PATCH v13 14/16] vmpressure: in-kernel notifications Vladimir Davydov
2013-12-09  8:05   ` Vladimir Davydov
2013-12-10  8:12   ` Glauber Costa
2013-12-10  8:12     ` Glauber Costa
2013-12-09  8:05 ` [PATCH v13 15/16] memcg: reap dead memcgs upon global memory pressure Vladimir Davydov
2013-12-09  8:05   ` Vladimir Davydov
2013-12-09  8:05 ` [PATCH v13 16/16] memcg: flush memcg items upon memcg destruction Vladimir Davydov
2013-12-09  8:05   ` Vladimir Davydov
2013-12-10  8:02 ` [PATCH v13 00/16] kmemcg shrinkers Glauber Costa
2013-12-10  8:02   ` Glauber Costa
