* [PATCH v14 00/18] kmemcg shrinkers
From: Vladimir Davydov @ 2013-12-16 12:16 UTC
  To: dchinner, mhocko, hannes, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, glommer

Hi,

This is the 14th iteration of Glauber Costa's patch-set implementing slab
shrinking on memcg pressure. The main idea is to make the list_lru structure
used by most FS shrinkers per-memcg. When adding or removing an element from a
list_lru, we use the page information to figure out which memcg it belongs to
and route the element to the corresponding per-memcg list. This allows kmem
objects accounted to different memcgs to be scanned independently.
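
As a rough illustration of the routing idea only (the helpers below are
hypothetical and simplified, not the API actually introduced by the series):

	static bool memcg_aware_lru_add(struct list_lru *lru, struct list_head *item)
	{
		struct page *page = virt_to_page(item);			/* object -> its page */
		struct mem_cgroup *memcg = page_to_memcg(page);		/* hypothetical lookup */
		struct list_head *list = lru_list_of(lru, memcg);	/* hypothetical per-memcg list */
		bool added = list_empty(item);			/* item not on any list yet */

		if (added)
			list_add_tail(item, list);
		return added;
	}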

Please note that this patch-set triggers slab shrinking only when we hit the
user memory limit, so kmem allocations will still fail if we are below the
user memory limit but close to the kmem limit. I am going to address this in a
separate patch-set; for now, the only reason to set the kmem limit greater
than the user memory limit is to enable per-memcg slab accounting and reclaim.

The patch-set is based on top of Linux-3.13-rc4 and organized as follows:
 - patches 1-11 prepare vmscan, memcontrol, and list_lru for kmemcg reclaim;
 - patches 12, 13 implement the kmemcg reclaim core;
 - patch 14 makes the list_lru struct per-memcg and patch 15 marks the
   super_block shrinker as memcg-aware;
 - patches 16-18 slightly improve memcontrol behavior regarding mem reclaim.

Changes in v14:
 - do not change list_lru interface, introduce new shrink functions instead;
 - remove NUMA awareness from per-memcg LRUs;
 - improve synchronization between list_lru creation and kmemcg activation;
 - various small fixes/improvements and code cleanup.

Previous iterations of this patch-set can be found here:
 - https://lkml.org/lkml/2013/12/9/103 (v13)
 - https://lkml.org/lkml/2013/12/2/141 (v12)
 - https://lkml.org/lkml/2013/11/25/214 (v11)

Thanks.

Glauber Costa (7):
  memcg: make cache index determination more robust
  memcg: consolidate callers of memcg_cache_id
  memcg: move initialization to memcg creation
  vmscan: take at least one pass with shrinkers
  vmpressure: in-kernel notifications
  memcg: reap dead memcgs upon global memory pressure
  memcg: flush memcg items upon memcg destruction

Vladimir Davydov (11):
  memcg: make for_each_mem_cgroup macros public
  memcg: remove KMEM_ACCOUNTED_ACTIVATED flag
  memcg: rework memcg_update_kmem_limit synchronization
  list_lru, shrinkers: introduce list_lru_shrink_{count,walk}
  fs: consolidate {nr,free}_cached_objects args in shrink_control
  vmscan: move call to shrink_slab() to shrink_zones()
  vmscan: remove shrink_control arg from do_try_to_free_pages()
  vmscan: call NUMA-unaware shrinkers irrespective of nodemask
  vmscan: shrink slab on memcg pressure
  list_lru: add per-memcg lists
  fs: make shrinker memcg aware

 fs/dcache.c                |   14 +-
 fs/gfs2/quota.c            |    6 +-
 fs/inode.c                 |    7 +-
 fs/internal.h              |    7 +-
 fs/super.c                 |   34 ++-
 fs/xfs/xfs_buf.c           |    7 +-
 fs/xfs/xfs_qm.c            |    7 +-
 fs/xfs/xfs_super.c         |    7 +-
 include/linux/fs.h         |    6 +-
 include/linux/list_lru.h   |  112 +++++++--
 include/linux/memcontrol.h |   50 ++++
 include/linux/shrinker.h   |   10 +-
 include/linux/vmpressure.h |    5 +
 mm/list_lru.c              |  257 +++++++++++++++++--
 mm/memcontrol.c            |  584 ++++++++++++++++++++++++++++++++------------
 mm/vmpressure.c            |   53 +++-
 mm/vmscan.c                |  170 ++++++++-----
 17 files changed, 1017 insertions(+), 319 deletions(-)

-- 
1.7.10.4


* [PATCH v14 01/18] memcg: make cache index determination more robust
From: Vladimir Davydov @ 2013-12-16 12:16 UTC
  To: dchinner, mhocko, hannes, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, glommer,
	Balbir Singh, KAMEZAWA Hiroyuki

From: Glauber Costa <glommer@openvz.org>

I caught myself doing something like the following outside memcg core:

	memcg_id = -1;
	if (memcg && memcg_kmem_is_active(memcg))
		memcg_id = memcg_cache_id(memcg);

to be able to handle all possible memcgs in a sane manner. In particular, the
root cache will have kmemcg_id = -1 (simply because we don't call memcg_kmem_init
for the root cache, since it is not limitable). We have always coped with that by
making sure we sanitize which cache is passed to memcg_cache_id. Although this
example is given for root, what we really need to know is whether or not a
cache is kmem-active.

But outside the memcg core, testing for root, for instance, is not trivial since
we don't export mem_cgroup_is_root. I ended up realizing that these tests really
belong inside memcg_cache_id. This patch moves a similar but stronger test
inside memcg_cache_id and makes sure it always returns a meaningful value.
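
To illustrate the effect on callers (a sketch based on the snippet above, not
code taken verbatim from this patch):

	/* Before: each caller had to open-code the kmem-active check. */
	memcg_id = -1;
	if (memcg && memcg_kmem_is_active(memcg))
		memcg_id = memcg_cache_id(memcg);

	/* After: the check lives inside memcg_cache_id() itself. */
	memcg_id = memcg_cache_id(memcg);	/* -1 unless the memcg accounts kmem */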

Signed-off-by: Glauber Costa <glommer@openvz.org>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bf5e894..3408852 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3076,7 +3076,9 @@ void memcg_cache_list_add(struct mem_cgroup *memcg, struct kmem_cache *cachep)
  */
 int memcg_cache_id(struct mem_cgroup *memcg)
 {
-	return memcg ? memcg->kmemcg_id : -1;
+	if (!memcg || !memcg_can_account_kmem(memcg))
+		return -1;
+	return memcg->kmemcg_id;
 }
 
 /*
-- 
1.7.10.4


* [PATCH v14 02/18] memcg: consolidate callers of memcg_cache_id
From: Vladimir Davydov @ 2013-12-16 12:16 UTC
  To: dchinner, mhocko, hannes, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, glommer,
	Balbir Singh, KAMEZAWA Hiroyuki

From: Glauber Costa <glommer@openvz.org>

Each caller of memcg_cache_id ends up sanitizing its parameters in its own way.
Now that memcg_cache_id itself is more robust, we can consolidate this.

Also, as suggested by Michal, a special helper, memcg_cache_idx, is used when the
result is expected to be used directly as an array index, to make sure we never
index with a negative value.
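
The intended split, sketched from the hunks below (not verbatim patch code):

	/* memcg_cache_id() may legitimately return -1, so callers test for it: */
	idx = memcg_cache_id(memcg);
	if (idx < 0)
		goto out;	/* not a kmem-limited memcg */

	/* memcg_cache_idx() is for memcg-core paths that already hold a valid id
	 * and use it directly as an array index (it BUG()s otherwise): */
	root->memcg_params->memcg_caches[memcg_cache_idx(memcg)] = NULL;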

Signed-off-by: Glauber Costa <glommer@openvz.org>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   49 +++++++++++++++++++++++++++++--------------------
 1 file changed, 29 insertions(+), 20 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3408852..35a367c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2963,6 +2963,30 @@ static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
 }
 
 /*
+ * helper for acessing a memcg's index. It will be used as an index in the
+ * child cache array in kmem_cache, and also to derive its name. This function
+ * will return -1 when this is not a kmem-limited memcg.
+ */
+int memcg_cache_id(struct mem_cgroup *memcg)
+{
+	if (!memcg || !memcg_can_account_kmem(memcg))
+		return -1;
+	return memcg->kmemcg_id;
+}
+
+/*
+ * This helper around memcg_cache_id is not intented for use outside memcg
+ * core. It is meant for places where the cache id is used directly as an array
+ * index
+ */
+static int memcg_cache_idx(struct mem_cgroup *memcg)
+{
+	int ret = memcg_cache_id(memcg);
+	BUG_ON(ret < 0);
+	return ret;
+}
+
+/*
  * This is a bit cumbersome, but it is rarely used and avoids a backpointer
  * in the memcg_cache_params struct.
  */
@@ -2972,7 +2996,7 @@ static struct kmem_cache *memcg_params_to_cache(struct memcg_cache_params *p)
 
 	VM_BUG_ON(p->is_root_cache);
 	cachep = p->root_cache;
-	return cache_from_memcg_idx(cachep, memcg_cache_id(p->memcg));
+	return cache_from_memcg_idx(cachep, memcg_cache_idx(p->memcg));
 }
 
 #ifdef CONFIG_SLABINFO
@@ -3070,18 +3094,6 @@ void memcg_cache_list_add(struct mem_cgroup *memcg, struct kmem_cache *cachep)
 }
 
 /*
- * helper for acessing a memcg's index. It will be used as an index in the
- * child cache array in kmem_cache, and also to derive its name. This function
- * will return -1 when this is not a kmem-limited memcg.
- */
-int memcg_cache_id(struct mem_cgroup *memcg)
-{
-	if (!memcg || !memcg_can_account_kmem(memcg))
-		return -1;
-	return memcg->kmemcg_id;
-}
-
-/*
  * This ends up being protected by the set_limit mutex, during normal
  * operation, because that is its main call site.
  *
@@ -3243,7 +3255,7 @@ void memcg_release_cache(struct kmem_cache *s)
 		goto out;
 
 	memcg = s->memcg_params->memcg;
-	id  = memcg_cache_id(memcg);
+	id = memcg_cache_idx(memcg);
 
 	root = s->memcg_params->root_cache;
 	root->memcg_params->memcg_caches[id] = NULL;
@@ -3406,9 +3418,7 @@ static struct kmem_cache *memcg_create_kmem_cache(struct mem_cgroup *memcg,
 	struct kmem_cache *new_cachep;
 	int idx;
 
-	BUG_ON(!memcg_can_account_kmem(memcg));
-
-	idx = memcg_cache_id(memcg);
+	idx = memcg_cache_idx(memcg);
 
 	mutex_lock(&memcg_cache_mutex);
 	new_cachep = cache_from_memcg_idx(cachep, idx);
@@ -3581,10 +3591,9 @@ struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep,
 	rcu_read_lock();
 	memcg = mem_cgroup_from_task(rcu_dereference(current->mm->owner));
 
-	if (!memcg_can_account_kmem(memcg))
-		goto out;
-
 	idx = memcg_cache_id(memcg);
+	if (idx < 0)
+		goto out;
 
 	/*
 	 * barrier to mare sure we're always seeing the up to date value.  The
-- 
1.7.10.4


* [PATCH v14 03/18] memcg: move initialization to memcg creation
From: Vladimir Davydov @ 2013-12-16 12:16 UTC
  To: dchinner, mhocko, hannes, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, glommer,
	Balbir Singh, KAMEZAWA Hiroyuki

From: Glauber Costa <glommer@openvz.org>

Those structures are only used for memcgs that are effectively using
kmemcg. However, in a later patch I intend to scan that list
unconditionally (an empty list meaning no kmem caches are present), which
simplifies the code a lot.

So move the initialization to memcg creation time.
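
A hypothetical example of that later, unconditional scan (not part of this
patch; shown only to motivate initializing the list unconditionally):

	struct memcg_cache_params *params;

	mutex_lock(&memcg->slab_caches_mutex);
	list_for_each_entry(params, &memcg->memcg_slab_caches, list)
		/* reap/flush the child cache; an empty list simply does nothing */;
	mutex_unlock(&memcg->slab_caches_mutex);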

Signed-off-by: Glauber Costa <glommer@openvz.org>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 35a367c..8fdb239 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3125,8 +3125,6 @@ int memcg_update_cache_sizes(struct mem_cgroup *memcg)
 	}
 
 	memcg->kmemcg_id = num;
-	INIT_LIST_HEAD(&memcg->memcg_slab_caches);
-	mutex_init(&memcg->slab_caches_mutex);
 	return 0;
 }
 
@@ -5912,6 +5910,8 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 {
 	int ret;
 
+	INIT_LIST_HEAD(&memcg->memcg_slab_caches);
+	mutex_init(&memcg->slab_caches_mutex);
 	memcg->kmemcg_id = -1;
 	ret = memcg_propagate_kmem(memcg);
 	if (ret)
-- 
1.7.10.4


* [PATCH v14 04/18] memcg: make for_each_mem_cgroup macros public
From: Vladimir Davydov @ 2013-12-16 12:16 UTC
  To: dchinner, mhocko, hannes, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, glommer,
	Balbir Singh, KAMEZAWA Hiroyuki

I am going to use these macros in the next patches, so let's move them to
the header. These macros are very handy, and they depend only on
mem_cgroup_iter(), which is already public, so the move seems worth it.
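
A quick usage sketch (do_per_memcg_work() and need_to_stop() are hypothetical
placeholders, not part of this patch):

	struct mem_cgroup *iter;

	for_each_mem_cgroup(iter) {
		do_per_memcg_work(iter);
		if (need_to_stop(iter)) {
			/* an early exit must drop the iterator's reference */
			mem_cgroup_iter_break(NULL, iter);
			break;
		}
	}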

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h |   15 +++++++++++++++
 mm/memcontrol.c            |   15 ---------------
 2 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b3e7a66..e3efab2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -53,6 +53,21 @@ struct mem_cgroup_reclaim_cookie {
 	unsigned int generation;
 };
 
+/*
+ * Iteration constructs for visiting all cgroups (under a tree).  If
+ * loops are exited prematurely (break), mem_cgroup_iter_break() must
+ * be used for reference counting.
+ */
+#define for_each_mem_cgroup_tree(iter, root)		\
+	for (iter = mem_cgroup_iter(root, NULL, NULL);	\
+	     iter != NULL;				\
+	     iter = mem_cgroup_iter(root, iter, NULL))
+
+#define for_each_mem_cgroup(iter)			\
+	for (iter = mem_cgroup_iter(NULL, NULL, NULL);	\
+	     iter != NULL;				\
+	     iter = mem_cgroup_iter(NULL, iter, NULL))
+
 #ifdef CONFIG_MEMCG
 /*
  * All "charge" functions with gfp_mask should use GFP_KERNEL or
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8fdb239..b6ec029 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1261,21 +1261,6 @@ void mem_cgroup_iter_break(struct mem_cgroup *root,
 		css_put(&prev->css);
 }
 
-/*
- * Iteration constructs for visiting all cgroups (under a tree).  If
- * loops are exited prematurely (break), mem_cgroup_iter_break() must
- * be used for reference counting.
- */
-#define for_each_mem_cgroup_tree(iter, root)		\
-	for (iter = mem_cgroup_iter(root, NULL, NULL);	\
-	     iter != NULL;				\
-	     iter = mem_cgroup_iter(root, iter, NULL))
-
-#define for_each_mem_cgroup(iter)			\
-	for (iter = mem_cgroup_iter(NULL, NULL, NULL);	\
-	     iter != NULL;				\
-	     iter = mem_cgroup_iter(NULL, iter, NULL))
-
 void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
 {
 	struct mem_cgroup *memcg;
-- 
1.7.10.4


* [PATCH v14 05/18] memcg: remove KMEM_ACCOUNTED_ACTIVATED flag
From: Vladimir Davydov @ 2013-12-16 12:16 UTC
  To: dchinner, mhocko, hannes, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, glommer,
	Balbir Singh, KAMEZAWA Hiroyuki

Currently we have two state bits in mem_cgroup::kmem_account_flags
regarding kmem accounting activation, ACTIVATED and ACTIVE. We start
kmem accounting only if both flags are set (memcg_can_account_kmem()),
plus throughout the code there are several places where we check only
the ACTIVE flag, but we never check the ACTIVATED flag alone. These
flags are both set from memcg_update_kmem_limit() under the
set_limit_mutex, the ACTIVE flag always being set after ACTIVATED, and
they never get cleared. That said, checking whether both flags are set is
equivalent to checking only the ACTIVE flag, and since there are no checks
for the ACTIVATED flag alone, we can safely remove the ACTIVATED flag
without changing any behavior.

Let's try to understand why these flags were introduced in the first place.
The purpose of the ACTIVE flag is clear: it states that kmem should be
accounted to the cgroup. The only requirement for it is that it should
be set after we have fully initialized kmem accounting bits for the
cgroup and patched all static branches relating to kmem accounting.
Since we always check if static branch is enabled before actually
considering if we should account (otherwise we wouldn't benefit from
static branching), this guarantees us that we won't skip a commit or
uncharge after a charge due to an unpatched static branch.

Now let's move on to the ACTIVATED bit. As I proved in the beginning of
this message, it is absolutely useless, and removing it will change
nothing. So what was the reason for introducing it?

The ACTIVATED flag was introduced by commit a8964b9b ("memcg: use static
branches when code not in use") in order to guarantee that
static_key_slow_inc(&memcg_kmem_enabled_key) would be called only once
for each memory cgroup when its kmem accounting was activated. The point
was that at that time the memcg_update_kmem_limit() function's work-flow
looked like this:

        bool must_inc_static_branch = false;

        cgroup_lock();
        mutex_lock(&set_limit_mutex);
        if (!memcg->kmem_account_flags && val != RESOURCE_MAX) {
                /* The kmem limit is set for the first time */
                ret = res_counter_set_limit(&memcg->kmem, val);

                memcg_kmem_set_activated(memcg);
                must_inc_static_branch = true;
        } else
                ret = res_counter_set_limit(&memcg->kmem, val);
        mutex_unlock(&set_limit_mutex);
        cgroup_unlock();

        if (must_inc_static_branch) {
                /* We can't do this under cgroup_lock */
                static_key_slow_inc(&memcg_kmem_enabled_key);
                memcg_kmem_set_active(memcg);
        }

So without the ACTIVATED flag we could race with other threads trying to
set the limit and increment the static-branching ref-counter more than
once. Today we call the whole memcg_update_kmem_limit() function under
the set_limit_mutex, so this race is impossible.

Now that we understand why the ACTIVATED bit was introduced, why we no
longer need it, and that removing it changes nothing anyway, let's get
rid of it.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   28 ++--------------------------
 1 file changed, 2 insertions(+), 26 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b6ec029..9bf11bf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -343,15 +343,10 @@ static size_t memcg_size(void)
 
 /* internal only representation about the status of kmem accounting. */
 enum {
-	KMEM_ACCOUNTED_ACTIVE = 0, /* accounted by this cgroup itself */
-	KMEM_ACCOUNTED_ACTIVATED, /* static key enabled. */
+	KMEM_ACCOUNTED_ACTIVE, /* accounted by this cgroup itself */
 	KMEM_ACCOUNTED_DEAD, /* dead memcg with pending kmem charges */
 };
 
-/* We account when limit is on, but only after call sites are patched */
-#define KMEM_ACCOUNTED_MASK \
-		((1 << KMEM_ACCOUNTED_ACTIVE) | (1 << KMEM_ACCOUNTED_ACTIVATED))
-
 #ifdef CONFIG_MEMCG_KMEM
 static inline void memcg_kmem_set_active(struct mem_cgroup *memcg)
 {
@@ -363,16 +358,6 @@ static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
 	return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
 }
 
-static void memcg_kmem_set_activated(struct mem_cgroup *memcg)
-{
-	set_bit(KMEM_ACCOUNTED_ACTIVATED, &memcg->kmem_account_flags);
-}
-
-static void memcg_kmem_clear_activated(struct mem_cgroup *memcg)
-{
-	clear_bit(KMEM_ACCOUNTED_ACTIVATED, &memcg->kmem_account_flags);
-}
-
 static void memcg_kmem_mark_dead(struct mem_cgroup *memcg)
 {
 	/*
@@ -2944,7 +2929,7 @@ static DEFINE_MUTEX(set_limit_mutex);
 static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
 {
 	return !mem_cgroup_disabled() && !mem_cgroup_is_root(memcg) &&
-		(memcg->kmem_account_flags & KMEM_ACCOUNTED_MASK);
+		memcg_kmem_is_active(memcg);
 }
 
 /*
@@ -3093,19 +3078,10 @@ int memcg_update_cache_sizes(struct mem_cgroup *memcg)
 				0, MEMCG_CACHES_MAX_SIZE, GFP_KERNEL);
 	if (num < 0)
 		return num;
-	/*
-	 * After this point, kmem_accounted (that we test atomically in
-	 * the beginning of this conditional), is no longer 0. This
-	 * guarantees only one process will set the following boolean
-	 * to true. We don't need test_and_set because we're protected
-	 * by the set_limit_mutex anyway.
-	 */
-	memcg_kmem_set_activated(memcg);
 
 	ret = memcg_update_all_caches(num+1);
 	if (ret) {
 		ida_simple_remove(&kmem_limited_groups, num);
-		memcg_kmem_clear_activated(memcg);
 		return ret;
 	}
 
-- 
1.7.10.4


* [PATCH v14 06/18] memcg: rework memcg_update_kmem_limit synchronization
From: Vladimir Davydov @ 2013-12-16 12:16 UTC
  To: dchinner, mhocko, hannes, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, glommer,
	Balbir Singh, KAMEZAWA Hiroyuki

Currently we take both the memcg_create_mutex and the set_limit_mutex
when we enable kmem accounting for a memory cgroup, which makes kmem
activation events serialize with both memcg creations and other memcg
limit updates (memory.limit, memory.memsw.limit). However, there is no
point in holding both of these mutexes for the whole kmem initialization
process.

First, the set_limit_mutex was introduced to keep the memory.limit and
memory.memsw.limit values in sync. Since memory.kmem.limit can be set
independently of them, it is better to introduce a separate mutex to
synchronize against concurrent kmem limit updates.

Second, we take the memcg_create_mutex in order to make sure all
children of this memcg will be kmem-active as well. To achieve that, it
is enough to take this mutex only around the call to
memcg_has_children(). This guarantees that if a child is added after we
check that the memcg has no children, the newly added cgroup will see
its parent kmem-active (provided the parent's activation succeeded) and
will activate kmem accounting for itself.
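
The resulting scheme, condensed from the hunks below for readability:

	static DEFINE_MUTEX(activate_kmem_mutex);	/* serializes kmem activation only */

	/* memcg_create_mutex is now held just around the tasks/children check: */
	mutex_lock(&memcg_create_mutex);
	if (cgroup_task_count(memcg->css.cgroup) || memcg_has_children(memcg))
		err = -EBUSY;
	mutex_unlock(&memcg_create_mutex);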

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |  180 +++++++++++++++++++++++++++++--------------------------
 1 file changed, 94 insertions(+), 86 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9bf11bf..f2372b0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3063,32 +3063,6 @@ void memcg_cache_list_add(struct mem_cgroup *memcg, struct kmem_cache *cachep)
 	mutex_unlock(&memcg->slab_caches_mutex);
 }
 
-/*
- * This ends up being protected by the set_limit mutex, during normal
- * operation, because that is its main call site.
- *
- * But when we create a new cache, we can call this as well if its parent
- * is kmem-limited. That will have to hold set_limit_mutex as well.
- */
-int memcg_update_cache_sizes(struct mem_cgroup *memcg)
-{
-	int num, ret;
-
-	num = ida_simple_get(&kmem_limited_groups,
-				0, MEMCG_CACHES_MAX_SIZE, GFP_KERNEL);
-	if (num < 0)
-		return num;
-
-	ret = memcg_update_all_caches(num+1);
-	if (ret) {
-		ida_simple_remove(&kmem_limited_groups, num);
-		return ret;
-	}
-
-	memcg->kmemcg_id = num;
-	return 0;
-}
-
 static size_t memcg_caches_array_size(int num_groups)
 {
 	ssize_t size;
@@ -5119,11 +5093,28 @@ static ssize_t mem_cgroup_read(struct cgroup_subsys_state *css,
 	return simple_read_from_buffer(buf, nbytes, ppos, str, len);
 }
 
-static int memcg_update_kmem_limit(struct cgroup_subsys_state *css, u64 val)
-{
-	int ret = -EINVAL;
 #ifdef CONFIG_MEMCG_KMEM
-	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+static DEFINE_MUTEX(activate_kmem_mutex);
+
+/* should be called with activate_kmem_mutex held */
+static int __memcg_activate_kmem(struct mem_cgroup *memcg,
+				 unsigned long long limit)
+{
+	int err = 0;
+	int memcg_id;
+
+	if (memcg_kmem_is_active(memcg))
+		return 0;
+
+	/*
+	 * We are going to allocate memory for data shared by all memory
+	 * cgroups so let's stop accounting here.
+	 */
+	memcg_stop_kmem_account();
+
+	err = res_counter_set_limit(&memcg->kmem, limit);
+	VM_BUG_ON(err);
+
 	/*
 	 * For simplicity, we won't allow this to be disabled.  It also can't
 	 * be changed if the cgroup has children already, or if tasks had
@@ -5137,72 +5128,91 @@ static int memcg_update_kmem_limit(struct cgroup_subsys_state *css, u64 val)
 	 * of course permitted.
 	 */
 	mutex_lock(&memcg_create_mutex);
-	mutex_lock(&set_limit_mutex);
-	if (!memcg->kmem_account_flags && val != RES_COUNTER_MAX) {
-		if (cgroup_task_count(css->cgroup) || memcg_has_children(memcg)) {
-			ret = -EBUSY;
-			goto out;
-		}
-		ret = res_counter_set_limit(&memcg->kmem, val);
-		VM_BUG_ON(ret);
+	if (cgroup_task_count(memcg->css.cgroup) || memcg_has_children(memcg))
+		err = -EBUSY;
+	mutex_unlock(&memcg_create_mutex);
+	if (err)
+		goto out_reset_limit;
+
+	memcg_id = ida_simple_get(&kmem_limited_groups,
+				  0, MEMCG_CACHES_MAX_SIZE, GFP_KERNEL);
+	if (memcg_id < 0) {
+		err = memcg_id;
+		goto out_reset_limit;
+	}
 
-		ret = memcg_update_cache_sizes(memcg);
-		if (ret) {
-			res_counter_set_limit(&memcg->kmem, RES_COUNTER_MAX);
-			goto out;
-		}
-		static_key_slow_inc(&memcg_kmem_enabled_key);
-		/*
-		 * setting the active bit after the inc will guarantee no one
-		 * starts accounting before all call sites are patched
-		 */
-		memcg_kmem_set_active(memcg);
-	} else
-		ret = res_counter_set_limit(&memcg->kmem, val);
+	/*
+	 * Make sure we have enough space for this cgroup in each kmem_cache's
+	 * memcg_params array.
+	 */
+	err = memcg_update_all_caches(memcg_id + 1);
+	if (err)
+		goto out_rmid;
+
+	memcg->kmemcg_id = memcg_id;
+
+	static_key_slow_inc(&memcg_kmem_enabled_key);
+	/*
+	 * Setting the active bit after enabling static branching will
+	 * guarantee no one starts accounting before all call sites are
+	 * patched.
+	 */
+	memcg_kmem_set_active(memcg);
 out:
-	mutex_unlock(&set_limit_mutex);
-	mutex_unlock(&memcg_create_mutex);
-#endif
+	memcg_resume_kmem_account();
+	return err;
+
+out_rmid:
+	ida_simple_remove(&kmem_limited_groups, memcg_id);
+out_reset_limit:
+	res_counter_set_limit(&memcg->kmem, RES_COUNTER_MAX);
+	goto out;
+}
+
+static int memcg_activate_kmem(struct mem_cgroup *memcg,
+			       unsigned long long limit)
+{
+	int ret;
+
+	mutex_lock(&activate_kmem_mutex);
+	ret = __memcg_activate_kmem(memcg, limit);
+	mutex_unlock(&activate_kmem_mutex);
+	return ret;
+}
+
+static int memcg_update_kmem_limit(struct mem_cgroup *memcg,
+				   unsigned long long val)
+{
+	int ret;
+
+	if (!memcg_kmem_is_active(memcg))
+		ret = memcg_activate_kmem(memcg, val);
+	else
+		ret = res_counter_set_limit(&memcg->kmem, val);
 	return ret;
 }
 
-#ifdef CONFIG_MEMCG_KMEM
 static int memcg_propagate_kmem(struct mem_cgroup *memcg)
 {
 	int ret = 0;
 	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
-	if (!parent)
-		goto out;
 
-	memcg->kmem_account_flags = parent->kmem_account_flags;
-	/*
-	 * When that happen, we need to disable the static branch only on those
-	 * memcgs that enabled it. To achieve this, we would be forced to
-	 * complicate the code by keeping track of which memcgs were the ones
-	 * that actually enabled limits, and which ones got it from its
-	 * parents.
-	 *
-	 * It is a lot simpler just to do static_key_slow_inc() on every child
-	 * that is accounted.
-	 */
-	if (!memcg_kmem_is_active(memcg))
+	if (!parent)
 		goto out;
 
-	/*
-	 * __mem_cgroup_free() will issue static_key_slow_dec() because this
-	 * memcg is active already. If the later initialization fails then the
-	 * cgroup core triggers the cleanup so we do not have to do it here.
-	 */
-	static_key_slow_inc(&memcg_kmem_enabled_key);
-
-	mutex_lock(&set_limit_mutex);
-	memcg_stop_kmem_account();
-	ret = memcg_update_cache_sizes(memcg);
-	memcg_resume_kmem_account();
-	mutex_unlock(&set_limit_mutex);
+	mutex_lock(&activate_kmem_mutex);
+	if (memcg_kmem_is_active(parent))
+		ret = __memcg_activate_kmem(memcg, RES_COUNTER_MAX);
+	mutex_unlock(&activate_kmem_mutex);
 out:
 	return ret;
 }
+#else
+static inline int memcg_update_kmem_limit(struct mem_cgroup *memcg,
+					  unsigned long long val)
+{
+	return -EINVAL;
+}
 #endif /* CONFIG_MEMCG_KMEM */
 
 /*
@@ -5236,7 +5246,7 @@ static int mem_cgroup_write(struct cgroup_subsys_state *css, struct cftype *cft,
 		else if (type == _MEMSWAP)
 			ret = mem_cgroup_resize_memsw_limit(memcg, val);
 		else if (type == _KMEM)
-			ret = memcg_update_kmem_limit(css, val);
+			ret = memcg_update_kmem_limit(memcg, val);
 		else
 			return -EINVAL;
 		break;
@@ -6253,7 +6263,6 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 	struct mem_cgroup *parent = mem_cgroup_from_css(css_parent(css));
-	int error = 0;
 
 	if (css->cgroup->id > MEM_CGROUP_ID_MAX)
 		return -ENOSPC;
@@ -6288,10 +6297,9 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
 		if (parent != root_mem_cgroup)
 			mem_cgroup_subsys.broken_hierarchy = true;
 	}
-
-	error = memcg_init_kmem(memcg, &mem_cgroup_subsys);
 	mutex_unlock(&memcg_create_mutex);
-	return error;
+
+	return memcg_init_kmem(memcg, &mem_cgroup_subsys);
 }
 
 /*
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v14 07/18] list_lru, shrinkers: introduce list_lru_shrink_{count,walk}
  2013-12-16 12:16 ` Vladimir Davydov
@ 2013-12-16 12:16   ` Vladimir Davydov
  -1 siblings, 0 replies; 64+ messages in thread
From: Vladimir Davydov @ 2013-12-16 12:16 UTC (permalink / raw)
  To: dchinner, mhocko, hannes, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, glommer, Al Viro

NUMA aware slab shrinkers use the list_lru structure to distribute
objects coming from different NUMA nodes to different lists. Whenever
such a shrinker needs to count or scan objects from a particular node,
it issues commands like this:

        count = list_lru_count_node(lru, sc->nid);
        freed = list_lru_walk_node(lru, sc->nid, isolate_func,
                                   isolate_arg, &sc->nr_to_scan);

where sc is an instance of the shrink_control structure passed to it
from vmscan.

To simplify this, let's add special list_lru functions for shrinkers to
use, list_lru_shrink_count() and list_lru_shrink_walk(), which take the
nid and nr_to_scan arguments from the shrink_control structure.

This will also allow us to avoid patching shrinkers that use list_lru
when we make shrink_slab() per-memcg - all we will have to do is extend
the shrink_control structure to include the memcg to scan from and make
list_lru_shrink_{count,walk} handle this appropriately.

Thanks to David Chinner for the tip.
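
For illustration, a shrinker built on top of these helpers would look
roughly like the sketch below. The my_lru, my_isolate and my_dispose_list
names are made up for the example; the real conversions are in the diff
that follows.

        static unsigned long my_shrink_count(struct shrinker *shrink,
                                             struct shrink_control *sc)
        {
                /* the node to count is taken from sc inside the helper */
                return list_lru_shrink_count(&my_lru, sc);
        }

        static unsigned long my_shrink_scan(struct shrinker *shrink,
                                            struct shrink_control *sc)
        {
                LIST_HEAD(dispose);
                unsigned long freed;

                if (!(sc->gfp_mask & __GFP_FS))
                        return SHRINK_STOP;

                /* sc carries both nid and nr_to_scan, so no extra args */
                freed = list_lru_shrink_walk(&my_lru, sc,
                                             my_isolate, &dispose);
                my_dispose_list(&dispose);
                return freed;
        }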

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/dcache.c              |   14 ++++++--------
 fs/gfs2/quota.c          |    6 +++---
 fs/inode.c               |    7 +++----
 fs/internal.h            |    7 +++----
 fs/super.c               |   22 ++++++++++------------
 fs/xfs/xfs_buf.c         |    7 +++----
 fs/xfs/xfs_qm.c          |    7 +++----
 include/linux/list_lru.h |   16 ++++++++++++++++
 8 files changed, 47 insertions(+), 39 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 6055d61..7e66fca 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -972,24 +972,22 @@ dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg)
 /**
  * prune_dcache_sb - shrink the dcache
  * @sb: superblock
- * @nr_to_scan : number of entries to try to free
- * @nid: which node to scan for freeable entities
+ * @sc: shrink control, passed to list_lru_shrink_walk()
  *
- * Attempt to shrink the superblock dcache LRU by @nr_to_scan entries. This is
- * done when we need more memory an called from the superblock shrinker
+ * Attempt to shrink the superblock dcache LRU by @sc->nr_to_scan entries. This
+ * is done when we need more memory and called from the superblock shrinker
  * function.
  *
  * This function may fail to free any resources if all the dentries are in
  * use.
  */
-long prune_dcache_sb(struct super_block *sb, unsigned long nr_to_scan,
-		     int nid)
+long prune_dcache_sb(struct super_block *sb, struct shrink_control *sc)
 {
 	LIST_HEAD(dispose);
 	long freed;
 
-	freed = list_lru_walk_node(&sb->s_dentry_lru, nid, dentry_lru_isolate,
-				       &dispose, &nr_to_scan);
+	freed = list_lru_shrink_walk(&sb->s_dentry_lru, sc,
+				     dentry_lru_isolate, &dispose);
 	shrink_dentry_list(&dispose);
 	return freed;
 }
diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index 98236d0..09a32c4 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -132,8 +132,8 @@ static unsigned long gfs2_qd_shrink_scan(struct shrinker *shrink,
 	if (!(sc->gfp_mask & __GFP_FS))
 		return SHRINK_STOP;
 
-	freed = list_lru_walk_node(&gfs2_qd_lru, sc->nid, gfs2_qd_isolate,
-				   &dispose, &sc->nr_to_scan);
+	freed = list_lru_shrink_walk(&gfs2_qd_lru, sc,
+				     gfs2_qd_isolate, &dispose);
 
 	gfs2_qd_dispose(&dispose);
 
@@ -143,7 +143,7 @@ static unsigned long gfs2_qd_shrink_scan(struct shrinker *shrink,
 static unsigned long gfs2_qd_shrink_count(struct shrinker *shrink,
 					  struct shrink_control *sc)
 {
-	return vfs_pressure_ratio(list_lru_count_node(&gfs2_qd_lru, sc->nid));
+	return vfs_pressure_ratio(list_lru_shrink_count(&gfs2_qd_lru, sc));
 }
 
 struct shrinker gfs2_qd_shrinker = {
diff --git a/fs/inode.c b/fs/inode.c
index 4bcdad3..37962eb 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -748,14 +748,13 @@ inode_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg)
  * to trim from the LRU. Inodes to be freed are moved to a temporary list and
  * then are freed outside inode_lock by dispose_list().
  */
-long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan,
-		     int nid)
+long prune_icache_sb(struct super_block *sb, struct shrink_control *sc)
 {
 	LIST_HEAD(freeable);
 	long freed;
 
-	freed = list_lru_walk_node(&sb->s_inode_lru, nid, inode_lru_isolate,
-				       &freeable, &nr_to_scan);
+	freed = list_lru_shrink_walk(&sb->s_inode_lru, sc,
+				     inode_lru_isolate, &freeable);
 	dispose_list(&freeable);
 	return freed;
 }
diff --git a/fs/internal.h b/fs/internal.h
index 4657424..3db5f6e 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -14,6 +14,7 @@ struct file_system_type;
 struct linux_binprm;
 struct path;
 struct mount;
+struct shrink_control;
 
 /*
  * block_dev.c
@@ -107,8 +108,7 @@ extern int open_check_o_direct(struct file *f);
  * inode.c
  */
 extern spinlock_t inode_sb_list_lock;
-extern long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan,
-			    int nid);
+extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc);
 extern void inode_add_lru(struct inode *inode);
 
 /*
@@ -125,8 +125,7 @@ extern int invalidate_inodes(struct super_block *, bool);
  */
 extern struct dentry *__d_alloc(struct super_block *, const struct qstr *);
 extern int d_set_mounted(struct dentry *dentry);
-extern long prune_dcache_sb(struct super_block *sb, unsigned long nr_to_scan,
-			    int nid);
+extern long prune_dcache_sb(struct super_block *sb, struct shrink_control *sc);
 
 /*
  * read_write.c
diff --git a/fs/super.c b/fs/super.c
index e5f6c2c..13c4c92 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -78,27 +78,27 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 	if (sb->s_op->nr_cached_objects)
 		fs_objects = sb->s_op->nr_cached_objects(sb, sc->nid);
 
-	inodes = list_lru_count_node(&sb->s_inode_lru, sc->nid);
-	dentries = list_lru_count_node(&sb->s_dentry_lru, sc->nid);
+	inodes = list_lru_shrink_count(&sb->s_inode_lru, sc);
+	dentries = list_lru_shrink_count(&sb->s_dentry_lru, sc);
 	total_objects = dentries + inodes + fs_objects + 1;
 
 	/* proportion the scan between the caches */
 	dentries = mult_frac(sc->nr_to_scan, dentries, total_objects);
 	inodes = mult_frac(sc->nr_to_scan, inodes, total_objects);
+	fs_objects = mult_frac(sc->nr_to_scan, fs_objects, total_objects);
 
 	/*
 	 * prune the dcache first as the icache is pinned by it, then
 	 * prune the icache, followed by the filesystem specific caches
 	 */
-	freed = prune_dcache_sb(sb, dentries, sc->nid);
-	freed += prune_icache_sb(sb, inodes, sc->nid);
+	sc->nr_to_scan = dentries;
+	freed = prune_dcache_sb(sb, sc);
+	sc->nr_to_scan = inodes;
+	freed += prune_icache_sb(sb, sc);
 
-	if (fs_objects) {
-		fs_objects = mult_frac(sc->nr_to_scan, fs_objects,
-								total_objects);
+	if (fs_objects)
 		freed += sb->s_op->free_cached_objects(sb, fs_objects,
 						       sc->nid);
-	}
 
 	drop_super(sb);
 	return freed;
@@ -119,10 +119,8 @@ static unsigned long super_cache_count(struct shrinker *shrink,
 		total_objects = sb->s_op->nr_cached_objects(sb,
 						 sc->nid);
 
-	total_objects += list_lru_count_node(&sb->s_dentry_lru,
-						 sc->nid);
-	total_objects += list_lru_count_node(&sb->s_inode_lru,
-						 sc->nid);
+	total_objects += list_lru_shrink_count(&sb->s_dentry_lru, sc);
+	total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc);
 
 	total_objects = vfs_pressure_ratio(total_objects);
 	drop_super(sb);
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index c7f0b77..a2bb6a33 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1563,10 +1563,9 @@ xfs_buftarg_shrink_scan(
 					struct xfs_buftarg, bt_shrinker);
 	LIST_HEAD(dispose);
 	unsigned long		freed;
-	unsigned long		nr_to_scan = sc->nr_to_scan;
 
-	freed = list_lru_walk_node(&btp->bt_lru, sc->nid, xfs_buftarg_isolate,
-				       &dispose, &nr_to_scan);
+	freed = list_lru_shrink_walk(&btp->bt_lru, sc,
+				     xfs_buftarg_isolate, &dispose);
 
 	while (!list_empty(&dispose)) {
 		struct xfs_buf *bp;
@@ -1585,7 +1584,7 @@ xfs_buftarg_shrink_count(
 {
 	struct xfs_buftarg	*btp = container_of(shrink,
 					struct xfs_buftarg, bt_shrinker);
-	return list_lru_count_node(&btp->bt_lru, sc->nid);
+	return list_lru_shrink_count(&btp->bt_lru, sc);
 }
 
 void
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index 14a4996..65bc404 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -761,7 +761,6 @@ xfs_qm_shrink_scan(
 	struct xfs_qm_isolate	isol;
 	unsigned long		freed;
 	int			error;
-	unsigned long		nr_to_scan = sc->nr_to_scan;
 
 	if ((sc->gfp_mask & (__GFP_FS|__GFP_WAIT)) != (__GFP_FS|__GFP_WAIT))
 		return 0;
@@ -769,8 +768,8 @@ xfs_qm_shrink_scan(
 	INIT_LIST_HEAD(&isol.buffers);
 	INIT_LIST_HEAD(&isol.dispose);
 
-	freed = list_lru_walk_node(&qi->qi_lru, sc->nid, xfs_qm_dquot_isolate, &isol,
-					&nr_to_scan);
+	freed = list_lru_shrink_walk(&qi->qi_lru, sc,
+				     xfs_qm_dquot_isolate, &isol);
 
 	error = xfs_buf_delwri_submit(&isol.buffers);
 	if (error)
@@ -795,7 +794,7 @@ xfs_qm_shrink_count(
 	struct xfs_quotainfo	*qi = container_of(shrink,
 					struct xfs_quotainfo, qi_shrinker);
 
-	return list_lru_count_node(&qi->qi_lru, sc->nid);
+	return list_lru_shrink_count(&qi->qi_lru, sc);
 }
 
 /*
diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 3ce5417..194b1c4 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -9,6 +9,7 @@
 
 #include <linux/list.h>
 #include <linux/nodemask.h>
+#include <linux/shrinker.h>
 
 /* list_lru_walk_cb has to always return one of those */
 enum lru_status {
@@ -75,6 +76,13 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item);
  * Callers that want such a guarantee need to provide an outer lock.
  */
 unsigned long list_lru_count_node(struct list_lru *lru, int nid);
+
+static inline unsigned long list_lru_shrink_count(struct list_lru *lru,
+						  struct shrink_control *sc)
+{
+	return list_lru_count_node(lru, sc->nid);
+}
+
 static inline unsigned long list_lru_count(struct list_lru *lru)
 {
 	long count = 0;
@@ -114,6 +122,14 @@ unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
 				 unsigned long *nr_to_walk);
 
 static inline unsigned long
+list_lru_shrink_walk(struct list_lru *lru, struct shrink_control *sc,
+		     list_lru_walk_cb isolate, void *cb_arg)
+{
+	return list_lru_walk_node(lru, sc->nid, isolate, cb_arg,
+				  &sc->nr_to_scan);
+}
+
+static inline unsigned long
 list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
 	      void *cb_arg, unsigned long nr_to_walk)
 {
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v14 08/18] fs: consolidate {nr,free}_cached_objects args in shrink_control
  2013-12-16 12:16 ` Vladimir Davydov
@ 2013-12-16 12:16   ` Vladimir Davydov
  -1 siblings, 0 replies; 64+ messages in thread
From: Vladimir Davydov @ 2013-12-16 12:16 UTC (permalink / raw)
  To: dchinner, mhocko, hannes, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, glommer, Al Viro

We are going to make the FS shrinker memcg-aware. To achieve that, we
will have to pass the memcg to scan to the nr_cached_objects and
free_cached_objects VFS methods, which currently take only the NUMA node
to scan. Since the shrink_control structure already holds the node, and
the memcg to scan will be added to it as we introduce memcg-aware
vmscan, let us consolidate the methods' arguments in this structure to
keep things clean.

Thanks to David Chinner for the tip.
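
As an illustration (not part of this series - the foo_* names are made
up), a filesystem implementing these hooks would end up with callbacks
roughly like:

        static long foo_nr_cached_objects(struct super_block *sb,
                                          struct shrink_control *sc)
        {
                /* the node to scan now comes from the shrink_control */
                return foo_count_reclaimable(FOO_SB(sb), sc->nid);
        }

        static long foo_free_cached_objects(struct super_block *sb,
                                            struct shrink_control *sc)
        {
                /* likewise for the number of objects to scan */
                return foo_reclaim(FOO_SB(sb), sc->nid, sc->nr_to_scan);
        }

        static const struct super_operations foo_super_operations = {
                .nr_cached_objects      = foo_nr_cached_objects,
                .free_cached_objects    = foo_free_cached_objects,
        };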

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/super.c         |   12 ++++++------
 fs/xfs/xfs_super.c |    7 +++----
 include/linux/fs.h |    6 ++++--
 3 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 13c4c92..9084c3d 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -76,7 +76,7 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 		return SHRINK_STOP;
 
 	if (sb->s_op->nr_cached_objects)
-		fs_objects = sb->s_op->nr_cached_objects(sb, sc->nid);
+		fs_objects = sb->s_op->nr_cached_objects(sb, sc);
 
 	inodes = list_lru_shrink_count(&sb->s_inode_lru, sc);
 	dentries = list_lru_shrink_count(&sb->s_dentry_lru, sc);
@@ -96,9 +96,10 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 	sc->nr_to_scan = inodes;
 	freed += prune_icache_sb(sb, sc);
 
-	if (fs_objects)
-		freed += sb->s_op->free_cached_objects(sb, fs_objects,
-						       sc->nid);
+	if (fs_objects) {
+		sc->nr_to_scan = fs_objects;
+		freed += sb->s_op->free_cached_objects(sb, sc);
+	}
 
 	drop_super(sb);
 	return freed;
@@ -116,8 +117,7 @@ static unsigned long super_cache_count(struct shrinker *shrink,
 		return 0;
 
 	if (sb->s_op && sb->s_op->nr_cached_objects)
-		total_objects = sb->s_op->nr_cached_objects(sb,
-						 sc->nid);
+		total_objects = sb->s_op->nr_cached_objects(sb, sc);
 
 	total_objects += list_lru_shrink_count(&sb->s_dentry_lru, sc);
 	total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index f317488..28d0697 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1524,7 +1524,7 @@ xfs_fs_mount(
 static long
 xfs_fs_nr_cached_objects(
 	struct super_block	*sb,
-	int			nid)
+	struct shrink_control	*sc)
 {
 	return xfs_reclaim_inodes_count(XFS_M(sb));
 }
@@ -1532,10 +1532,9 @@ xfs_fs_nr_cached_objects(
 static long
 xfs_fs_free_cached_objects(
 	struct super_block	*sb,
-	long			nr_to_scan,
-	int			nid)
+	struct shrink_control	*sc)
 {
-	return xfs_reclaim_inodes_nr(XFS_M(sb), nr_to_scan);
+	return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
 }
 
 static const struct super_operations xfs_super_operations = {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 121f11f..aef5e6b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1619,8 +1619,10 @@ struct super_operations {
 	ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
 #endif
 	int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
-	long (*nr_cached_objects)(struct super_block *, int);
-	long (*free_cached_objects)(struct super_block *, long, int);
+	long (*nr_cached_objects)(struct super_block *,
+				  struct shrink_control *);
+	long (*free_cached_objects)(struct super_block *,
+				    struct shrink_control *);
 };
 
 /*
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v14 09/18] vmscan: move call to shrink_slab() to shrink_zones()
  2013-12-16 12:16 ` Vladimir Davydov
@ 2013-12-16 12:16   ` Vladimir Davydov
  -1 siblings, 0 replies; 64+ messages in thread
From: Vladimir Davydov @ 2013-12-16 12:16 UTC (permalink / raw)
  To: dchinner, mhocko, hannes, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, glommer,
	Mel Gorman, Rik van Riel

This reduces the indentation level of do_try_to_free_pages() and removes
the extra loop over all eligible zones that counted the number of on-LRU
pages.
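
In condensed form, the tail of shrink_zones() after this patch does the
slab shrinking that used to live in do_try_to_free_pages() (a sketch
distilled from the diff below; the zonelist walk above it is unchanged
apart from accumulating lru_pages and shrink->nodes_to_scan):

        if (global_reclaim(sc)) {
                shrink_slab(shrink, sc->nr_scanned, lru_pages);
                if (reclaim_state) {
                        sc->nr_reclaimed += reclaim_state->reclaimed_slab;
                        reclaim_state->reclaimed_slab = 0;
                }
        }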

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Reviewed-by: Glauber Costa <glommer@openvz.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c |   57 ++++++++++++++++++++++++++-------------------------------
 1 file changed, 26 insertions(+), 31 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index eea668d..035ab3a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2273,13 +2273,17 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
  * the caller that it should consider retrying the allocation instead of
  * further reclaim.
  */
-static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
+static bool shrink_zones(struct zonelist *zonelist,
+			 struct scan_control *sc,
+			 struct shrink_control *shrink)
 {
 	struct zoneref *z;
 	struct zone *zone;
 	unsigned long nr_soft_reclaimed;
 	unsigned long nr_soft_scanned;
+	unsigned long lru_pages = 0;
 	bool aborted_reclaim = false;
+	struct reclaim_state *reclaim_state = current->reclaim_state;
 
 	/*
 	 * If the number of buffer_heads in the machine exceeds the maximum
@@ -2289,6 +2293,8 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	if (buffer_heads_over_limit)
 		sc->gfp_mask |= __GFP_HIGHMEM;
 
+	nodes_clear(shrink->nodes_to_scan);
+
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 					gfp_zone(sc->gfp_mask), sc->nodemask) {
 		if (!populated_zone(zone))
@@ -2300,6 +2306,10 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 		if (global_reclaim(sc)) {
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 				continue;
+
+			lru_pages += zone_reclaimable_pages(zone);
+			node_set(zone_to_nid(zone), shrink->nodes_to_scan);
+
 			if (sc->priority != DEF_PRIORITY &&
 			    !zone_reclaimable(zone))
 				continue;	/* Let kswapd poll it */
@@ -2336,6 +2346,20 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 		shrink_zone(zone, sc);
 	}
 
+	/*
+	 * Don't shrink slabs when reclaiming memory from over limit
+	 * cgroups but do shrink slab at least once when aborting
+	 * reclaim for compaction to avoid unevenly scanning file/anon
+	 * LRU pages over slab pages.
+	 */
+	if (global_reclaim(sc)) {
+		shrink_slab(shrink, sc->nr_scanned, lru_pages);
+		if (reclaim_state) {
+			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
+			reclaim_state->reclaimed_slab = 0;
+		}
+	}
+
 	return aborted_reclaim;
 }
 
@@ -2380,9 +2404,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 					struct shrink_control *shrink)
 {
 	unsigned long total_scanned = 0;
-	struct reclaim_state *reclaim_state = current->reclaim_state;
-	struct zoneref *z;
-	struct zone *zone;
 	unsigned long writeback_threshold;
 	bool aborted_reclaim;
 
@@ -2395,34 +2416,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
 				sc->priority);
 		sc->nr_scanned = 0;
-		aborted_reclaim = shrink_zones(zonelist, sc);
-
-		/*
-		 * Don't shrink slabs when reclaiming memory from over limit
-		 * cgroups but do shrink slab at least once when aborting
-		 * reclaim for compaction to avoid unevenly scanning file/anon
-		 * LRU pages over slab pages.
-		 */
-		if (global_reclaim(sc)) {
-			unsigned long lru_pages = 0;
+		aborted_reclaim = shrink_zones(zonelist, sc, shrink);
 
-			nodes_clear(shrink->nodes_to_scan);
-			for_each_zone_zonelist(zone, z, zonelist,
-					gfp_zone(sc->gfp_mask)) {
-				if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
-					continue;
-
-				lru_pages += zone_reclaimable_pages(zone);
-				node_set(zone_to_nid(zone),
-					 shrink->nodes_to_scan);
-			}
-
-			shrink_slab(shrink, sc->nr_scanned, lru_pages);
-			if (reclaim_state) {
-				sc->nr_reclaimed += reclaim_state->reclaimed_slab;
-				reclaim_state->reclaimed_slab = 0;
-			}
-		}
 		total_scanned += sc->nr_scanned;
 		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
 			goto out;
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v14 10/18] vmscan: remove shrink_control arg from do_try_to_free_pages()
  2013-12-16 12:16 ` Vladimir Davydov
@ 2013-12-16 12:16   ` Vladimir Davydov
  -1 siblings, 0 replies; 64+ messages in thread
From: Vladimir Davydov @ 2013-12-16 12:16 UTC (permalink / raw)
  To: dchinner, mhocko, hannes, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, glommer,
	Mel Gorman, Rik van Riel

There is no need to pass a shrink_control struct on from
try_to_free_pages() and friends to do_try_to_free_pages() and then to
shrink_zones(): it is only used in shrink_zones(), and the only field
initialized at the top level is gfp_mask, which is always equal to
scan_control.gfp_mask. So let's move the shrink_control initialization
into shrink_zones().
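
Roughly, shrink_zones() then sets it up locally (a condensed sketch of
the pattern the diff below introduces):

        struct shrink_control shrink = {
                .gfp_mask = sc->gfp_mask,       /* the only field set here */
        };

        nodes_clear(shrink.nodes_to_scan);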

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c |   32 ++++++++++++--------------------
 1 file changed, 12 insertions(+), 20 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 035ab3a..33b356e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2274,8 +2274,7 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
  * further reclaim.
  */
 static bool shrink_zones(struct zonelist *zonelist,
-			 struct scan_control *sc,
-			 struct shrink_control *shrink)
+			 struct scan_control *sc)
 {
 	struct zoneref *z;
 	struct zone *zone;
@@ -2284,6 +2283,9 @@ static bool shrink_zones(struct zonelist *zonelist,
 	unsigned long lru_pages = 0;
 	bool aborted_reclaim = false;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
+	struct shrink_control shrink = {
+		.gfp_mask = sc->gfp_mask,
+	};
 
 	/*
 	 * If the number of buffer_heads in the machine exceeds the maximum
@@ -2293,7 +2295,7 @@ static bool shrink_zones(struct zonelist *zonelist,
 	if (buffer_heads_over_limit)
 		sc->gfp_mask |= __GFP_HIGHMEM;
 
-	nodes_clear(shrink->nodes_to_scan);
+	nodes_clear(shrink.nodes_to_scan);
 
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 					gfp_zone(sc->gfp_mask), sc->nodemask) {
@@ -2308,7 +2310,7 @@ static bool shrink_zones(struct zonelist *zonelist,
 				continue;
 
 			lru_pages += zone_reclaimable_pages(zone);
-			node_set(zone_to_nid(zone), shrink->nodes_to_scan);
+			node_set(zone_to_nid(zone), shrink.nodes_to_scan);
 
 			if (sc->priority != DEF_PRIORITY &&
 			    !zone_reclaimable(zone))
@@ -2353,7 +2355,7 @@ static bool shrink_zones(struct zonelist *zonelist,
 	 * LRU pages over slab pages.
 	 */
 	if (global_reclaim(sc)) {
-		shrink_slab(shrink, sc->nr_scanned, lru_pages);
+		shrink_slab(&shrink, sc->nr_scanned, lru_pages);
 		if (reclaim_state) {
 			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
 			reclaim_state->reclaimed_slab = 0;
@@ -2400,8 +2402,7 @@ static bool all_unreclaimable(struct zonelist *zonelist,
  * 		else, the number of pages reclaimed
  */
 static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
-					struct scan_control *sc,
-					struct shrink_control *shrink)
+					  struct scan_control *sc)
 {
 	unsigned long total_scanned = 0;
 	unsigned long writeback_threshold;
@@ -2416,7 +2417,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
 				sc->priority);
 		sc->nr_scanned = 0;
-		aborted_reclaim = shrink_zones(zonelist, sc, shrink);
+		aborted_reclaim = shrink_zones(zonelist, sc);
 
 		total_scanned += sc->nr_scanned;
 		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
@@ -2579,9 +2580,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 		.target_mem_cgroup = NULL,
 		.nodemask = nodemask,
 	};
-	struct shrink_control shrink = {
-		.gfp_mask = sc.gfp_mask,
-	};
 
 	/*
 	 * Do not enter reclaim if fatal signal was delivered while throttled.
@@ -2595,7 +2593,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 				sc.may_writepage,
 				gfp_mask);
 
-	nr_reclaimed = do_try_to_free_pages(zonelist, &sc, &shrink);
+	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
 
 	trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
 
@@ -2662,9 +2660,6 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 		.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
 				(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
 	};
-	struct shrink_control shrink = {
-		.gfp_mask = sc.gfp_mask,
-	};
 
 	/*
 	 * Unlike direct reclaim via alloc_pages(), memcg's reclaim doesn't
@@ -2679,7 +2674,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 					    sc.may_writepage,
 					    sc.gfp_mask);
 
-	nr_reclaimed = do_try_to_free_pages(zonelist, &sc, &shrink);
+	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
 
 	trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
 
@@ -3335,9 +3330,6 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
 		.order = 0,
 		.priority = DEF_PRIORITY,
 	};
-	struct shrink_control shrink = {
-		.gfp_mask = sc.gfp_mask,
-	};
 	struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask);
 	struct task_struct *p = current;
 	unsigned long nr_reclaimed;
@@ -3347,7 +3339,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
-	nr_reclaimed = do_try_to_free_pages(zonelist, &sc, &shrink);
+	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
 
 	p->reclaim_state = NULL;
 	lockdep_clear_current_reclaim_state();
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v14 11/18] vmscan: call NUMA-unaware shrinkers irrespective of nodemask
  2013-12-16 12:16 ` Vladimir Davydov
@ 2013-12-16 12:17   ` Vladimir Davydov
  -1 siblings, 0 replies; 64+ messages in thread
From: Vladimir Davydov @ 2013-12-16 12:17 UTC (permalink / raw)
  To: dchinner, mhocko, hannes, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, glommer,
	Mel Gorman, Rik van Riel

If a shrinker is not NUMA-aware, shrink_slab() should call it exactly
once with nid=0, but currently this is not the case: if node 0 is not
set in the nodemask, or if it is not online, such shrinkers are not
called at all. As a result, some slabs may be left untouched under
certain circumstances. Let us fix that.
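
For illustration only (a hypothetical shrinker, not part of this
patch), a NUMA-unaware shrinker reports a single global object count
and ignores sc->nid completely, so shrink_slab() must invoke it exactly
once, neither once per node nor (as can currently happen) not at all:

static unsigned long my_cache_nr_objects;	/* imaginary global counter */

/* Hypothetical NUMA-unaware shrinker callback: SHRINKER_NUMA_AWARE is
 * not set, sc->nid is ignored, and one invocation per shrink_slab()
 * call is expected. */
static unsigned long my_cache_count(struct shrinker *s,
				    struct shrink_control *sc)
{
	return my_cache_nr_objects;	/* global count, not per node */
}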

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Reported-by: Dave Chinner <dchinner@redhat.com>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c |   19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 33b356e..d98f272 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -352,16 +352,17 @@ unsigned long shrink_slab(struct shrink_control *shrinkctl,
 	}
 
 	list_for_each_entry(shrinker, &shrinker_list, list) {
-		for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
-			if (!node_online(shrinkctl->nid))
-				continue;
-
-			if (!(shrinker->flags & SHRINKER_NUMA_AWARE) &&
-			    (shrinkctl->nid != 0))
-				break;
-
+		if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) {
+			shrinkctl->nid = 0;
 			freed += shrink_slab_node(shrinkctl, shrinker,
-				 nr_pages_scanned, lru_pages);
+					nr_pages_scanned, lru_pages);
+			continue;
+		}
+
+		for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
+			if (node_online(shrinkctl->nid))
+				freed += shrink_slab_node(shrinkctl, shrinker,
+						nr_pages_scanned, lru_pages);
 
 		}
 	}
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v14 12/18] vmscan: shrink slab on memcg pressure
  2013-12-16 12:16 ` Vladimir Davydov
  (?)
@ 2013-12-16 12:17   ` Vladimir Davydov
  -1 siblings, 0 replies; 64+ messages in thread
From: Vladimir Davydov @ 2013-12-16 12:17 UTC (permalink / raw)
  To: dchinner, mhocko, hannes, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, glommer,
	Mel Gorman, Rik van Riel, Al Viro, Balbir Singh,
	KAMEZAWA Hiroyuki

This patch makes the direct reclaim path shrink slabs not only under
global memory pressure, but also when the user memory limit of a memcg
is reached. To achieve that, it makes shrink_slab() walk over the memcg
hierarchy and run shrinkers marked as memcg-aware on the target memcg
and all its descendants. The memcg to scan is passed in the
shrink_control structure; memcg-unaware shrinkers are still called only
on global memory pressure, with memcg=NULL. It is up to each shrinker
how to organize the objects it is responsible for so as to achieve
per-memcg reclaim.
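
For illustration (a hypothetical shrinker, not from this series), a
memcg-aware shrinker would set SHRINKER_MEMCG_AWARE and use sc->memcg
to select which object list to count and scan; my_cache_objects_for()
stands in for whatever per-memcg lookup the shrinker implements:

/* Hypothetical memcg-aware shrinker: my_cache_objects_for() is an
 * imaginary helper returning the per-memcg object list, or the global
 * one when sc->memcg is NULL. */
static unsigned long my_cache_count(struct shrinker *s,
				    struct shrink_control *sc)
{
	return my_cache_nr_objects(my_cache_objects_for(sc->memcg));
}

static unsigned long my_cache_scan(struct shrinker *s,
				   struct shrink_control *sc)
{
	return my_cache_reclaim(my_cache_objects_for(sc->memcg),
				sc->nr_to_scan);
}

static struct shrinker my_cache_shrinker = {
	.count_objects	= my_cache_count,
	.scan_objects	= my_cache_scan,
	.seeks		= DEFAULT_SEEKS,
	.flags		= SHRINKER_MEMCG_AWARE,
};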

Note that we do not intend to implement true per-memcg, per-node
reclaim. Most memcgs are small and typically confined to one or two
NUMA nodes by external means, so they do not need the scalability that
NUMA-aware shrinkers provide. We therefore do per-node shrinking only
for the global list (memcg=NULL), while per-memcg lists are always
scanned exactly once with nid=0, irrespective of the nodemask.

The idea behind the patch, as well as the initial implementation,
belongs to Glauber Costa.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h |   22 ++++++++++
 include/linux/shrinker.h   |   10 ++++-
 mm/memcontrol.c            |   37 ++++++++++++++++-
 mm/vmscan.c                |   95 ++++++++++++++++++++++++++++++++++----------
 4 files changed, 142 insertions(+), 22 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e3efab2..6001b31 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -95,6 +95,9 @@ extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
 
+unsigned long mem_cgroup_zone_reclaimable_pages(struct zone *,
+						struct mem_cgroup *);
+
 /* For coalescing uncharge for reducing memcg' overhead*/
 extern void mem_cgroup_uncharge_start(void);
 extern void mem_cgroup_uncharge_end(void);
@@ -304,6 +307,12 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
 	return &zone->lruvec;
 }
 
+static inline unsigned long mem_cgroup_zone_reclaimable_pages(struct zone *zone,
+							struct mem_cgroup *memcg)
+{
+	return 0;
+}
+
 static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 {
 	return NULL;
@@ -494,6 +503,9 @@ static inline bool memcg_kmem_enabled(void)
 	return static_key_false(&memcg_kmem_enabled_key);
 }
 
+bool memcg_kmem_is_active(struct mem_cgroup *memcg);
+bool memcg_kmem_should_reclaim(struct mem_cgroup *memcg);
+
 /*
  * In general, we'll do everything in our power to not incur in any overhead
  * for non-memcg users for the kmem functions. Not even a function call, if we
@@ -635,6 +647,16 @@ static inline bool memcg_kmem_enabled(void)
 	return false;
 }
 
+static inline bool memcg_kmem_is_active(struct mem_cgroup *memcg)
+{
+	return false;
+}
+
+static inline bool memcg_kmem_should_reclaim(struct mem_cgroup *memcg)
+{
+	return false;
+}
+
 static inline bool
 memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
 {
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 68c0970..ab79b17 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -20,8 +20,15 @@ struct shrink_control {
 
 	/* shrink from these nodes */
 	nodemask_t nodes_to_scan;
+
+	/* shrink from this memory cgroup hierarchy (if not NULL) */
+	struct mem_cgroup *target_mem_cgroup;
+
 	/* current node being shrunk (for NUMA aware shrinkers) */
 	int nid;
+
+	/* current memcg being shrunk (for memcg aware shrinkers) */
+	struct mem_cgroup *memcg;
 };
 
 #define SHRINK_STOP (~0UL)
@@ -63,7 +70,8 @@ struct shrinker {
 #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
 
 /* Flags */
-#define SHRINKER_NUMA_AWARE (1 << 0)
+#define SHRINKER_NUMA_AWARE	(1 << 0)
+#define SHRINKER_MEMCG_AWARE	(1 << 1)
 
 extern int register_shrinker(struct shrinker *);
 extern void unregister_shrinker(struct shrinker *);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f2372b0..13b3131 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -353,7 +353,7 @@ static inline void memcg_kmem_set_active(struct mem_cgroup *memcg)
 	set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
 }
 
-static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
+bool memcg_kmem_is_active(struct mem_cgroup *memcg)
 {
 	return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
 }
@@ -1303,6 +1303,26 @@ out:
 	return lruvec;
 }
 
+unsigned long mem_cgroup_zone_reclaimable_pages(struct zone *zone,
+						struct mem_cgroup *memcg)
+{
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+	unsigned long nr = 0;
+	struct mem_cgroup *iter;
+
+	iter = memcg;
+	do {
+		nr += mem_cgroup_zone_nr_lru_pages(iter, nid, zid,
+						   LRU_ALL_FILE);
+		if (do_swap_account)
+			nr += mem_cgroup_zone_nr_lru_pages(iter, nid, zid,
+							   LRU_ALL_ANON);
+		iter = mem_cgroup_iter(memcg, iter, NULL);
+	} while (iter);
+	return nr;
+}
+
 /*
  * Following LRU functions are allowed to be used without PCG_LOCK.
  * Operations are called by routine of global LRU independently from memcg.
@@ -2932,6 +2952,21 @@ static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
 		memcg_kmem_is_active(memcg);
 }
 
+bool memcg_kmem_should_reclaim(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *iter;
+
+	iter = memcg;
+	do {
+		if (memcg_kmem_is_active(iter)) {
+			mem_cgroup_iter_break(memcg, iter);
+			return true;
+		}
+		iter = mem_cgroup_iter(memcg, iter, NULL);
+	} while (iter);
+	return false;
+}
+
 /*
  * helper for acessing a memcg's index. It will be used as an index in the
  * child cache array in kmem_cache, and also to derive its name. This function
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d98f272..67c1950 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -311,6 +311,34 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
 	return freed;
 }
 
+static unsigned long
+run_shrinker(struct shrink_control *shrinkctl, struct shrinker *shrinker,
+	     unsigned long nr_pages_scanned, unsigned long lru_pages)
+{
+	unsigned long freed = 0;
+
+	/*
+	 * Since most memory cgroups are small and typically confined to a
+	 * single NUMA node or two by external means and therefore do not need
+	 * the scalability NUMA aware shrinkers provide, we implement per node
+	 * shrinking only for the global list.
+	 */
+	if (!(shrinker->flags & SHRINKER_NUMA_AWARE) ||
+	    shrinkctl->memcg) {
+		shrinkctl->nid = 0;
+		return shrink_slab_node(shrinkctl, shrinker,
+					nr_pages_scanned, lru_pages);
+	}
+
+	for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
+		if (node_online(shrinkctl->nid))
+			freed += shrink_slab_node(shrinkctl, shrinker,
+						  nr_pages_scanned, lru_pages);
+
+	}
+	return freed;
+}
+
 /*
  * Call the shrink functions to age shrinkable caches
  *
@@ -352,20 +380,34 @@ unsigned long shrink_slab(struct shrink_control *shrinkctl,
 	}
 
 	list_for_each_entry(shrinker, &shrinker_list, list) {
-		if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) {
-			shrinkctl->nid = 0;
-			freed += shrink_slab_node(shrinkctl, shrinker,
-					nr_pages_scanned, lru_pages);
+		/*
+		 * Call memcg-unaware shrinkers only on global pressure.
+		 */
+		if (!(shrinker->flags & SHRINKER_MEMCG_AWARE)) {
+			if (!shrinkctl->target_mem_cgroup) {
+				shrinkctl->memcg = NULL;
+				freed += run_shrinker(shrinkctl, shrinker,
+						nr_pages_scanned, lru_pages);
+			}
 			continue;
 		}
 
-		for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
-			if (node_online(shrinkctl->nid))
-				freed += shrink_slab_node(shrinkctl, shrinker,
+		/*
+		 * For memcg-aware shrinkers iterate over the target memcg
+		 * hierarchy and run the shrinker on each kmem-active memcg
+		 * found in the hierarchy.
+		 */
+		shrinkctl->memcg = shrinkctl->target_mem_cgroup;
+		do {
+			if (!shrinkctl->memcg ||
+			    memcg_kmem_is_active(shrinkctl->memcg))
+				freed += run_shrinker(shrinkctl, shrinker,
 						nr_pages_scanned, lru_pages);
-
-		}
+		} while ((shrinkctl->memcg =
+			  mem_cgroup_iter(shrinkctl->target_mem_cgroup,
+					  shrinkctl->memcg, NULL)) != NULL);
 	}
+
 	up_read(&shrinker_rwsem);
 out:
 	cond_resched();
@@ -2286,6 +2328,7 @@ static bool shrink_zones(struct zonelist *zonelist,
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	struct shrink_control shrink = {
 		.gfp_mask = sc->gfp_mask,
+		.target_mem_cgroup = sc->target_mem_cgroup,
 	};
 
 	/*
@@ -2302,17 +2345,22 @@ static bool shrink_zones(struct zonelist *zonelist,
 					gfp_zone(sc->gfp_mask), sc->nodemask) {
 		if (!populated_zone(zone))
 			continue;
+
+		if (global_reclaim(sc) &&
+		    !cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
+			continue;
+
+		lru_pages += global_reclaim(sc) ?
+				zone_reclaimable_pages(zone) :
+				mem_cgroup_zone_reclaimable_pages(zone,
+						sc->target_mem_cgroup);
+		node_set(zone_to_nid(zone), shrink.nodes_to_scan);
+
 		/*
 		 * Take care memory controller reclaiming has small influence
 		 * to global LRU.
 		 */
 		if (global_reclaim(sc)) {
-			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
-				continue;
-
-			lru_pages += zone_reclaimable_pages(zone);
-			node_set(zone_to_nid(zone), shrink.nodes_to_scan);
-
 			if (sc->priority != DEF_PRIORITY &&
 			    !zone_reclaimable(zone))
 				continue;	/* Let kswapd poll it */
@@ -2350,12 +2398,11 @@ static bool shrink_zones(struct zonelist *zonelist,
 	}
 
 	/*
-	 * Don't shrink slabs when reclaiming memory from over limit
-	 * cgroups but do shrink slab at least once when aborting
-	 * reclaim for compaction to avoid unevenly scanning file/anon
-	 * LRU pages over slab pages.
+	 * Shrink slabs at least once when aborting reclaim for compaction
+	 * to avoid unevenly scanning file/anon LRU pages over slab pages.
 	 */
-	if (global_reclaim(sc)) {
+	if (global_reclaim(sc) ||
+	    memcg_kmem_should_reclaim(sc->target_mem_cgroup)) {
 		shrink_slab(&shrink, sc->nr_scanned, lru_pages);
 		if (reclaim_state) {
 			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
@@ -2649,6 +2696,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 	struct zonelist *zonelist;
 	unsigned long nr_reclaimed;
 	int nid;
+	struct reclaim_state reclaim_state;
 	struct scan_control sc = {
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
@@ -2671,6 +2719,10 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 
 	zonelist = NODE_DATA(nid)->node_zonelists;
 
+	lockdep_set_current_reclaim_state(sc.gfp_mask);
+	reclaim_state.reclaimed_slab = 0;
+	current->reclaim_state = &reclaim_state;
+
 	trace_mm_vmscan_memcg_reclaim_begin(0,
 					    sc.may_writepage,
 					    sc.gfp_mask);
@@ -2679,6 +2731,9 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 
 	trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
 
+	current->reclaim_state = NULL;
+	lockdep_clear_current_reclaim_state();
+
 	return nr_reclaimed;
 }
 #endif
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v14 13/18] vmscan: take at least one pass with shrinkers
  2013-12-16 12:16 ` Vladimir Davydov
@ 2013-12-16 12:17   ` Vladimir Davydov
  -1 siblings, 0 replies; 64+ messages in thread
From: Vladimir Davydov @ 2013-12-16 12:17 UTC (permalink / raw)
  To: dchinner, mhocko, hannes, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, glommer,
	Mel Gorman, Rik van Riel

From: Glauber Costa <glommer@openvz.org>

In very low free kernel memory situations, we may have fewer objects to
free than our initial batch size. If that is the case, it is better to
shrink those objects and open up space for the new workload than to
keep them and fail the new allocations.

In particular, we are concerned with the direct reclaim case for memcg.
Although the same technique could be applied to other situations just
as well, we start conservatively and apply it only to that case, which
is the one that matters most.
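
As a hypothetical illustration (the numbers are not from the patch):
with the default batch size of SHRINK_BATCH (128) and a small
memcg-bound cache reporting total_scan = 40, the old condition
(total_scan >= batch_size) never lets the loop run, so those 40 objects
survive every reclaim pass; with this change, when
shrinkctl->target_mem_cgroup is set, the loop runs once with
nr_to_scan = min(128, 40) = 40 and the cache can actually be emptied.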

Signed-off-by: Glauber Costa <glommer@openvz.org>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c |   13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 67c1950..95fd2c3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -281,17 +281,22 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
 				nr_pages_scanned, lru_pages,
 				max_pass, delta, total_scan);
 
-	while (total_scan >= batch_size) {
+	while (total_scan > 0) {
 		unsigned long ret;
+		unsigned long nr_to_scan = min(batch_size, total_scan);
 
-		shrinkctl->nr_to_scan = batch_size;
+		if (!shrinkctl->target_mem_cgroup &&
+		    total_scan < batch_size)
+			break;
+
+		shrinkctl->nr_to_scan = nr_to_scan;
 		ret = shrinker->scan_objects(shrinker, shrinkctl);
 		if (ret == SHRINK_STOP)
 			break;
 		freed += ret;
 
-		count_vm_events(SLABS_SCANNED, batch_size);
-		total_scan -= batch_size;
+		count_vm_events(SLABS_SCANNED, nr_to_scan);
+		total_scan -= nr_to_scan;
 
 		cond_resched();
 	}
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v14 14/18] list_lru: add per-memcg lists
  2013-12-16 12:16 ` Vladimir Davydov
  (?)
@ 2013-12-16 12:17   ` Vladimir Davydov
  -1 siblings, 0 replies; 64+ messages in thread
From: Vladimir Davydov @ 2013-12-16 12:17 UTC (permalink / raw)
  To: dchinner, mhocko, hannes, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, glommer,
	Al Viro, Balbir Singh, KAMEZAWA Hiroyuki

There are several FS shrinkers, including super_block::s_shrink, that
keep reclaimable objects in a list_lru structure. Hence, to turn them
into memcg-aware shrinkers, it is enough to make list_lru per-memcg.

This patch does the trick. It adds an array of LRU lists to the list_lru
structure, one for each kmem-active memcg, and dispatches every item
addition or removal operation to the list corresponding to the memcg the
item is accounted to.
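
As a rough sketch of the dispatch idea (illustrative only; the real
code is in the mm/list_lru.c hunks of this patch, memcg_of_kmem_page()
is an imaginary helper for the page-to-memcg lookup, and RCU protection
of the per-memcg array is omitted):

static struct list_lru_node *lru_node_of(struct list_lru *lru,
					 struct page *page, int nid)
{
	int idx = memcg_cache_id(memcg_of_kmem_page(page));

	/* Root memcg or kmem accounting disabled: use the global list. */
	if (idx < 0 || !lru->memcg)
		return &lru->node[nid];
	return lru->memcg[idx];		/* per-memcg list for this id */
}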

Since we already pass a shrink_control object to the list_lru count and
walk functions to specify the NUMA node to scan, and the target memcg
is held in this structure, there is no need to change the list_lru
interface.
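
For example (a sketch of the intended usage; the actual conversion of
the super_block shrinker comes later in the series), a shrinker
callback simply forwards the shrink_control it receives, and the same
call covers both the global and the per-memcg case:

/* my_fs_objects_lru is an imaginary list_lru owned by the filesystem. */
static unsigned long my_fs_cache_count(struct shrinker *s,
				       struct shrink_control *sc)
{
	return list_lru_shrink_count(&my_fs_objects_lru, sc);
}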

To make sure each kmem-active memcg has its list initialized in each
memcg-enabled list_lru, we keep all memcg-enabled list_lrus on a linked
list, which we iterate over to allocate per-memcg LRUs whenever a new
kmem-active memcg is added. To synchronize this with the creation of
new list_lrus, we have to take activate_kmem_mutex. Since using this
mutex as is would make all mounts proceed serially, we turn it into an
rw semaphore and take it for writing whenever a new kmem-active memcg
is created, and for reading when we are about to create a list_lru.
This still does not allow mount_fs() to proceed concurrently with the
creation of a kmem-active memcg, but since memcg creation is a rather
rare event, this is not critical.
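
A minimal sketch of the locking pattern described above (names are
illustrative; the real code is split between mm/list_lru.c and
mm/memcontrol.c, and an additional lock, omitted here, serializes
insertions into the global list among concurrent readers):

static DECLARE_RWSEM(activate_kmem_sem);  /* rwsem replacing activate_kmem_mutex */
static LIST_HEAD(all_memcg_lrus);

/* Creating a memcg-enabled list_lru, e.g. at mount time. */
static int example_register_lru(struct list_lru *lru)
{
	down_read(&activate_kmem_sem);	/* mounts may proceed concurrently */
	/*
	 * Allocate lru->memcg[] sized for the memcg ids that exist right
	 * now and link lru into all_memcg_lrus.
	 */
	up_read(&activate_kmem_sem);
	return 0;
}

/* Activating kmem accounting for a newly created memcg. */
static void example_activate_kmem(int new_memcg_id)
{
	down_write(&activate_kmem_sem);	/* excludes list_lru creation */
	/*
	 * Walk all_memcg_lrus and, for every registered lru, grow
	 * lru->memcg[] and allocate a list_lru_node for new_memcg_id.
	 */
	up_write(&activate_kmem_sem);
}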

The idea behind the patch, as well as the initial implementation,
belongs to Glauber Costa.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/list_lru.h   |  112 +++++++++++++------
 include/linux/memcontrol.h |   13 +++
 mm/list_lru.c              |  257 +++++++++++++++++++++++++++++++++++++++-----
 mm/memcontrol.c            |  181 +++++++++++++++++++++++++++++--
 4 files changed, 495 insertions(+), 68 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 194b1c4..9b228dc 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -11,6 +11,8 @@
 #include <linux/nodemask.h>
 #include <linux/shrinker.h>
 
+struct mem_cgroup;
+
 /* list_lru_walk_cb has to always return one of those */
 enum lru_status {
 	LRU_REMOVED,		/* item removed from list */
@@ -30,10 +32,52 @@ struct list_lru_node {
 struct list_lru {
 	struct list_lru_node	*node;
 	nodemask_t		active_nodes;
+#ifdef CONFIG_MEMCG_KMEM
+	/*
+	 * In order to provide ability of scanning objects from different
+	 * memory cgroups independently, we keep a separate LRU list for each
+	 * kmem-active memcg in this array. The array is RCU-protected and
+	 * indexed by memcg_cache_id().
+	 */
+	struct list_lru_node	**memcg;
+	/*
+	 * Every time a kmem-active memcg is created or destroyed, we have to
+	 * update the array of per-memcg LRUs in each memcg enabled list_lru
+	 * structure. To achieve that, we keep all memcg enabled list_lru
+	 * structures in the all_memcg_lrus list.
+	 */
+	struct list_head	memcg_lrus_list;
+	/*
+	 * Since the array of per-memcg LRUs is RCU-protected, we can only free
+	 * it after a call to synchronize_rcu(). To avoid multiple calls to
+	 * synchronize_rcu() when a lot of LRUs get updated at the same time,
+	 * which is a typical scenario, we will store the pointer to the
+	 * previous version of the array in the memcg_old field for each
+	 * list_lru structure, and then free them all at once after a single
+	 * call to synchronize_rcu().
+	 */
+	void			*memcg_old;
+#endif /* CONFIG_MEMCG_KMEM */
 };
 
+#ifdef CONFIG_MEMCG_KMEM
+int list_lru_memcg_alloc(struct list_lru *lru, int memcg_id);
+void list_lru_memcg_free(struct list_lru *lru, int memcg_id);
+int list_lru_grow_memcg(struct list_lru *lru, size_t new_array_size);
+#endif
+
 void list_lru_destroy(struct list_lru *lru);
-int list_lru_init(struct list_lru *lru);
+int __list_lru_init(struct list_lru *lru, bool memcg_enabled);
+
+static inline int list_lru_init(struct list_lru *lru)
+{
+	return __list_lru_init(lru, false);
+}
+
+static inline int list_lru_init_memcg(struct list_lru *lru)
+{
+	return __list_lru_init(lru, true);
+}
 
 /**
  * list_lru_add: add an element to the lru list's tail
@@ -67,39 +111,41 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item);
 bool list_lru_del(struct list_lru *lru, struct list_head *item);
 
 /**
- * list_lru_count_node: return the number of objects currently held by @lru
+ * list_lru_count_node_memcg: return the number of objects currently held by a
+ *  list_lru.
  * @lru: the lru pointer.
  * @nid: the node id to count from.
+ * @memcg: the memcg to count from.
  *
  * Always return a non-negative number, 0 for empty lists. There is no
  * guarantee that the list is not updated while the count is being computed.
  * Callers that want such a guarantee need to provide an outer lock.
  */
-unsigned long list_lru_count_node(struct list_lru *lru, int nid);
+unsigned long list_lru_count_node_memcg(struct list_lru *lru,
+					int nid, struct mem_cgroup *memcg);
 
-static inline unsigned long list_lru_shrink_count(struct list_lru *lru,
-						  struct shrink_control *sc)
+unsigned long list_lru_count(struct list_lru *lru);
+
+static inline unsigned long list_lru_count_node(struct list_lru *lru, int nid)
 {
-	return list_lru_count_node(lru, sc->nid);
+	return list_lru_count_node_memcg(lru, nid, NULL);
 }
 
-static inline unsigned long list_lru_count(struct list_lru *lru)
+static inline unsigned long list_lru_shrink_count(struct list_lru *lru,
+						  struct shrink_control *sc)
 {
-	long count = 0;
-	int nid;
-
-	for_each_node_mask(nid, lru->active_nodes)
-		count += list_lru_count_node(lru, nid);
-
-	return count;
+	return list_lru_count_node_memcg(lru, sc->nid, sc->memcg);
 }
 
 typedef enum lru_status
 (*list_lru_walk_cb)(struct list_head *item, spinlock_t *lock, void *cb_arg);
+
 /**
- * list_lru_walk_node: walk a list_lru, isolating and disposing freeable items.
+ * list_lru_walk_node_memcg: walk a list_lru, isolating and disposing freeable
+ *  items.
  * @lru: the lru pointer.
  * @nid: the node id to scan from.
+ * @memcg: the memcg to scan from.
  * @isolate: callback function that is resposible for deciding what to do with
  *  the item currently being scanned
  * @cb_arg: opaque type that will be passed to @isolate
@@ -117,31 +163,29 @@ typedef enum lru_status
  *
  * Return value: the number of objects effectively removed from the LRU.
  */
-unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
-				 list_lru_walk_cb isolate, void *cb_arg,
-				 unsigned long *nr_to_walk);
+unsigned long list_lru_walk_node_memcg(struct list_lru *lru,
+				       int nid, struct mem_cgroup *memcg,
+				       list_lru_walk_cb isolate, void *cb_arg,
+				       unsigned long *nr_to_walk);
+
+unsigned long list_lru_walk(struct list_lru *lru,
+			    list_lru_walk_cb isolate, void *cb_arg,
+			    unsigned long nr_to_walk);
 
 static inline unsigned long
-list_lru_shrink_walk(struct list_lru *lru, struct shrink_control *sc,
-		     list_lru_walk_cb isolate, void *cb_arg)
+list_lru_walk_node(struct list_lru *lru, int nid,
+		   list_lru_walk_cb isolate, void *cb_arg,
+		   unsigned long *nr_to_walk)
 {
-	return list_lru_walk_node(lru, sc->nid, isolate, cb_arg,
-				  &sc->nr_to_scan);
+	return list_lru_walk_node_memcg(lru, nid, NULL,
+					isolate, cb_arg, nr_to_walk);
 }
 
 static inline unsigned long
-list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
-	      void *cb_arg, unsigned long nr_to_walk)
+list_lru_shrink_walk(struct list_lru *lru, struct shrink_control *sc,
+		     list_lru_walk_cb isolate, void *cb_arg)
 {
-	long isolated = 0;
-	int nid;
-
-	for_each_node_mask(nid, lru->active_nodes) {
-		isolated += list_lru_walk_node(lru, nid, isolate,
-					       cb_arg, &nr_to_walk);
-		if (nr_to_walk <= 0)
-			break;
-	}
-	return isolated;
+	return list_lru_walk_node_memcg(lru, sc->nid, sc->memcg,
+					isolate, cb_arg, &sc->nr_to_scan);
 }
 #endif /* _LRU_LIST_H */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6001b31..44fc58a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -29,6 +29,7 @@ struct page_cgroup;
 struct page;
 struct mm_struct;
 struct kmem_cache;
+struct list_lru;
 
 /*
  * The corresponding mem_cgroup_stat_names is defined in mm/memcontrol.c,
@@ -538,6 +539,9 @@ __memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);
 void mem_cgroup_destroy_cache(struct kmem_cache *cachep);
 void kmem_cache_destroy_memcg_children(struct kmem_cache *s);
 
+int memcg_list_lru_init(struct list_lru *lru, bool memcg_enabled);
+void memcg_list_lru_destroy(struct list_lru *lru);
+
 /**
  * memcg_kmem_newpage_charge: verify if a new kmem allocation is allowed.
  * @gfp: the gfp allocation flags.
@@ -702,6 +706,15 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
 static inline void kmem_cache_destroy_memcg_children(struct kmem_cache *s)
 {
 }
+
+static inline int memcg_list_lru_init(struct list_lru *lru, bool memcg_enabled)
+{
+	return 0;
+}
+
+static inline void memcg_list_lru_destroy(struct list_lru *lru)
+{
+}
 #endif /* CONFIG_MEMCG_KMEM */
 #endif /* _LINUX_MEMCONTROL_H */
 
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 72f9dec..db644f8 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -7,19 +7,87 @@
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/mm.h>
-#include <linux/list_lru.h>
 #include <linux/slab.h>
+#include <linux/memcontrol.h>
+#include <linux/page_cgroup.h>
+#include <linux/list_lru.h>
+
+#ifdef CONFIG_MEMCG_KMEM
+static inline bool lru_has_memcg(struct list_lru *lru)
+{
+	return !!lru->memcg;
+}
+
+static struct list_lru_node *lru_node_of_index(struct list_lru *lru,
+					int nid, int memcg_id, bool *is_global)
+{
+	struct list_lru_node **memcg_lrus;
+	struct list_lru_node *nlru = NULL;
+
+	if (memcg_id < 0 || !lru_has_memcg(lru)) {
+		*is_global = true;
+		return &lru->node[nid];
+	}
+
+	rcu_read_lock();
+	memcg_lrus = rcu_dereference(lru->memcg);
+	nlru = memcg_lrus[memcg_id];
+	rcu_read_unlock();
+
+	*is_global = false;
+	return nlru;
+}
+
+static struct list_lru_node *lru_node_of_page(struct list_lru *lru,
+					struct page *page, bool *is_global)
+{
+	struct page_cgroup *pc;
+	struct mem_cgroup *memcg;
+
+	/*
+	 * Since a kmem page cannot change its cgroup after its allocation is
+	 * committed, we do not need to lock_page_cgroup() here.
+	 */
+	pc = lookup_page_cgroup(compound_head(page));
+	memcg = PageCgroupUsed(pc) ? pc->mem_cgroup : NULL;
+
+	return lru_node_of_index(lru, page_to_nid(page),
+				 memcg_cache_id(memcg), is_global);
+}
+#else /* !CONFIG_MEMCG_KMEM */
+static inline bool lru_has_memcg(struct list_lru *lru)
+{
+	return false;
+}
+
+static inline struct list_lru_node *lru_node_of_index(struct list_lru *lru,
+					int nid, int memcg_id, bool *is_global)
+{
+	*is_global = true;
+	return &lru->node[nid];
+}
+
+static inline struct list_lru_node *lru_node_of_page(struct list_lru *lru,
+					struct page *page, bool *is_global)
+{
+	return lru_node_of_index(lru, page_to_nid(page), -1, is_global);
+}
+#endif /* CONFIG_MEMCG_KMEM */
 
 bool list_lru_add(struct list_lru *lru, struct list_head *item)
 {
-	int nid = page_to_nid(virt_to_page(item));
-	struct list_lru_node *nlru = &lru->node[nid];
+	struct page *page = virt_to_page(item);
+	int nid = page_to_nid(page);
+	bool is_global;
+	struct list_lru_node *nlru;
+
+	nlru = lru_node_of_page(lru, page, &is_global);
 
 	spin_lock(&nlru->lock);
 	WARN_ON_ONCE(nlru->nr_items < 0);
 	if (list_empty(item)) {
 		list_add_tail(item, &nlru->list);
-		if (nlru->nr_items++ == 0)
+		if (nlru->nr_items++ == 0 && is_global)
 			node_set(nid, lru->active_nodes);
 		spin_unlock(&nlru->lock);
 		return true;
@@ -31,13 +99,17 @@ EXPORT_SYMBOL_GPL(list_lru_add);
 
 bool list_lru_del(struct list_lru *lru, struct list_head *item)
 {
-	int nid = page_to_nid(virt_to_page(item));
-	struct list_lru_node *nlru = &lru->node[nid];
+	struct page *page = virt_to_page(item);
+	int nid = page_to_nid(page);
+	bool is_global;
+	struct list_lru_node *nlru;
+
+	nlru = lru_node_of_page(lru, page, &is_global);
 
 	spin_lock(&nlru->lock);
 	if (!list_empty(item)) {
 		list_del_init(item);
-		if (--nlru->nr_items == 0)
+		if (--nlru->nr_items == 0 && is_global)
 			node_clear(nid, lru->active_nodes);
 		WARN_ON_ONCE(nlru->nr_items < 0);
 		spin_unlock(&nlru->lock);
@@ -48,11 +120,14 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item)
 }
 EXPORT_SYMBOL_GPL(list_lru_del);
 
-unsigned long
-list_lru_count_node(struct list_lru *lru, int nid)
+unsigned long list_lru_count_node_memcg(struct list_lru *lru,
+					int nid, struct mem_cgroup *memcg)
 {
 	unsigned long count = 0;
-	struct list_lru_node *nlru = &lru->node[nid];
+	bool is_global;
+	struct list_lru_node *nlru;
+
+	nlru = lru_node_of_index(lru, nid, memcg_cache_id(memcg), &is_global);
 
 	spin_lock(&nlru->lock);
 	WARN_ON_ONCE(nlru->nr_items < 0);
@@ -61,16 +136,41 @@ list_lru_count_node(struct list_lru *lru, int nid)
 
 	return count;
 }
-EXPORT_SYMBOL_GPL(list_lru_count_node);
+EXPORT_SYMBOL_GPL(list_lru_count_node_memcg);
+
+unsigned long list_lru_count(struct list_lru *lru)
+{
+	long count = 0;
+	int nid;
+	struct mem_cgroup *memcg;
+
+	for_each_node_mask(nid, lru->active_nodes)
+		count += list_lru_count_node(lru, nid);
+
+	if (!lru_has_memcg(lru))
+		goto out;
+
+	for_each_mem_cgroup(memcg) {
+		if (memcg_kmem_is_active(memcg))
+			count += list_lru_count_node_memcg(lru, 0, memcg);
+	}
+out:
+	return count;
+}
+EXPORT_SYMBOL_GPL(list_lru_count);
 
-unsigned long
-list_lru_walk_node(struct list_lru *lru, int nid, list_lru_walk_cb isolate,
-		   void *cb_arg, unsigned long *nr_to_walk)
+unsigned long list_lru_walk_node_memcg(struct list_lru *lru,
+				       int nid, struct mem_cgroup *memcg,
+				       list_lru_walk_cb isolate, void *cb_arg,
+				       unsigned long *nr_to_walk)
 {
 
-	struct list_lru_node	*nlru = &lru->node[nid];
 	struct list_head *item, *n;
 	unsigned long isolated = 0;
+	bool is_global;
+	struct list_lru_node *nlru;
+
+	nlru = lru_node_of_index(lru, nid, memcg_cache_id(memcg), &is_global);
 
 	spin_lock(&nlru->lock);
 restart:
@@ -88,7 +188,7 @@ restart:
 		ret = isolate(item, &nlru->lock, cb_arg);
 		switch (ret) {
 		case LRU_REMOVED:
-			if (--nlru->nr_items == 0)
+			if (--nlru->nr_items == 0 && is_global)
 				node_clear(nid, lru->active_nodes);
 			WARN_ON_ONCE(nlru->nr_items < 0);
 			isolated++;
@@ -112,29 +212,134 @@ restart:
 	spin_unlock(&nlru->lock);
 	return isolated;
 }
-EXPORT_SYMBOL_GPL(list_lru_walk_node);
+EXPORT_SYMBOL_GPL(list_lru_walk_node_memcg);
 
-int list_lru_init(struct list_lru *lru)
+unsigned long list_lru_walk(struct list_lru *lru,
+			    list_lru_walk_cb isolate, void *cb_arg,
+			    unsigned long nr_to_walk)
+{
+	long isolated = 0;
+	int nid;
+	struct mem_cgroup *memcg;
+
+	for_each_node_mask(nid, lru->active_nodes) {
+		isolated += list_lru_walk_node(lru, nid, isolate,
+					       cb_arg, &nr_to_walk);
+		if (nr_to_walk <= 0)
+			break;
+	}
+
+	if (!lru_has_memcg(lru))
+		goto out;
+
+	for_each_mem_cgroup(memcg) {
+		if (!memcg_kmem_is_active(memcg))
+			continue;
+		isolated += list_lru_walk_node_memcg(lru, 0, memcg, isolate,
+						     cb_arg, &nr_to_walk);
+		if (nr_to_walk <= 0) {
+			mem_cgroup_iter_break(NULL, memcg);
+			break;
+		}
+	}
+out:
+	return isolated;
+}
+EXPORT_SYMBOL_GPL(list_lru_walk);
+
+static void list_lru_node_init(struct list_lru_node *nlru)
+{
+	spin_lock_init(&nlru->lock);
+	INIT_LIST_HEAD(&nlru->list);
+	nlru->nr_items = 0;
+}
+
+int __list_lru_init(struct list_lru *lru, bool memcg_enabled)
 {
 	int i;
-	size_t size = sizeof(*lru->node) * nr_node_ids;
+	int err = 0;
 
-	lru->node = kzalloc(size, GFP_KERNEL);
+	lru->node = kcalloc(nr_node_ids, sizeof(*lru->node), GFP_KERNEL);
 	if (!lru->node)
 		return -ENOMEM;
 
 	nodes_clear(lru->active_nodes);
-	for (i = 0; i < nr_node_ids; i++) {
-		spin_lock_init(&lru->node[i].lock);
-		INIT_LIST_HEAD(&lru->node[i].list);
-		lru->node[i].nr_items = 0;
+	for (i = 0; i < nr_node_ids; i++)
+		list_lru_node_init(&lru->node[i]);
+
+	err = memcg_list_lru_init(lru, memcg_enabled);
+	if (err) {
+		kfree(lru->node);
+		lru->node = NULL; /* see list_lru_destroy() */
 	}
-	return 0;
+
+	return err;
 }
-EXPORT_SYMBOL_GPL(list_lru_init);
+EXPORT_SYMBOL_GPL(__list_lru_init);
 
 void list_lru_destroy(struct list_lru *lru)
 {
+	/*
+	 * We might be called after partial initialisation (e.g. due to ENOMEM
+	 * error) so handle that appropriately.
+	 */
+	if (!lru->node)
+		return;
+
 	kfree(lru->node);
+	memcg_list_lru_destroy(lru);
 }
 EXPORT_SYMBOL_GPL(list_lru_destroy);
+
+#ifdef CONFIG_MEMCG_KMEM
+int list_lru_memcg_alloc(struct list_lru *lru, int memcg_id)
+{
+	struct list_lru_node *nlru;
+
+	nlru = kmalloc(sizeof(*nlru), GFP_KERNEL);
+	if (!nlru)
+		return -ENOMEM;
+
+	list_lru_node_init(nlru);
+
+	VM_BUG_ON(lru->memcg[memcg_id]);
+	lru->memcg[memcg_id] = nlru;
+	return 0;
+}
+
+void list_lru_memcg_free(struct list_lru *lru, int memcg_id)
+{
+	if (lru->memcg[memcg_id]) {
+		kfree(lru->memcg[memcg_id]);
+		lru->memcg[memcg_id] = NULL;
+	}
+}
+
+int list_lru_grow_memcg(struct list_lru *lru, size_t new_array_size)
+{
+	int i;
+	struct list_lru_node **memcg_lrus;
+
+	memcg_lrus = kcalloc(new_array_size, sizeof(*memcg_lrus), GFP_KERNEL);
+	if (!memcg_lrus)
+		return -ENOMEM;
+
+	if (lru->memcg) {
+		for_each_memcg_cache_index(i) {
+			if (lru->memcg[i])
+				memcg_lrus[i] = lru->memcg[i];
+		}
+	}
+
+	/*
+	 * Since we access the lru->memcg array lockless, inside an RCU
+	 * critical section (see lru_node_of_index()), we cannot free the old
+	 * version of the array right now. So we save it to lru->memcg_old to
+	 * be freed by the caller after a grace period.
+	 */
+	VM_BUG_ON(lru->memcg_old);
+	lru->memcg_old = lru->memcg;
+	rcu_assign_pointer(lru->memcg, memcg_lrus);
+	return 0;
+}
+#endif /* CONFIG_MEMCG_KMEM */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 13b3131..5fec8aa 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -55,6 +55,7 @@
 #include <linux/cpu.h>
 #include <linux/oom.h>
 #include <linux/lockdep.h>
+#include <linux/list_lru.h>
 #include "internal.h"
 #include <net/sock.h>
 #include <net/ip.h>
@@ -3238,6 +3239,160 @@ out:
 }
 
 /*
+ * This semaphore serializes activations of kmem accounting for memory cgroups.
+ * Holding it for reading guarantees no cgroups will become kmem active.
+ */
+static DECLARE_RWSEM(activate_kmem_sem);
+
+/*
+ * The list of all memcg-enabled list_lru structures. Needed for updating all
+ * per-memcg LRUs whenever a kmem-active memcg is created or destroyed. The
+ * list is updated under activate_kmem_sem held for reading, so taking it
+ * for writing is enough to safely iterate over the list.
+ */
+static LIST_HEAD(all_memcg_lrus);
+static DEFINE_SPINLOCK(all_memcg_lrus_lock);
+
+static void __memcg_destroy_all_lrus(int memcg_id)
+{
+	struct list_lru *lru;
+
+	list_for_each_entry(lru, &all_memcg_lrus, memcg_lrus_list)
+		list_lru_memcg_free(lru, memcg_id);
+}
+
+/*
+ * This function is called when a kmem-active memcg is destroyed in order to
+ * free LRUs corresponding to the memcg in all list_lru structures.
+ */
+static void memcg_destroy_all_lrus(struct mem_cgroup *memcg)
+{
+	int memcg_id;
+
+	memcg_id = memcg_cache_id(memcg);
+	if (memcg_id >= 0) {
+		down_write(&activate_kmem_sem);
+		__memcg_destroy_all_lrus(memcg_id);
+		up_write(&activate_kmem_sem);
+	}
+}
+
+/*
+ * This function allocates LRUs for a memcg in all list_lru structures. It is
+ * called with activate_kmem_sem held for writing when a new kmem-active memcg
+ * is added.
+ */
+static int memcg_init_all_lrus(int new_memcg_id)
+{
+	int err = 0;
+	int num_memcgs = new_memcg_id + 1;
+	int grow = (num_memcgs > memcg_limited_groups_array_size);
+	size_t new_array_size = memcg_caches_array_size(num_memcgs);
+	struct list_lru *lru;
+
+	if (grow) {
+		list_for_each_entry(lru, &all_memcg_lrus, memcg_lrus_list) {
+			err = list_lru_grow_memcg(lru, new_array_size);
+			if (err)
+				goto out;
+		}
+	}
+
+	list_for_each_entry(lru, &all_memcg_lrus, memcg_lrus_list) {
+		err = list_lru_memcg_alloc(lru, new_memcg_id);
+		if (err) {
+			__memcg_destroy_all_lrus(new_memcg_id);
+			break;
+		}
+	}
+out:
+	if (grow) {
+		synchronize_rcu();
+		list_for_each_entry(lru, &all_memcg_lrus, memcg_lrus_list) {
+			kfree(lru->memcg_old);
+			lru->memcg_old = NULL;
+		}
+	}
+	return err;
+}
+
+int memcg_list_lru_init(struct list_lru *lru, bool memcg_enabled)
+{
+	int err = 0;
+	int i;
+	struct mem_cgroup *memcg;
+
+	lru->memcg = NULL;
+	lru->memcg_old = NULL;
+	INIT_LIST_HEAD(&lru->memcg_lrus_list);
+
+	if (!memcg_enabled)
+		return 0;
+
+	down_read(&activate_kmem_sem);
+	if (!memcg_kmem_enabled())
+		goto out_list_add;
+
+	lru->memcg = kcalloc(memcg_limited_groups_array_size,
+			     sizeof(*lru->memcg), GFP_KERNEL);
+	if (!lru->memcg) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	for_each_mem_cgroup(memcg) {
+		int memcg_id;
+
+		memcg_id = memcg_cache_id(memcg);
+		if (memcg_id < 0)
+			continue;
+
+		err = list_lru_memcg_alloc(lru, memcg_id);
+		if (err) {
+			mem_cgroup_iter_break(NULL, memcg);
+			goto out_free_lru_memcg;
+		}
+	}
+out_list_add:
+	spin_lock(&all_memcg_lrus_lock);
+	list_add(&lru->memcg_lrus_list, &all_memcg_lrus);
+	spin_unlock(&all_memcg_lrus_lock);
+out:
+	up_read(&activate_kmem_sem);
+	return err;
+
+out_free_lru_memcg:
+	for (i = 0; i < memcg_limited_groups_array_size; i++)
+		list_lru_memcg_free(lru, i);
+	kfree(lru->memcg);
+	goto out;
+}
+
+void memcg_list_lru_destroy(struct list_lru *lru)
+{
+	int i, array_size;
+
+	if (list_empty(&lru->memcg_lrus_list))
+		return;
+
+	down_read(&activate_kmem_sem);
+
+	array_size = memcg_limited_groups_array_size;
+
+	spin_lock(&all_memcg_lrus_lock);
+	list_del(&lru->memcg_lrus_list);
+	spin_unlock(&all_memcg_lrus_lock);
+
+	up_read(&activate_kmem_sem);
+
+	if (lru->memcg) {
+		for (i = 0; i < array_size; i++)
+			list_lru_memcg_free(lru, i);
+		kfree(lru->memcg);
+	}
+}
+
+/*
  * During the creation a new cache, we need to disable our accounting mechanism
  * altogether. This is true even if we are not creating, but rather just
  * enqueing new caches to be created.
@@ -5129,9 +5284,7 @@ static ssize_t mem_cgroup_read(struct cgroup_subsys_state *css,
 }
 
 #ifdef CONFIG_MEMCG_KMEM
-static DEFINE_MUTEX(activate_kmem_mutex);
-
-/* should be called with activate_kmem_mutex held */
+/* should be called with activate_kmem_sem held for writing */
 static int __memcg_activate_kmem(struct mem_cgroup *memcg,
 				 unsigned long long limit)
 {
@@ -5177,12 +5330,21 @@ static int __memcg_activate_kmem(struct mem_cgroup *memcg,
 	}
 
 	/*
+	 * Initialize this cgroup's lists in each list_lru. This must be done
+	 * before memcg_update_all_caches(), where we update the
+	 * limited_groups_array_size.
+	 */
+	err = memcg_init_all_lrus(memcg_id);
+	if (err)
+		goto out_rmid;
+
+	/*
 	 * Make sure we have enough space for this cgroup in each kmem_cache's
 	 * memcg_params array.
 	 */
 	err = memcg_update_all_caches(memcg_id + 1);
 	if (err)
-		goto out_rmid;
+		goto out_destroy_all_lrus;
 
 	memcg->kmemcg_id = memcg_id;
 
@@ -5197,6 +5359,8 @@ out:
 	memcg_resume_kmem_account();
 	return err;
 
+out_destroy_all_lrus:
+	__memcg_destroy_all_lrus(memcg_id);
 out_rmid:
 	ida_simple_remove(&kmem_limited_groups, memcg_id);
 out_reset_limit:
@@ -5209,9 +5373,9 @@ static int memcg_activate_kmem(struct mem_cgroup *memcg,
 {
 	int ret;
 
-	mutex_lock(&activate_kmem_mutex);
+	down_write(&activate_kmem_sem);
 	ret = __memcg_activate_kmem(memcg, limit);
-	mutex_unlock(&activate_kmem_mutex);
+	up_write(&activate_kmem_sem);
 	return ret;
 }
 
@@ -5235,10 +5399,10 @@ static int memcg_propagate_kmem(struct mem_cgroup *memcg)
 	if (!parent)
 		goto out;
 
-	mutex_lock(&activate_kmem_mutex);
+	down_write(&activate_kmem_sem);
 	if (memcg_kmem_is_active(parent))
 		ret = __memcg_activate_kmem(memcg, RES_COUNTER_MAX);
-	mutex_unlock(&activate_kmem_mutex);
+	up_write(&activate_kmem_sem);
 out:
 	return ret;
 }
@@ -5929,6 +6093,7 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 static void memcg_destroy_kmem(struct mem_cgroup *memcg)
 {
 	mem_cgroup_sockets_destroy(memcg);
+	memcg_destroy_all_lrus(memcg);
 }
 
 static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
-- 
1.7.10.4
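
A note on the memcg_old trick used by list_lru_grow_memcg() above: the
rough user-space sketch below shows the same grow-and-park pattern with a
plain pointer assignment standing in for rcu_assign_pointer(); the struct
and function names are made up for the example, and the grace period is
reduced to an explicit table_free_old() call:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct table {
	void **slots;		/* models lru->memcg: readers dereference this */
	void **old_slots;	/* models lru->memcg_old: freed only later */
	size_t size;
};

/* Grow @t to @new_size slots, parking the old array instead of freeing it. */
static int table_grow(struct table *t, size_t new_size)
{
	void **new_slots = calloc(new_size, sizeof(*new_slots));

	if (!new_slots)
		return -1;
	if (t->slots)		/* carry over the existing entries */
		memcpy(new_slots, t->slots, t->size * sizeof(*new_slots));
	/*
	 * Publish the new array. In the kernel this is rcu_assign_pointer(),
	 * and the old array cannot be freed yet because lockless readers may
	 * still hold a pointer to it, so it is parked in old_slots.
	 */
	t->old_slots = t->slots;
	t->slots = new_slots;
	t->size = new_size;
	return 0;
}

/* Called once "after the grace period" (synchronize_rcu() in the kernel). */
static void table_free_old(struct table *t)
{
	free(t->old_slots);
	t->old_slots = NULL;
}

int main(void)
{
	struct table t = { NULL, NULL, 0 };

	if (table_grow(&t, 4) || table_grow(&t, 8))
		return 1;
	table_free_old(&t);	/* the 4-slot array is dropped here */
	free(t.slots);
	printf("grown to %zu slots\n", t.size);
	return 0;
}

The kernel version defers the actual kfree() of memcg_old until after a
single synchronize_rcu() in memcg_init_all_lrus(), so growing many LRUs at
once costs only one grace period.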


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v14 14/18] list_lru: add per-memcg lists
@ 2013-12-16 12:17   ` Vladimir Davydov
  0 siblings, 0 replies; 64+ messages in thread
From: Vladimir Davydov @ 2013-12-16 12:17 UTC (permalink / raw)
  To: dchinner, mhocko, hannes, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, glommer,
	Al Viro, Balbir Singh, KAMEZAWA Hiroyuki

There are several FS shrinkers, including super_block::s_shrink, that
keep reclaimable objects in the list_lru structure. Hence, to turn them
into memcg-aware shrinkers, it is enough to make list_lru per-memcg.

This patch does the trick. It adds an array of LRU lists to the list_lru
structure, one for each kmem-active memcg, and dispatches every item
addition or removal operation to the list corresponding to the memcg the
item is accounted to.

Since we already pass a shrink_control object to the list_lru count and
walk functions to specify the NUMA node to scan, and the target memcg is
held in this structure, there is no need to change the list_lru interface.

To make sure each kmem-active memcg has its list initialized in each
memcg-enabled list_lru, we keep all memcg-enabled list_lrus in a linked
list, which we iterate over to allocate per-memcg LRUs whenever a new
kmem-active memcg is added. To synchronize this with the creation of new
list_lrus, we have to take activate_kmem_mutex. Since using this mutex
as is would make all mounts proceed serially, we turn it into an rw
semaphore and take it for writing whenever a new kmem-active memcg is
created and for reading when we are about to create a list_lru. This
still does not allow mount_fs() to proceed concurrently with the creation
of a kmem-active memcg, but since memcg creation is a rather rare event,
this is not that critical.

The idea behind this patch, as well as its initial implementation,
belongs to Glauber Costa.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/list_lru.h   |  112 +++++++++++++------
 include/linux/memcontrol.h |   13 +++
 mm/list_lru.c              |  257 +++++++++++++++++++++++++++++++++++++++-----
 mm/memcontrol.c            |  181 +++++++++++++++++++++++++++++--
 4 files changed, 495 insertions(+), 68 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 194b1c4..9b228dc 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -11,6 +11,8 @@
 #include <linux/nodemask.h>
 #include <linux/shrinker.h>
 
+struct mem_cgroup;
+
 /* list_lru_walk_cb has to always return one of those */
 enum lru_status {
 	LRU_REMOVED,		/* item removed from list */
@@ -30,10 +32,52 @@ struct list_lru_node {
 struct list_lru {
 	struct list_lru_node	*node;
 	nodemask_t		active_nodes;
+#ifdef CONFIG_MEMCG_KMEM
+	/*
+	 * In order to provide the ability to scan objects from different
+	 * memory cgroups independently, we keep a separate LRU list for each
+	 * kmem-active memcg in this array. The array is RCU-protected and
+	 * indexed by memcg_cache_id().
+	 */
+	struct list_lru_node	**memcg;
+	/*
+	 * Every time a kmem-active memcg is created or destroyed, we have to
+	 * update the array of per-memcg LRUs in each memcg enabled list_lru
+	 * structure. To achieve that, we keep all memcg enabled list_lru
+	 * structures in the all_memcg_lrus list.
+	 */
+	struct list_head	memcg_lrus_list;
+	/*
+	 * Since the array of per-memcg LRUs is RCU-protected, we can only free
+	 * it after a call to synchronize_rcu(). To avoid multiple calls to
+	 * synchronize_rcu() when a lot of LRUs get updated at the same time,
+	 * which is a typical scenario, we will store the pointer to the
+	 * previous version of the array in the memcg_old field for each
+	 * list_lru structure, and then free them all at once after a single
+	 * call to synchronize_rcu().
+	 */
+	void			*memcg_old;
+#endif /* CONFIG_MEMCG_KMEM */
 };
 
+#ifdef CONFIG_MEMCG_KMEM
+int list_lru_memcg_alloc(struct list_lru *lru, int memcg_id);
+void list_lru_memcg_free(struct list_lru *lru, int memcg_id);
+int list_lru_grow_memcg(struct list_lru *lru, size_t new_array_size);
+#endif
+
 void list_lru_destroy(struct list_lru *lru);
-int list_lru_init(struct list_lru *lru);
+int __list_lru_init(struct list_lru *lru, bool memcg_enabled);
+
+static inline int list_lru_init(struct list_lru *lru)
+{
+	return __list_lru_init(lru, false);
+}
+
+static inline int list_lru_init_memcg(struct list_lru *lru)
+{
+	return __list_lru_init(lru, true);
+}
 
 /**
  * list_lru_add: add an element to the lru list's tail
@@ -67,39 +111,41 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item);
 bool list_lru_del(struct list_lru *lru, struct list_head *item);
 
 /**
- * list_lru_count_node: return the number of objects currently held by @lru
+ * list_lru_count_node_memcg: return the number of objects currently held by a
+ *  list_lru.
  * @lru: the lru pointer.
  * @nid: the node id to count from.
+ * @memcg: the memcg to count from.
  *
  * Always return a non-negative number, 0 for empty lists. There is no
  * guarantee that the list is not updated while the count is being computed.
  * Callers that want such a guarantee need to provide an outer lock.
  */
-unsigned long list_lru_count_node(struct list_lru *lru, int nid);
+unsigned long list_lru_count_node_memcg(struct list_lru *lru,
+					int nid, struct mem_cgroup *memcg);
 
-static inline unsigned long list_lru_shrink_count(struct list_lru *lru,
-						  struct shrink_control *sc)
+unsigned long list_lru_count(struct list_lru *lru);
+
+static inline unsigned long list_lru_count_node(struct list_lru *lru, int nid)
 {
-	return list_lru_count_node(lru, sc->nid);
+	return list_lru_count_node_memcg(lru, nid, NULL);
 }
 
-static inline unsigned long list_lru_count(struct list_lru *lru)
+static inline unsigned long list_lru_shrink_count(struct list_lru *lru,
+						  struct shrink_control *sc)
 {
-	long count = 0;
-	int nid;
-
-	for_each_node_mask(nid, lru->active_nodes)
-		count += list_lru_count_node(lru, nid);
-
-	return count;
+	return list_lru_count_node_memcg(lru, sc->nid, sc->memcg);
 }
 
 typedef enum lru_status
 (*list_lru_walk_cb)(struct list_head *item, spinlock_t *lock, void *cb_arg);
+
 /**
- * list_lru_walk_node: walk a list_lru, isolating and disposing freeable items.
+ * list_lru_walk_node_memcg: walk a list_lru, isolating and disposing freeable
+ *  items.
  * @lru: the lru pointer.
  * @nid: the node id to scan from.
+ * @memcg: the memcg to scan from.
  * @isolate: callback function that is resposible for deciding what to do with
  *  the item currently being scanned
  * @cb_arg: opaque type that will be passed to @isolate
@@ -117,31 +163,29 @@ typedef enum lru_status
  *
  * Return value: the number of objects effectively removed from the LRU.
  */
-unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
-				 list_lru_walk_cb isolate, void *cb_arg,
-				 unsigned long *nr_to_walk);
+unsigned long list_lru_walk_node_memcg(struct list_lru *lru,
+				       int nid, struct mem_cgroup *memcg,
+				       list_lru_walk_cb isolate, void *cb_arg,
+				       unsigned long *nr_to_walk);
+
+unsigned long list_lru_walk(struct list_lru *lru,
+			    list_lru_walk_cb isolate, void *cb_arg,
+			    unsigned long nr_to_walk);
 
 static inline unsigned long
-list_lru_shrink_walk(struct list_lru *lru, struct shrink_control *sc,
-		     list_lru_walk_cb isolate, void *cb_arg)
+list_lru_walk_node(struct list_lru *lru, int nid,
+		   list_lru_walk_cb isolate, void *cb_arg,
+		   unsigned long *nr_to_walk)
 {
-	return list_lru_walk_node(lru, sc->nid, isolate, cb_arg,
-				  &sc->nr_to_scan);
+	return list_lru_walk_node_memcg(lru, nid, NULL,
+					isolate, cb_arg, nr_to_walk);
 }
 
 static inline unsigned long
-list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
-	      void *cb_arg, unsigned long nr_to_walk)
+list_lru_shrink_walk(struct list_lru *lru, struct shrink_control *sc,
+		     list_lru_walk_cb isolate, void *cb_arg)
 {
-	long isolated = 0;
-	int nid;
-
-	for_each_node_mask(nid, lru->active_nodes) {
-		isolated += list_lru_walk_node(lru, nid, isolate,
-					       cb_arg, &nr_to_walk);
-		if (nr_to_walk <= 0)
-			break;
-	}
-	return isolated;
+	return list_lru_walk_node_memcg(lru, sc->nid, sc->memcg,
+					isolate, cb_arg, &sc->nr_to_scan);
 }
 #endif /* _LRU_LIST_H */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6001b31..44fc58a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -29,6 +29,7 @@ struct page_cgroup;
 struct page;
 struct mm_struct;
 struct kmem_cache;
+struct list_lru;
 
 /*
  * The corresponding mem_cgroup_stat_names is defined in mm/memcontrol.c,
@@ -538,6 +539,9 @@ __memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);
 void mem_cgroup_destroy_cache(struct kmem_cache *cachep);
 void kmem_cache_destroy_memcg_children(struct kmem_cache *s);
 
+int memcg_list_lru_init(struct list_lru *lru, bool memcg_enabled);
+void memcg_list_lru_destroy(struct list_lru *lru);
+
 /**
  * memcg_kmem_newpage_charge: verify if a new kmem allocation is allowed.
  * @gfp: the gfp allocation flags.
@@ -702,6 +706,15 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
 static inline void kmem_cache_destroy_memcg_children(struct kmem_cache *s)
 {
 }
+
+static inline int memcg_list_lru_init(struct list_lru *lru, bool memcg_enabled)
+{
+	return 0;
+}
+
+static inline void memcg_list_lru_destroy(struct list_lru *lru)
+{
+}
 #endif /* CONFIG_MEMCG_KMEM */
 #endif /* _LINUX_MEMCONTROL_H */
 
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 72f9dec..db644f8 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -7,19 +7,87 @@
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/mm.h>
-#include <linux/list_lru.h>
 #include <linux/slab.h>
+#include <linux/memcontrol.h>
+#include <linux/page_cgroup.h>
+#include <linux/list_lru.h>
+
+#ifdef CONFIG_MEMCG_KMEM
+static inline bool lru_has_memcg(struct list_lru *lru)
+{
+	return !!lru->memcg;
+}
+
+static struct list_lru_node *lru_node_of_index(struct list_lru *lru,
+					int nid, int memcg_id, bool *is_global)
+{
+	struct list_lru_node **memcg_lrus;
+	struct list_lru_node *nlru = NULL;
+
+	if (memcg_id < 0 || !lru_has_memcg(lru)) {
+		*is_global = true;
+		return &lru->node[nid];
+	}
+
+	rcu_read_lock();
+	memcg_lrus = rcu_dereference(lru->memcg);
+	nlru = memcg_lrus[memcg_id];
+	rcu_read_unlock();
+
+	*is_global = false;
+	return nlru;
+}
+
+static struct list_lru_node *lru_node_of_page(struct list_lru *lru,
+					struct page *page, bool *is_global)
+{
+	struct page_cgroup *pc;
+	struct mem_cgroup *memcg;
+
+	/*
+	 * Since a kmem page cannot change its cgroup after its allocation is
+	 * committed, we do not need to lock_page_cgroup() here.
+	 */
+	pc = lookup_page_cgroup(compound_head(page));
+	memcg = PageCgroupUsed(pc) ? pc->mem_cgroup : NULL;
+
+	return lru_node_of_index(lru, page_to_nid(page),
+				 memcg_cache_id(memcg), is_global);
+}
+#else /* !CONFIG_MEMCG_KMEM */
+static inline bool lru_has_memcg(struct list_lru *lru)
+{
+	return false;
+}
+
+static inline struct list_lru_node *lru_node_of_index(struct list_lru *lru,
+					int nid, int memcg_id, bool *is_global)
+{
+	*is_global = true;
+	return &lru->node[nid];
+}
+
+static inline struct list_lru_node *lru_node_of_page(struct list_lru *lru,
+					struct page *page, bool *is_global)
+{
+	return lru_node_of_index(lru, page_to_nid(page), -1, is_global);
+}
+#endif /* CONFIG_MEMCG_KMEM */
 
 bool list_lru_add(struct list_lru *lru, struct list_head *item)
 {
-	int nid = page_to_nid(virt_to_page(item));
-	struct list_lru_node *nlru = &lru->node[nid];
+	struct page *page = virt_to_page(item);
+	int nid = page_to_nid(page);
+	bool is_global;
+	struct list_lru_node *nlru;
+
+	nlru = lru_node_of_page(lru, page, &is_global);
 
 	spin_lock(&nlru->lock);
 	WARN_ON_ONCE(nlru->nr_items < 0);
 	if (list_empty(item)) {
 		list_add_tail(item, &nlru->list);
-		if (nlru->nr_items++ == 0)
+		if (nlru->nr_items++ == 0 && is_global)
 			node_set(nid, lru->active_nodes);
 		spin_unlock(&nlru->lock);
 		return true;
@@ -31,13 +99,17 @@ EXPORT_SYMBOL_GPL(list_lru_add);
 
 bool list_lru_del(struct list_lru *lru, struct list_head *item)
 {
-	int nid = page_to_nid(virt_to_page(item));
-	struct list_lru_node *nlru = &lru->node[nid];
+	struct page *page = virt_to_page(item);
+	int nid = page_to_nid(page);
+	bool is_global;
+	struct list_lru_node *nlru;
+
+	nlru = lru_node_of_page(lru, page, &is_global);
 
 	spin_lock(&nlru->lock);
 	if (!list_empty(item)) {
 		list_del_init(item);
-		if (--nlru->nr_items == 0)
+		if (--nlru->nr_items == 0 && is_global)
 			node_clear(nid, lru->active_nodes);
 		WARN_ON_ONCE(nlru->nr_items < 0);
 		spin_unlock(&nlru->lock);
@@ -48,11 +120,14 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item)
 }
 EXPORT_SYMBOL_GPL(list_lru_del);
 
-unsigned long
-list_lru_count_node(struct list_lru *lru, int nid)
+unsigned long list_lru_count_node_memcg(struct list_lru *lru,
+					int nid, struct mem_cgroup *memcg)
 {
 	unsigned long count = 0;
-	struct list_lru_node *nlru = &lru->node[nid];
+	bool is_global;
+	struct list_lru_node *nlru;
+
+	nlru = lru_node_of_index(lru, nid, memcg_cache_id(memcg), &is_global);
 
 	spin_lock(&nlru->lock);
 	WARN_ON_ONCE(nlru->nr_items < 0);
@@ -61,16 +136,41 @@ list_lru_count_node(struct list_lru *lru, int nid)
 
 	return count;
 }
-EXPORT_SYMBOL_GPL(list_lru_count_node);
+EXPORT_SYMBOL_GPL(list_lru_count_node_memcg);
+
+unsigned long list_lru_count(struct list_lru *lru)
+{
+	long count = 0;
+	int nid;
+	struct mem_cgroup *memcg;
+
+	for_each_node_mask(nid, lru->active_nodes)
+		count += list_lru_count_node(lru, nid);
+
+	if (!lru_has_memcg(lru))
+		goto out;
+
+	for_each_mem_cgroup(memcg) {
+		if (memcg_kmem_is_active(memcg))
+			count += list_lru_count_node_memcg(lru, 0, memcg);
+	}
+out:
+	return count;
+}
+EXPORT_SYMBOL_GPL(list_lru_count);
 
-unsigned long
-list_lru_walk_node(struct list_lru *lru, int nid, list_lru_walk_cb isolate,
-		   void *cb_arg, unsigned long *nr_to_walk)
+unsigned long list_lru_walk_node_memcg(struct list_lru *lru,
+				       int nid, struct mem_cgroup *memcg,
+				       list_lru_walk_cb isolate, void *cb_arg,
+				       unsigned long *nr_to_walk)
 {
 
-	struct list_lru_node	*nlru = &lru->node[nid];
 	struct list_head *item, *n;
 	unsigned long isolated = 0;
+	bool is_global;
+	struct list_lru_node *nlru;
+
+	nlru = lru_node_of_index(lru, nid, memcg_cache_id(memcg), &is_global);
 
 	spin_lock(&nlru->lock);
 restart:
@@ -88,7 +188,7 @@ restart:
 		ret = isolate(item, &nlru->lock, cb_arg);
 		switch (ret) {
 		case LRU_REMOVED:
-			if (--nlru->nr_items == 0)
+			if (--nlru->nr_items == 0 && is_global)
 				node_clear(nid, lru->active_nodes);
 			WARN_ON_ONCE(nlru->nr_items < 0);
 			isolated++;
@@ -112,29 +212,134 @@ restart:
 	spin_unlock(&nlru->lock);
 	return isolated;
 }
-EXPORT_SYMBOL_GPL(list_lru_walk_node);
+EXPORT_SYMBOL_GPL(list_lru_walk_node_memcg);
 
-int list_lru_init(struct list_lru *lru)
+unsigned long list_lru_walk(struct list_lru *lru,
+			    list_lru_walk_cb isolate, void *cb_arg,
+			    unsigned long nr_to_walk)
+{
+	long isolated = 0;
+	int nid;
+	struct mem_cgroup *memcg;
+
+	for_each_node_mask(nid, lru->active_nodes) {
+		isolated += list_lru_walk_node(lru, nid, isolate,
+					       cb_arg, &nr_to_walk);
+		if (nr_to_walk <= 0)
+			break;
+	}
+
+	if (!lru_has_memcg(lru))
+		goto out;
+
+	for_each_mem_cgroup(memcg) {
+		if (!memcg_kmem_is_active(memcg))
+			continue;
+		isolated += list_lru_walk_node_memcg(lru, 0, memcg, isolate,
+						     cb_arg, &nr_to_walk);
+		if (nr_to_walk <= 0) {
+			mem_cgroup_iter_break(NULL, memcg);
+			break;
+		}
+	}
+out:
+	return isolated;
+}
+EXPORT_SYMBOL_GPL(list_lru_walk);
+
+static void list_lru_node_init(struct list_lru_node *nlru)
+{
+	spin_lock_init(&nlru->lock);
+	INIT_LIST_HEAD(&nlru->list);
+	nlru->nr_items = 0;
+}
+
+int __list_lru_init(struct list_lru *lru, bool memcg_enabled)
 {
 	int i;
-	size_t size = sizeof(*lru->node) * nr_node_ids;
+	int err = 0;
 
-	lru->node = kzalloc(size, GFP_KERNEL);
+	lru->node = kcalloc(nr_node_ids, sizeof(*lru->node), GFP_KERNEL);
 	if (!lru->node)
 		return -ENOMEM;
 
 	nodes_clear(lru->active_nodes);
-	for (i = 0; i < nr_node_ids; i++) {
-		spin_lock_init(&lru->node[i].lock);
-		INIT_LIST_HEAD(&lru->node[i].list);
-		lru->node[i].nr_items = 0;
+	for (i = 0; i < nr_node_ids; i++)
+		list_lru_node_init(&lru->node[i]);
+
+	err = memcg_list_lru_init(lru, memcg_enabled);
+	if (err) {
+		kfree(lru->node);
+		lru->node = NULL; /* see list_lru_destroy() */
 	}
-	return 0;
+
+	return err;
 }
-EXPORT_SYMBOL_GPL(list_lru_init);
+EXPORT_SYMBOL_GPL(__list_lru_init);
 
 void list_lru_destroy(struct list_lru *lru)
 {
+	/*
+	 * We might be called after a partial initialisation (e.g. due to an
+	 * ENOMEM error), so handle that appropriately.
+	 */
+	if (!lru->node)
+		return;
+
 	kfree(lru->node);
+	memcg_list_lru_destroy(lru);
 }
 EXPORT_SYMBOL_GPL(list_lru_destroy);
+
+#ifdef CONFIG_MEMCG_KMEM
+int list_lru_memcg_alloc(struct list_lru *lru, int memcg_id)
+{
+	struct list_lru_node *nlru;
+
+	nlru = kmalloc(sizeof(*nlru), GFP_KERNEL);
+	if (!nlru)
+		return -ENOMEM;
+
+	list_lru_node_init(nlru);
+
+	VM_BUG_ON(lru->memcg[memcg_id]);
+	lru->memcg[memcg_id] = nlru;
+	return 0;
+}
+
+void list_lru_memcg_free(struct list_lru *lru, int memcg_id)
+{
+	if (lru->memcg[memcg_id]) {
+		kfree(lru->memcg[memcg_id]);
+		lru->memcg[memcg_id] = NULL;
+	}
+}
+
+int list_lru_grow_memcg(struct list_lru *lru, size_t new_array_size)
+{
+	int i;
+	struct list_lru_node **memcg_lrus;
+
+	memcg_lrus = kcalloc(new_array_size, sizeof(*memcg_lrus), GFP_KERNEL);
+	if (!memcg_lrus)
+		return -ENOMEM;
+
+	if (lru->memcg) {
+		for_each_memcg_cache_index(i) {
+			if (lru->memcg[i])
+				memcg_lrus[i] = lru->memcg[i];
+		}
+	}
+
+	/*
+	 * Since we access the lru->memcg array lockless, inside an RCU
+	 * critical section (see lru_node_of_index()), we cannot free the old
+	 * version of the array right now. So we save it to lru->memcg_old to
+	 * be freed by the caller after a grace period.
+	 */
+	VM_BUG_ON(lru->memcg_old);
+	lru->memcg_old = lru->memcg;
+	rcu_assign_pointer(lru->memcg, memcg_lrus);
+	return 0;
+}
+#endif /* CONFIG_MEMCG_KMEM */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 13b3131..5fec8aa 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -55,6 +55,7 @@
 #include <linux/cpu.h>
 #include <linux/oom.h>
 #include <linux/lockdep.h>
+#include <linux/list_lru.h>
 #include "internal.h"
 #include <net/sock.h>
 #include <net/ip.h>
@@ -3238,6 +3239,160 @@ out:
 }
 
 /*
+ * This semaphore serializes activations of kmem accounting for memory cgroups.
+ * Holding it for reading guarantees no cgroups will become kmem active.
+ */
+static DECLARE_RWSEM(activate_kmem_sem);
+
+/*
+ * The list of all memcg-enabled list_lru structures. Needed for updating all
+ * per-memcg LRUs whenever a kmem-active memcg is created or destroyed. The
+ * list is updated under the activate_kmem_sem held for reading, so to safely
+ * iterate over it, it is enough to take the activate_kmem_sem for writing.
+ */
+static LIST_HEAD(all_memcg_lrus);
+static DEFINE_SPINLOCK(all_memcg_lrus_lock);
+
+static void __memcg_destroy_all_lrus(int memcg_id)
+{
+	struct list_lru *lru;
+
+	list_for_each_entry(lru, &all_memcg_lrus, memcg_lrus_list)
+		list_lru_memcg_free(lru, memcg_id);
+}
+
+/*
+ * This function is called when a kmem-active memcg is destroyed in order to
+ * free LRUs corresponding to the memcg in all list_lru structures.
+ */
+static void memcg_destroy_all_lrus(struct mem_cgroup *memcg)
+{
+	int memcg_id;
+
+	memcg_id = memcg_cache_id(memcg);
+	if (memcg_id >= 0) {
+		down_write(&activate_kmem_sem);
+		__memcg_destroy_all_lrus(memcg_id);
+		up_write(&activate_kmem_sem);
+	}
+}
+
+/*
+ * This function allocates LRUs for a memcg in all list_lru structures. It is
+ * called with activate_kmem_sem held for writing when a new kmem-active memcg
+ * is added.
+ */
+static int memcg_init_all_lrus(int new_memcg_id)
+{
+	int err = 0;
+	int num_memcgs = new_memcg_id + 1;
+	int grow = (num_memcgs > memcg_limited_groups_array_size);
+	size_t new_array_size = memcg_caches_array_size(num_memcgs);
+	struct list_lru *lru;
+
+	if (grow) {
+		list_for_each_entry(lru, &all_memcg_lrus, memcg_lrus_list) {
+			err = list_lru_grow_memcg(lru, new_array_size);
+			if (err)
+				goto out;
+		}
+	}
+
+	list_for_each_entry(lru, &all_memcg_lrus, memcg_lrus_list) {
+		err = list_lru_memcg_alloc(lru, new_memcg_id);
+		if (err) {
+			__memcg_destroy_all_lrus(new_memcg_id);
+			break;
+		}
+	}
+out:
+	if (grow) {
+		synchronize_rcu();
+		list_for_each_entry(lru, &all_memcg_lrus, memcg_lrus_list) {
+			kfree(lru->memcg_old);
+			lru->memcg_old = NULL;
+		}
+	}
+	return err;
+}
+
+int memcg_list_lru_init(struct list_lru *lru, bool memcg_enabled)
+{
+	int err = 0;
+	int i;
+	struct mem_cgroup *memcg;
+
+	lru->memcg = NULL;
+	lru->memcg_old = NULL;
+	INIT_LIST_HEAD(&lru->memcg_lrus_list);
+
+	if (!memcg_enabled)
+		return 0;
+
+	down_read(&activate_kmem_sem);
+	if (!memcg_kmem_enabled())
+		goto out_list_add;
+
+	lru->memcg = kcalloc(memcg_limited_groups_array_size,
+			     sizeof(*lru->memcg), GFP_KERNEL);
+	if (!lru->memcg) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	for_each_mem_cgroup(memcg) {
+		int memcg_id;
+
+		memcg_id = memcg_cache_id(memcg);
+		if (memcg_id < 0)
+			continue;
+
+		err = list_lru_memcg_alloc(lru, memcg_id);
+		if (err) {
+			mem_cgroup_iter_break(NULL, memcg);
+			goto out_free_lru_memcg;
+		}
+	}
+out_list_add:
+	spin_lock(&all_memcg_lrus_lock);
+	list_add(&lru->memcg_lrus_list, &all_memcg_lrus);
+	spin_unlock(&all_memcg_lrus_lock);
+out:
+	up_read(&activate_kmem_sem);
+	return err;
+
+out_free_lru_memcg:
+	for (i = 0; i < memcg_limited_groups_array_size; i++)
+		list_lru_memcg_free(lru, i);
+	kfree(lru->memcg);
+	goto out;
+}
+
+void memcg_list_lru_destroy(struct list_lru *lru)
+{
+	int i, array_size;
+
+	if (list_empty(&lru->memcg_lrus_list))
+		return;
+
+	down_read(&activate_kmem_sem);
+
+	array_size = memcg_limited_groups_array_size;
+
+	spin_lock(&all_memcg_lrus_lock);
+	list_del(&lru->memcg_lrus_list);
+	spin_unlock(&all_memcg_lrus_lock);
+
+	up_read(&activate_kmem_sem);
+
+	if (lru->memcg) {
+		for (i = 0; i < array_size; i++)
+			list_lru_memcg_free(lru, i);
+		kfree(lru->memcg);
+	}
+}
+
+/*
  * During the creation a new cache, we need to disable our accounting mechanism
  * altogether. This is true even if we are not creating, but rather just
  * enqueing new caches to be created.
@@ -5129,9 +5284,7 @@ static ssize_t mem_cgroup_read(struct cgroup_subsys_state *css,
 }
 
 #ifdef CONFIG_MEMCG_KMEM
-static DEFINE_MUTEX(activate_kmem_mutex);
-
-/* should be called with activate_kmem_mutex held */
+/* should be called with activate_kmem_sem held for writing */
 static int __memcg_activate_kmem(struct mem_cgroup *memcg,
 				 unsigned long long limit)
 {
@@ -5177,12 +5330,21 @@ static int __memcg_activate_kmem(struct mem_cgroup *memcg,
 	}
 
 	/*
+	 * Initialize this cgroup's lists in each list_lru. This must be done
+	 * before memcg_update_all_caches(), where we update the
+	 * limited_groups_array_size.
+	 */
+	err = memcg_init_all_lrus(memcg_id);
+	if (err)
+		goto out_rmid;
+
+	/*
 	 * Make sure we have enough space for this cgroup in each kmem_cache's
 	 * memcg_params array.
 	 */
 	err = memcg_update_all_caches(memcg_id + 1);
 	if (err)
-		goto out_rmid;
+		goto out_destroy_all_lrus;
 
 	memcg->kmemcg_id = memcg_id;
 
@@ -5197,6 +5359,8 @@ out:
 	memcg_resume_kmem_account();
 	return err;
 
+out_destroy_all_lrus:
+	__memcg_destroy_all_lrus(memcg_id);
 out_rmid:
 	ida_simple_remove(&kmem_limited_groups, memcg_id);
 out_reset_limit:
@@ -5209,9 +5373,9 @@ static int memcg_activate_kmem(struct mem_cgroup *memcg,
 {
 	int ret;
 
-	mutex_lock(&activate_kmem_mutex);
+	down_write(&activate_kmem_sem);
 	ret = __memcg_activate_kmem(memcg, limit);
-	mutex_unlock(&activate_kmem_mutex);
+	up_write(&activate_kmem_sem);
 	return ret;
 }
 
@@ -5235,10 +5399,10 @@ static int memcg_propagate_kmem(struct mem_cgroup *memcg)
 	if (!parent)
 		goto out;
 
-	mutex_lock(&activate_kmem_mutex);
+	down_write(&activate_kmem_sem);
 	if (memcg_kmem_is_active(parent))
 		ret = __memcg_activate_kmem(memcg, RES_COUNTER_MAX);
-	mutex_unlock(&activate_kmem_mutex);
+	up_write(&activate_kmem_sem);
 out:
 	return ret;
 }
@@ -5929,6 +6093,7 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 static void memcg_destroy_kmem(struct mem_cgroup *memcg)
 {
 	mem_cgroup_sockets_destroy(memcg);
+	memcg_destroy_all_lrus(memcg);
 }
 
 static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v14 15/18] fs: make shrinker memcg aware
  2013-12-16 12:16 ` Vladimir Davydov
@ 2013-12-16 12:17   ` Vladimir Davydov
  -1 siblings, 0 replies; 64+ messages in thread
From: Vladimir Davydov @ 2013-12-16 12:17 UTC (permalink / raw)
  To: dchinner, mhocko, hannes, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, glommer, Al Viro

Now, to make any list_lru-based shrinker memcg aware, all we need to do is
initialize its list_lru as memcg-enabled. Let's do it for the generic FS
shrinker (super_block::s_shrink) and mark it as memcg aware.
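
For reference, the same pattern applies to any list_lru-based shrinker. Below
is a minimal, purely illustrative sketch; the my_* names are hypothetical, and
a list_lru_shrink_count() counterpart to list_lru_shrink_walk() is assumed:

#include <linux/init.h>
#include <linux/list_lru.h>
#include <linux/shrinker.h>

static struct list_lru my_lru;

/*
 * Called under the LRU lock for each item walked; the callback must delist
 * the item itself. A real user would move it to a dispose list and free it
 * after the walk; that part is omitted here for brevity.
 */
static enum lru_status my_isolate(struct list_head *item,
				  spinlock_t *lock, void *arg)
{
	list_del_init(item);
	return LRU_REMOVED;
}

static unsigned long my_count(struct shrinker *s, struct shrink_control *sc)
{
	/* Count only the node/memcg encoded in sc (assumed helper). */
	return list_lru_shrink_count(&my_lru, sc);
}

static unsigned long my_scan(struct shrinker *s, struct shrink_control *sc)
{
	/* Walk only the list of the node/memcg encoded in sc. */
	return list_lru_shrink_walk(&my_lru, sc, my_isolate, NULL);
}

static struct shrinker my_shrinker = {
	.count_objects	= my_count,
	.scan_objects	= my_scan,
	.seeks		= DEFAULT_SEEKS,
	.flags		= SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE,
};

static int __init my_init(void)
{
	int err;

	/* Per-memcg lists are created only for memcg-enabled list_lrus. */
	err = list_lru_init_memcg(&my_lru);
	if (err)
		return err;
	return register_shrinker(&my_shrinker);
}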

There are other FS-specific shrinkers that use list_lru for storing
objects, such as the XFS and GFS2 dquot cache shrinkers, but since they
reclaim objects that may be shared among different cgroups, there is no
point in making them memcg aware.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/super.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 9084c3d..654d021 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -181,9 +181,9 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 	INIT_HLIST_BL_HEAD(&s->s_anon);
 	INIT_LIST_HEAD(&s->s_inodes);
 
-	if (list_lru_init(&s->s_dentry_lru))
+	if (list_lru_init_memcg(&s->s_dentry_lru))
 		goto fail;
-	if (list_lru_init(&s->s_inode_lru))
+	if (list_lru_init_memcg(&s->s_inode_lru))
 		goto fail;
 
 	INIT_LIST_HEAD(&s->s_mounts);
@@ -221,7 +221,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 	s->s_shrink.scan_objects = super_cache_scan;
 	s->s_shrink.count_objects = super_cache_count;
 	s->s_shrink.batch = 1024;
-	s->s_shrink.flags = SHRINKER_NUMA_AWARE;
+	s->s_shrink.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE;
 	return s;
 
 fail:
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v14 16/18] vmpressure: in-kernel notifications
  2013-12-16 12:16 ` Vladimir Davydov
@ 2013-12-16 12:17   ` Vladimir Davydov
  -1 siblings, 0 replies; 64+ messages in thread
From: Vladimir Davydov @ 2013-12-16 12:17 UTC (permalink / raw)
  To: dchinner, mhocko, hannes, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, glommer,
	John Stultz, Joonsoo Kim, Kamezawa Hiroyuki

From: Glauber Costa <glommer@openvz.org>

During the past weeks, it became clear to us that the shrinker interface
we have right now works very well for some particular types of users,
but not that well for others. The latter are usually people interested
in one-shot notifications, who were forced to adapt themselves to the
count+scan behavior of shrinkers. To do so, they had no choice but to
greatly abuse the shrinker interface, producing little monsters all over.

During LSF/MM, one of the proposals that popped up during our session
was to reuse Anton Vorontsov's vmpressure for this. Vmpressure
notifications are designed for userspace consumption, but also provide a
well-established, cgroup-aware entry point for notifications.

This patch extends that to also support in-kernel users. Events that
should be generated for in-kernel consumption will be marked as such,
and for those, we will call a registered function instead of triggering
an eventfd notification.
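
As a purely illustrative sketch of what an in-kernel user could look like
(the my_* names and the css handle are placeholders; only
vmpressure_register_kernel_event() and its signature come from this patch):

#include <linux/cgroup.h>
#include <linux/vmpressure.h>

/*
 * Hypothetical consumer: drop some private reclaimable state whenever the
 * cgroup this css belongs to comes under memory pressure.
 */
static void my_drop_caches(void)
{
	/* one-shot reaction to pressure; no count+scan dance needed */
}

static int my_register_pressure_hook(struct cgroup_subsys_state *css)
{
	/*
	 * The callback is invoked together with the userspace eventfd
	 * notifications for the same vmpressure instance.
	 */
	return vmpressure_register_kernel_event(css, my_drop_caches);
}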

Please note that due to my lack of understanding of each shrinker user,
I will stay away from converting the actual users; you are all welcome
to do so.

Signed-off-by: Glauber Costa <glommer@openvz.org>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Anton Vorontsov <anton@enomsg.org>
Acked-by: Pekka Enberg <penberg@kernel.org>
Reviewed-by: Greg Thelen <gthelen@google.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/vmpressure.h |    5 +++++
 mm/vmpressure.c            |   53 +++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 55 insertions(+), 3 deletions(-)

diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
index 3f3788d..9102e53 100644
--- a/include/linux/vmpressure.h
+++ b/include/linux/vmpressure.h
@@ -19,6 +19,9 @@ struct vmpressure {
 	/* Have to grab the lock on events traversal or modifications. */
 	struct mutex events_lock;
 
+	/* False if only kernel users want to be notified, true otherwise. */
+	bool notify_userspace;
+
 	struct work_struct work;
 };
 
@@ -38,6 +41,8 @@ extern int vmpressure_register_event(struct cgroup_subsys_state *css,
 				     struct cftype *cft,
 				     struct eventfd_ctx *eventfd,
 				     const char *args);
+extern int vmpressure_register_kernel_event(struct cgroup_subsys_state *css,
+					    void (*fn)(void));
 extern void vmpressure_unregister_event(struct cgroup_subsys_state *css,
 					struct cftype *cft,
 					struct eventfd_ctx *eventfd);
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index e0f6283..730e7c1 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -130,8 +130,12 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
 }
 
 struct vmpressure_event {
-	struct eventfd_ctx *efd;
+	union {
+		struct eventfd_ctx *efd;
+		void (*fn)(void);
+	};
 	enum vmpressure_levels level;
+	bool kernel_event;
 	struct list_head node;
 };
 
@@ -147,12 +151,15 @@ static bool vmpressure_event(struct vmpressure *vmpr,
 	mutex_lock(&vmpr->events_lock);
 
 	list_for_each_entry(ev, &vmpr->events, node) {
-		if (level >= ev->level) {
+		if (ev->kernel_event) {
+			ev->fn();
+		} else if (vmpr->notify_userspace && level >= ev->level) {
 			eventfd_signal(ev->efd, 1);
 			signalled = true;
 		}
 	}
 
+	vmpr->notify_userspace = false;
 	mutex_unlock(&vmpr->events_lock);
 
 	return signalled;
@@ -222,7 +229,7 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
 	 * we account it too.
 	 */
 	if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
-		return;
+		goto schedule;
 
 	/*
 	 * If we got here with no pages scanned, then that is an indicator
@@ -239,8 +246,15 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
 	vmpr->scanned += scanned;
 	vmpr->reclaimed += reclaimed;
 	scanned = vmpr->scanned;
+	/*
+	 * Reaching this point means userspace should be notified too; if we
+	 * bailed out above, only kernel events are triggered. The worker
+	 * thread clears this flag once the notifications are all delivered.
+	 */
+	vmpr->notify_userspace = true;
 	spin_unlock(&vmpr->sr_lock);
 
+schedule:
 	if (scanned < vmpressure_win)
 		return;
 	schedule_work(&vmpr->work);
@@ -324,6 +338,39 @@ int vmpressure_register_event(struct cgroup_subsys_state *css,
 }
 
 /**
+ * vmpressure_register_kernel_event() - Register kernel-side notification
+ * @css:	css that is interested in vmpressure notifications
+ * @fn:		function to be called when pressure happens
+ *
+ * This function registers in-kernel users interested in receiving
+ * notifications about pressure conditions. Pressure notifications will be
+ * triggered at the same time as userspace notifications (with no particular
+ * ordering relative to them).
+ *
+ * Pressure notifications are an alternative method to shrinkers and are
+ * well suited to users that are interested in a one-shot notification via
+ * a well-defined, cgroup-aware interface.
+ */
+int vmpressure_register_kernel_event(struct cgroup_subsys_state *css,
+				      void (*fn)(void))
+{
+	struct vmpressure *vmpr = css_to_vmpressure(css);
+	struct vmpressure_event *ev;
+
+	ev = kzalloc(sizeof(*ev), GFP_KERNEL);
+	if (!ev)
+		return -ENOMEM;
+
+	ev->kernel_event = true;
+	ev->fn = fn;
+
+	mutex_lock(&vmpr->events_lock);
+	list_add(&ev->node, &vmpr->events);
+	mutex_unlock(&vmpr->events_lock);
+	return 0;
+}
+
+/**
  * vmpressure_unregister_event() - Unbind eventfd from vmpressure
  * @css:	css handle
  * @cft:	cgroup control files handle
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v14 17/18] memcg: reap dead memcgs upon global memory pressure
  2013-12-16 12:16 ` Vladimir Davydov
  (?)
@ 2013-12-16 12:17   ` Vladimir Davydov
  -1 siblings, 0 replies; 64+ messages in thread
From: Vladimir Davydov @ 2013-12-16 12:17 UTC (permalink / raw)
  To: dchinner, mhocko, hannes, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, glommer,
	Anton Vorontsov, John Stultz, Kamezawa Hiroyuki

From: Glauber Costa <glommer@openvz.org>

When we delete kmem-enabled memcgs, they can still linger around as
zombies for a while. The reason is that their objects may still be alive,
and we won't be able to delete them at destruction time.

The only entry point for that, though, is the shrinkers. The shrinker
interface, however, is not exactly tailored to our needs. It could be made
a little bit better by using the API Dave Chinner proposed, but it is
still not ideal, since we aren't really a count-and-scan event, but more
a one-off flush-all-you-can event that would have to abuse that API somehow.

Signed-off-by: Glauber Costa <glommer@openvz.org>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c |   80 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 77 insertions(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5fec8aa..963285f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -287,8 +287,16 @@ struct mem_cgroup {
 	/* thresholds for mem+swap usage. RCU-protected */
 	struct mem_cgroup_thresholds memsw_thresholds;
 
-	/* For oom notifier event fd */
-	struct list_head oom_notify;
+	union {
+		/* For oom notifier event fd */
+		struct list_head oom_notify;
+		/*
+		 * We can only trigger an OOM event if the memcg is alive,
+		 * so we reuse this field to hook the memcg into the list
+		 * of dead memcgs.
+		 */
+		struct list_head dead;
+	};
 
 	/*
 	 * Should we move charges of a task when a task is moved into this
@@ -336,6 +344,29 @@ struct mem_cgroup {
 	/* WARNING: nodeinfo must be the last member here */
 };
 
+#if defined(CONFIG_MEMCG_KMEM) || defined(CONFIG_MEMCG_SWAP)
+static LIST_HEAD(dangling_memcgs);
+static DEFINE_MUTEX(dangling_memcgs_mutex);
+
+static inline void memcg_dangling_del(struct mem_cgroup *memcg)
+{
+	mutex_lock(&dangling_memcgs_mutex);
+	list_del(&memcg->dead);
+	mutex_unlock(&dangling_memcgs_mutex);
+}
+
+static inline void memcg_dangling_add(struct mem_cgroup *memcg)
+{
+	INIT_LIST_HEAD(&memcg->dead);
+	mutex_lock(&dangling_memcgs_mutex);
+	list_add(&memcg->dead, &dangling_memcgs);
+	mutex_unlock(&dangling_memcgs_mutex);
+}
+#else
+static inline void memcg_dangling_del(struct mem_cgroup *memcg) {}
+static inline void memcg_dangling_add(struct mem_cgroup *memcg) {}
+#endif
+
 static size_t memcg_size(void)
 {
 	return sizeof(struct mem_cgroup) +
@@ -6076,6 +6107,41 @@ static int mem_cgroup_oom_control_write(struct cgroup_subsys_state *css,
 }
 
 #ifdef CONFIG_MEMCG_KMEM
+static void memcg_vmpressure_shrink_dead(void)
+{
+	struct memcg_cache_params *params, *tmp;
+	struct kmem_cache *cachep;
+	struct mem_cgroup *memcg;
+
+	mutex_lock(&dangling_memcgs_mutex);
+	list_for_each_entry(memcg, &dangling_memcgs, dead) {
+		mutex_lock(&memcg->slab_caches_mutex);
+		/* The element may go away as an indirect result of shrink */
+		list_for_each_entry_safe(params, tmp,
+					 &memcg->memcg_slab_caches, list) {
+			cachep = memcg_params_to_cache(params);
+			/*
+			 * The cpu_hotplug lock is taken in kmem_cache_create
+			 * outside the slab_caches_mutex manipulation. It will
+			 * also be taken by kmem_cache_shrink to flush the
+			 * cache, so we need to drop the mutex here. That is
+			 * all right because the mutex only protects elements
+			 * moving in and out of the list.
+			 */
+			mutex_unlock(&memcg->slab_caches_mutex);
+			kmem_cache_shrink(cachep);
+			mutex_lock(&memcg->slab_caches_mutex);
+		}
+		mutex_unlock(&memcg->slab_caches_mutex);
+	}
+	mutex_unlock(&dangling_memcgs_mutex);
+}
+
+static void memcg_register_kmem_events(struct cgroup_subsys_state *css)
+{
+	vmpressure_register_kernel_event(css, memcg_vmpressure_shrink_dead);
+}
+
 static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 {
 	int ret;
@@ -6130,6 +6196,10 @@ static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
 		css_put(&memcg->css);
 }
 #else
+static void memcg_register_kmem_events(struct cgroup_subsys_state *css)
+{
+}
+
 static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 {
 	return 0;
@@ -6467,8 +6537,10 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
 	if (css->cgroup->id > MEM_CGROUP_ID_MAX)
 		return -ENOSPC;
 
-	if (!parent)
+	if (!parent) {
+		memcg_register_kmem_events(css);
 		return 0;
+	}
 
 	mutex_lock(&memcg_create_mutex);
 
@@ -6529,6 +6601,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	mem_cgroup_invalidate_reclaim_iterators(memcg);
 	mem_cgroup_reparent_charges(memcg);
 	mem_cgroup_destroy_all_caches(memcg);
+	memcg_dangling_add(memcg);
 	vmpressure_cleanup(&memcg->vmpressure);
 }
 
@@ -6573,6 +6646,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 	mem_cgroup_reparent_charges(memcg);
 
 	memcg_destroy_kmem(memcg);
+	memcg_dangling_del(memcg);
 	__mem_cgroup_free(memcg);
 }
 
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v14 18/18] memcg: flush memcg items upon memcg destruction
  2013-12-16 12:16 ` Vladimir Davydov
@ 2013-12-16 12:17   ` Vladimir Davydov
  -1 siblings, 0 replies; 64+ messages in thread
From: Vladimir Davydov @ 2013-12-16 12:17 UTC (permalink / raw)
  To: dchinner, mhocko, hannes, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, glommer,
	Balbir Singh, KAMEZAWA Hiroyuki

From: Glauber Costa <glommer@openvz.org>

When a memcg is destroyed, it won't be released immediately: it lingers
until all of its objects are gone. This means that if a memcg is restarted
with the very same workload - a very common case - the objects already
cached won't be billed to the new memcg. This is mostly undesirable since
a container can exploit this by restarting itself every time it reaches
its limit, and then coming up again with a fresh new limit.

Now that we have targeted reclaim, I maintain that a memcg that is
destroyed should be flushed away. It makes perfect sense if we assume
that a memcg that goes away most likely indicates an isolated workload
that is terminated.

Signed-off-by: Glauber Costa <glommer@openvz.org>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 963285f..28d5472 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6162,12 +6162,40 @@ static void memcg_destroy_kmem(struct mem_cgroup *memcg)
 	memcg_destroy_all_lrus(memcg);
 }
 
+static void memcg_drop_slab(struct mem_cgroup *memcg)
+{
+	struct shrink_control shrink = {
+		.gfp_mask = GFP_KERNEL,
+		.target_mem_cgroup = memcg,
+	};
+	unsigned long nr_objects;
+
+	nodes_setall(shrink.nodes_to_scan);
+	do {
+		nr_objects = shrink_slab(&shrink, 1000, 1000);
+	} while (nr_objects > 0);
+}
+
 static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
 {
 	if (!memcg_kmem_is_active(memcg))
 		return;
 
 	/*
+	 * When a memcg is destroyed, it won't be immediately released until
+	 * all objects are gone. This means that if a memcg is restarted with
+	 * the very same workload - a very common case - the objects already
+	 * cached won't be billed to the new memcg. This is mostly undesirable
+	 * since a container can exploit this by restarting itself every time
+	 * it reaches its limit, and then coming up again with a fresh new limit.
+	 *
+	 * Therefore a memcg that is destroyed should be flushed away. It makes
+	 * perfect sense if we assume that a memcg that goes away indicates an
+	 * isolated workload that is terminated.
+	 */
+	memcg_drop_slab(memcg);
+
+	/*
 	 * kmem charges can outlive the cgroup. In the case of slab
 	 * pages, for instance, a page contain objects from various
 	 * processes. As we prevent from taking a reference for every
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [PATCH v14 16/18] vmpressure: in-kernel notifications
  2013-12-16 12:17   ` Vladimir Davydov
@ 2013-12-20 14:26     ` Luiz Capitulino
  -1 siblings, 0 replies; 64+ messages in thread
From: Luiz Capitulino @ 2013-12-20 14:26 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: dchinner, mhocko, hannes, akpm, linux-kernel, linux-mm, cgroups,
	devel, glommer, glommer, John Stultz, Joonsoo Kim,
	Kamezawa Hiroyuki

On Mon, 16 Dec 2013 16:17:05 +0400
Vladimir Davydov <vdavydov@parallels.com> wrote:

> From: Glauber Costa <glommer@openvz.org>
> 
> During the past weeks, it became clear to us that the shrinker interface
> we have right now works very well for some particular types of users,
> but not that well for others. The latter are usually people interested
> in one-shot notifications, who were forced to adapt themselves to the
> count+scan behavior of shrinkers. To do so, they had no choice but to
> greatly abuse the shrinker interface, producing little monsters all over.
> 
> During LSF/MM, one of the proposals that popped up during our session
> was to reuse Anton Vorontsov's vmpressure for this. Vmpressure
> notifications are designed for userspace consumption, but also provide a
> well-established, cgroup-aware entry point for notifications.

I have the exact problem described above for a project I'm working on
and this solution seems to solve it well.

However, I had a few issues while trying to use this interface. I'll
comment on them below, but please take this more as advice seeking
than patch review.

> This patch extends that to also support in-kernel users. Events that
> should be generated for in-kernel consumption will be marked as such,
> and for those, we will call a registered function instead of triggering
> an eventfd notification.
> 
> Please note that due to my lack of understanding of each shrinker user,
> I will stay away from converting the actual users; you are all welcome
> to do so.
> 
> Signed-off-by: Glauber Costa <glommer@openvz.org>
> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
> Acked-by: Anton Vorontsov <anton@enomsg.org>
> Acked-by: Pekka Enberg <penberg@kernel.org>
> Reviewed-by: Greg Thelen <gthelen@google.com>
> Cc: Dave Chinner <dchinner@redhat.com>
> Cc: John Stultz <john.stultz@linaro.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/vmpressure.h |    5 +++++
>  mm/vmpressure.c            |   53 +++++++++++++++++++++++++++++++++++++++++---
>  2 files changed, 55 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
> index 3f3788d..9102e53 100644
> --- a/include/linux/vmpressure.h
> +++ b/include/linux/vmpressure.h
> @@ -19,6 +19,9 @@ struct vmpressure {
>  	/* Have to grab the lock on events traversal or modifications. */
>  	struct mutex events_lock;
>  
> +	/* False if only kernel users want to be notified, true otherwise. */
> +	bool notify_userspace;
> +
>  	struct work_struct work;
>  };
>  
> @@ -38,6 +41,8 @@ extern int vmpressure_register_event(struct cgroup_subsys_state *css,
>  				     struct cftype *cft,
>  				     struct eventfd_ctx *eventfd,
>  				     const char *args);
> +extern int vmpressure_register_kernel_event(struct cgroup_subsys_state *css,
> +					    void (*fn)(void));
>  extern void vmpressure_unregister_event(struct cgroup_subsys_state *css,
>  					struct cftype *cft,
>  					struct eventfd_ctx *eventfd);
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> index e0f6283..730e7c1 100644
> --- a/mm/vmpressure.c
> +++ b/mm/vmpressure.c
> @@ -130,8 +130,12 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
>  }
>  
>  struct vmpressure_event {
> -	struct eventfd_ctx *efd;
> +	union {
> +		struct eventfd_ctx *efd;
> +		void (*fn)(void);

How does the callback access its private data?
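
To make that concrete: with a void (*fn)(void) callback, a caller can only
reach per-instance state through file-scope variables, roughly like the
sketch below (the cache type and the trim helper are made up, they are not
part of this series):

	static struct my_cache *my_cache;	/* only reachable via file scope */

	static void my_pressure_cb(void)
	{
		/* neither private data nor the pressure level is passed in */
		my_cache_trim(my_cache);
	}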

> +	};
>  	enum vmpressure_levels level;
> +	bool kernel_event;
>  	struct list_head node;
>  };
>  
> @@ -147,12 +151,15 @@ static bool vmpressure_event(struct vmpressure *vmpr,
>  	mutex_lock(&vmpr->events_lock);
>  
>  	list_for_each_entry(ev, &vmpr->events, node) {
> -		if (level >= ev->level) {
> +		if (ev->kernel_event) {
> +			ev->fn();

I think it would be interesting to pass 'level' to the callback (I'll
probably use it myself), but we could wait for an in-tree user before
adding it.

> +		} else if (vmpr->notify_userspace && level >= ev->level) {
>  			eventfd_signal(ev->efd, 1);
>  			signalled = true;
>  		}
>  	}
>  
> +	vmpr->notify_userspace = false;
>  	mutex_unlock(&vmpr->events_lock);
>  
>  	return signalled;
> @@ -222,7 +229,7 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
>  	 * we account it too.
>  	 */
>  	if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
> -		return;
> +		goto schedule;
>  
>  	/*
>  	 * If we got here with no pages scanned, then that is an indicator
> @@ -239,8 +246,15 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
>  	vmpr->scanned += scanned;
>  	vmpr->reclaimed += reclaimed;
>  	scanned = vmpr->scanned;
> +	/*
> +	 * If we didn't reach this point, only kernel events will be triggered.
> +	 * It is the job of the worker thread to clean this up once the
> +	 * notifications are all delivered.
> +	 */
> +	vmpr->notify_userspace = true;
>  	spin_unlock(&vmpr->sr_lock);
>  
> +schedule:
>  	if (scanned < vmpressure_win)
>  		return;
>  	schedule_work(&vmpr->work);
> @@ -324,6 +338,39 @@ int vmpressure_register_event(struct cgroup_subsys_state *css,
>  }
>  
>  /**
> + * vmpressure_register_kernel_event() - Register kernel-side notification
> + * @css:	css that is interested in vmpressure notifications
> + * @fn:		function to be called when pressure happens
> + *
> + * This function register in-kernel users interested in receiving notifications
> + * about pressure conditions. Pressure notifications will be triggered at the
> + * same time as userspace notifications (with no particular ordering relative
> + * to it).
> + *
> + * Pressure notifications are a alternative method to shrinkers and will serve
> + * well users that are interested in a one-shot notification, with a
> + * well-defined cgroup aware interface.
> + */
> +int vmpressure_register_kernel_event(struct cgroup_subsys_state *css,
> +				      void (*fn)(void))
> +{
> +	struct vmpressure *vmpr = css_to_vmpressure(css);

This doesn't allow for css=NULL. What's the recommended way for one of
today's shrinkers (which are not related to cgroups) to register with this API?

Also, you don't seem to provide a way to de-register from the event.
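
A de-registration counterpart would presumably mirror the userspace one,
e.g. (just a sketch against the structures added by this patch, not code
from the series):

	void vmpressure_unregister_kernel_event(struct cgroup_subsys_state *css,
						void (*fn)(void))
	{
		struct vmpressure *vmpr = css_to_vmpressure(css);
		struct vmpressure_event *ev;

		mutex_lock(&vmpr->events_lock);
		list_for_each_entry(ev, &vmpr->events, node) {
			if (ev->kernel_event && ev->fn == fn) {
				list_del(&ev->node);
				kfree(ev);
				break;
			}
		}
		mutex_unlock(&vmpr->events_lock);
	}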

I hacked a patch to be able to use this; it seems to work, but it's an ugly
hack:

---
 include/linux/vmpressure.h |  3 ++-
 mm/vmpressure.c            | 13 +++++++++----
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
index 9102e53..de416b6 100644
--- a/include/linux/vmpressure.h
+++ b/include/linux/vmpressure.h
@@ -42,7 +42,8 @@ extern int vmpressure_register_event(struct cgroup_subsys_state *css,
 				     struct eventfd_ctx *eventfd,
 				     const char *args);
 extern int vmpressure_register_kernel_event(struct cgroup_subsys_state *css,
-					    void (*fn)(void));
+				     	    void (*fn)(void *data, int level),
+					    void *data);
 extern void vmpressure_unregister_event(struct cgroup_subsys_state *css,
 					struct cftype *cft,
 					struct eventfd_ctx *eventfd);
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index 730e7c1..4ed0e85 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -132,9 +132,10 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
 struct vmpressure_event {
 	union {
 		struct eventfd_ctx *efd;
-		void (*fn)(void);
+		void (*fn)(void *data, int level);
 	};
 	enum vmpressure_levels level;
+	void *data;
 	bool kernel_event;
 	struct list_head node;
 };
@@ -152,7 +153,7 @@ static bool vmpressure_event(struct vmpressure *vmpr,
 
 	list_for_each_entry(ev, &vmpr->events, node) {
 		if (ev->kernel_event) {
-			ev->fn();
+			ev->fn(ev->data, level);
 		} else if (vmpr->notify_userspace && level >= ev->level) {
 			eventfd_signal(ev->efd, 1);
 			signalled = true;
@@ -352,21 +353,25 @@ int vmpressure_register_event(struct cgroup_subsys_state *css,
  * well-defined cgroup aware interface.
  */
 int vmpressure_register_kernel_event(struct cgroup_subsys_state *css,
-				      void (*fn)(void))
+				     void (*fn)(void *data, int level), void *data)
 {
-	struct vmpressure *vmpr = css_to_vmpressure(css);
+	struct vmpressure *vmpr;
 	struct vmpressure_event *ev;
 
+	vmpr = css ? css_to_vmpressure(css) : memcg_to_vmpressure(NULL);
+
 	ev = kzalloc(sizeof(*ev), GFP_KERNEL);
 	if (!ev)
 		return -ENOMEM;
 
 	ev->kernel_event = true;
+	ev->data = data;
 	ev->fn = fn;
 
 	mutex_lock(&vmpr->events_lock);
 	list_add(&ev->node, &vmpr->events);
 	mutex_unlock(&vmpr->events_lock);
+
 	return 0;
 }
 
-- 
1.8.3.1
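
For reference, a non-cgroup in-kernel user would then register with the
extended interface roughly as below (the reclaimer structure, the trim hook
and the init function are invented for illustration):

	struct my_reclaimer {
		int last_level;
	};

	static struct my_reclaimer my_reclaimer;

	static void my_vmpressure_cb(void *data, int level)
	{
		struct my_reclaimer *r = data;

		/* 'level' follows enum vmpressure_levels (low/medium/critical) */
		r->last_level = level;
		my_trim_caches();	/* invented reclaim hook */
	}

	static int __init my_reclaimer_init(void)
	{
		/* a NULL css falls back to the root memcg's vmpressure */
		return vmpressure_register_kernel_event(NULL, my_vmpressure_cb,
							&my_reclaimer);
	}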


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [PATCH v14 16/18] vmpressure: in-kernel notifications
  2013-12-20 14:26     ` Luiz Capitulino
@ 2013-12-20 14:31       ` Glauber Costa
  -1 siblings, 0 replies; 64+ messages in thread
From: Glauber Costa @ 2013-12-20 14:31 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: Vladimir Davydov, dchinner, Michal Hocko, Johannes Weiner,
	Andrew Morton, LKML, linux-mm, cgroups, devel, Glauber Costa,
	John Stultz, Joonsoo Kim, Kamezawa Hiroyuki

> I have the exact problem described above for a project I'm working on
> and this solution seems to solve it well.
>
> However, I had a few issues while trying to use this interface. I'll
> comment on them below, but please take this more as advice seeking
> than patch review.
>
>> This patch extends that to also support in-kernel users. Events that
>> should be generated for in-kernel consumption will be marked as such,
>> and for those, we will call a registered function instead of triggering
>> an eventfd notification.
>>
>> Please note that due to my lack of understanding of each shrinker user,
>> I will stay away from converting the actual users, you are all welcome
>> to do so.
>>
>> Signed-off-by: Glauber Costa <glommer@openvz.org>
>> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
>> Acked-by: Anton Vorontsov <anton@enomsg.org>
>> Acked-by: Pekka Enberg <penberg@kernel.org>
>> Reviewed-by: Greg Thelen <gthelen@google.com>
>> Cc: Dave Chinner <dchinner@redhat.com>
>> Cc: John Stultz <john.stultz@linaro.org>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>> Cc: Michal Hocko <mhocko@suse.cz>
>> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> ---
>>  include/linux/vmpressure.h |    5 +++++
>>  mm/vmpressure.c            |   53 +++++++++++++++++++++++++++++++++++++++++---
>>  2 files changed, 55 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
>> index 3f3788d..9102e53 100644
>> --- a/include/linux/vmpressure.h
>> +++ b/include/linux/vmpressure.h
>> @@ -19,6 +19,9 @@ struct vmpressure {
>>       /* Have to grab the lock on events traversal or modifications. */
>>       struct mutex events_lock;
>>
>> +     /* False if only kernel users want to be notified, true otherwise. */
>> +     bool notify_userspace;
>> +
>>       struct work_struct work;
>>  };
>>
>> @@ -38,6 +41,8 @@ extern int vmpressure_register_event(struct cgroup_subsys_state *css,
>>                                    struct cftype *cft,
>>                                    struct eventfd_ctx *eventfd,
>>                                    const char *args);
>> +extern int vmpressure_register_kernel_event(struct cgroup_subsys_state *css,
>> +                                         void (*fn)(void));
>>  extern void vmpressure_unregister_event(struct cgroup_subsys_state *css,
>>                                       struct cftype *cft,
>>                                       struct eventfd_ctx *eventfd);
>> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
>> index e0f6283..730e7c1 100644
>> --- a/mm/vmpressure.c
>> +++ b/mm/vmpressure.c
>> @@ -130,8 +130,12 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
>>  }
>>
>>  struct vmpressure_event {
>> -     struct eventfd_ctx *efd;
>> +     union {
>> +             struct eventfd_ctx *efd;
>> +             void (*fn)(void);
>
> How does the callback access its private data?
>
>> +     };
>>       enum vmpressure_levels level;
>> +     bool kernel_event;
>>       struct list_head node;
>>  };
>>
>> @@ -147,12 +151,15 @@ static bool vmpressure_event(struct vmpressure *vmpr,
>>       mutex_lock(&vmpr->events_lock);
>>
>>       list_for_each_entry(ev, &vmpr->events, node) {
>> -             if (level >= ev->level) {
>> +             if (ev->kernel_event) {
>> +                     ev->fn();
>
> I think it would be interesting to pass 'level' to the callback (I'll
> probably use it myself), but we could wait for a in-tree user before
> adding it.
>
>> +             } else if (vmpr->notify_userspace && level >= ev->level) {
>>                       eventfd_signal(ev->efd, 1);
>>                       signalled = true;
>>               }
>>       }
>>
>> +     vmpr->notify_userspace = false;
>>       mutex_unlock(&vmpr->events_lock);
>>
>>       return signalled;
>> @@ -222,7 +229,7 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
>>        * we account it too.
>>        */
>>       if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
>> -             return;
>> +             goto schedule;
>>
>>       /*
>>        * If we got here with no pages scanned, then that is an indicator
>> @@ -239,8 +246,15 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
>>       vmpr->scanned += scanned;
>>       vmpr->reclaimed += reclaimed;
>>       scanned = vmpr->scanned;
>> +     /*
>> +      * If we didn't reach this point, only kernel events will be triggered.
>> +      * It is the job of the worker thread to clean this up once the
>> +      * notifications are all delivered.
>> +      */
>> +     vmpr->notify_userspace = true;
>>       spin_unlock(&vmpr->sr_lock);
>>
>> +schedule:
>>       if (scanned < vmpressure_win)
>>               return;
>>       schedule_work(&vmpr->work);
>> @@ -324,6 +338,39 @@ int vmpressure_register_event(struct cgroup_subsys_state *css,
>>  }
>>
>>  /**
>> + * vmpressure_register_kernel_event() - Register kernel-side notification
>> + * @css:     css that is interested in vmpressure notifications
>> + * @fn:              function to be called when pressure happens
>> + *
>> + * This function register in-kernel users interested in receiving notifications
>> + * about pressure conditions. Pressure notifications will be triggered at the
>> + * same time as userspace notifications (with no particular ordering relative
>> + * to it).
>> + *
>> + * Pressure notifications are a alternative method to shrinkers and will serve
>> + * well users that are interested in a one-shot notification, with a
>> + * well-defined cgroup aware interface.
>> + */
>> +int vmpressure_register_kernel_event(struct cgroup_subsys_state *css,
>> +                                   void (*fn)(void))
>> +{
>> +     struct vmpressure *vmpr = css_to_vmpressure(css);
>
> This doesn't allow for css=NULL. What's the recommended way for a today's
> shrinker (which is not related to cgroups) to register with this API?
>
> Also, you don't seem to provide a way to de-register from the event.
>

The answer to all of your questions above can be summarized by noting
that, for lack of other users (at the time), this patch does the bare minimum
for memcg's needs. I agree, for instance, that it would be good to pass the
level, but since memcg won't do anything with that, I didn't pass it.

That should be extended if you need to.
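
To give an idea of that bare minimum: the memcg-side usage boils down to
something like the sketch below (the callback name and init hook are
illustrative, not the actual code from the later patches):

	/* somewhere in mm/memcontrol.c */
	static void memcg_pressure_handler(void)
	{
		/* reap caches left behind by dead memcgs */
	}

	static int __init memcg_pressure_init(void)
	{
		return vmpressure_register_kernel_event(&root_mem_cgroup->css,
							memcg_pressure_handler);
	}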

> I hacked a patch to be able to use this, seems to work but it's a ugly
> hack:
>
> ---
>  include/linux/vmpressure.h |  3 ++-
>  mm/vmpressure.c            | 13 +++++++++----
>  2 files changed, 11 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
> index 9102e53..de416b6 100644
> --- a/include/linux/vmpressure.h
> +++ b/include/linux/vmpressure.h
> @@ -42,7 +42,8 @@ extern int vmpressure_register_event(struct cgroup_subsys_state *css,
>                                      struct eventfd_ctx *eventfd,
>                                      const char *args);
>  extern int vmpressure_register_kernel_event(struct cgroup_subsys_state *css,
> -                                           void (*fn)(void));
> +                                           void (*fn)(void *data, int level),
> +                                           void *data);
>  extern void vmpressure_unregister_event(struct cgroup_subsys_state *css,
>                                         struct cftype *cft,
>                                         struct eventfd_ctx *eventfd);
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> index 730e7c1..4ed0e85 100644
> --- a/mm/vmpressure.c
> +++ b/mm/vmpressure.c
> @@ -132,9 +132,10 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
>  struct vmpressure_event {
>         union {
>                 struct eventfd_ctx *efd;
> -               void (*fn)(void);
> +               void (*fn)(void *data, int level);
>         };
>         enum vmpressure_levels level;
> +       void *data;
>         bool kernel_event;
>         struct list_head node;
>  };
> @@ -152,7 +153,7 @@ static bool vmpressure_event(struct vmpressure *vmpr,
>
>         list_for_each_entry(ev, &vmpr->events, node) {
>                 if (ev->kernel_event) {
> -                       ev->fn();
> +                       ev->fn(ev->data, level);
>                 } else if (vmpr->notify_userspace && level >= ev->level) {
>                         eventfd_signal(ev->efd, 1);
>                         signalled = true;
> @@ -352,21 +353,25 @@ int vmpressure_register_event(struct cgroup_subsys_state *css,
>   * well-defined cgroup aware interface.
>   */
>  int vmpressure_register_kernel_event(struct cgroup_subsys_state *css,
> -                                     void (*fn)(void))
> +                                    void (*fn)(void *data, int level), void *data)
>  {
> -       struct vmpressure *vmpr = css_to_vmpressure(css);
> +       struct vmpressure *vmpr;
>         struct vmpressure_event *ev;
>
> +       vmpr = css ? css_to_vmpressure(css) : memcg_to_vmpressure(NULL);
> +
>         ev = kzalloc(sizeof(*ev), GFP_KERNEL);
>         if (!ev)
>                 return -ENOMEM;
>
>         ev->kernel_event = true;
> +       ev->data = data;
>         ev->fn = fn;
>
>         mutex_lock(&vmpr->events_lock);
>         list_add(&ev->node, &vmpr->events);
>         mutex_unlock(&vmpr->events_lock);
> +

Your patch makes sense.



-- 
E Mare, Libertas

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v14 16/18] vmpressure: in-kernel notifications
  2013-12-20 14:31       ` Glauber Costa
@ 2013-12-20 14:32         ` Glauber Costa
  -1 siblings, 0 replies; 64+ messages in thread
From: Glauber Costa @ 2013-12-20 14:32 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: Vladimir Davydov, dchinner, Michal Hocko, Johannes Weiner,
	Andrew Morton, LKML, linux-mm, cgroups, devel, Glauber Costa,
	John Stultz, Joonsoo Kim, Kamezawa Hiroyuki

One correction:

>>  int vmpressure_register_kernel_event(struct cgroup_subsys_state *css,
>> -                                     void (*fn)(void))
>> +                                    void (*fn)(void *data, int level), void *data)
>>  {
>> -       struct vmpressure *vmpr = css_to_vmpressure(css);
>> +       struct vmpressure *vmpr;
>>         struct vmpressure_event *ev;
>>
>> +       vmpr = css ? css_to_vmpressure(css) : memcg_to_vmpressure(NULL);
>> +

This looks like it could be improved. Better not to have that
memcg-specific thing here.

Other than that, it makes sense.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v14 16/18] vmpressure: in-kernel notifications
  2013-12-20 14:31       ` Glauber Costa
@ 2013-12-20 14:36         ` Vladimir Davydov
  -1 siblings, 0 replies; 64+ messages in thread
From: Vladimir Davydov @ 2013-12-20 14:36 UTC (permalink / raw)
  To: Glauber Costa, Luiz Capitulino
  Cc: dchinner, Michal Hocko, Johannes Weiner, Andrew Morton, LKML,
	linux-mm, cgroups, devel, Glauber Costa, John Stultz,
	Joonsoo Kim, Kamezawa Hiroyuki

On 12/20/2013 06:31 PM, Glauber Costa wrote:
>> I have the exact problem described above for a project I'm working on
>> and this solution seems to solve it well.
>>
>> However, I had a few issues while trying to use this interface. I'll
>> comment on them below, but please take this more as advice seeking
>> than patch review.
>>
>>> This patch extends that to also support in-kernel users. Events that
>>> should be generated for in-kernel consumption will be marked as such,
>>> and for those, we will call a registered function instead of triggering
>>> an eventfd notification.
>>>
>>> Please note that due to my lack of understanding of each shrinker user,
>>> I will stay away from converting the actual users, you are all welcome
>>> to do so.
>>>
>>> Signed-off-by: Glauber Costa <glommer@openvz.org>
>>> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
>>> Acked-by: Anton Vorontsov <anton@enomsg.org>
>>> Acked-by: Pekka Enberg <penberg@kernel.org>
>>> Reviewed-by: Greg Thelen <gthelen@google.com>
>>> Cc: Dave Chinner <dchinner@redhat.com>
>>> Cc: John Stultz <john.stultz@linaro.org>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>>> Cc: Michal Hocko <mhocko@suse.cz>
>>> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>>> ---
>>>  include/linux/vmpressure.h |    5 +++++
>>>  mm/vmpressure.c            |   53 +++++++++++++++++++++++++++++++++++++++++---
>>>  2 files changed, 55 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
>>> index 3f3788d..9102e53 100644
>>> --- a/include/linux/vmpressure.h
>>> +++ b/include/linux/vmpressure.h
>>> @@ -19,6 +19,9 @@ struct vmpressure {
>>>       /* Have to grab the lock on events traversal or modifications. */
>>>       struct mutex events_lock;
>>>
>>> +     /* False if only kernel users want to be notified, true otherwise. */
>>> +     bool notify_userspace;
>>> +
>>>       struct work_struct work;
>>>  };
>>>
>>> @@ -38,6 +41,8 @@ extern int vmpressure_register_event(struct cgroup_subsys_state *css,
>>>                                    struct cftype *cft,
>>>                                    struct eventfd_ctx *eventfd,
>>>                                    const char *args);
>>> +extern int vmpressure_register_kernel_event(struct cgroup_subsys_state *css,
>>> +                                         void (*fn)(void));
>>>  extern void vmpressure_unregister_event(struct cgroup_subsys_state *css,
>>>                                       struct cftype *cft,
>>>                                       struct eventfd_ctx *eventfd);
>>> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
>>> index e0f6283..730e7c1 100644
>>> --- a/mm/vmpressure.c
>>> +++ b/mm/vmpressure.c
>>> @@ -130,8 +130,12 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
>>>  }
>>>
>>>  struct vmpressure_event {
>>> -     struct eventfd_ctx *efd;
>>> +     union {
>>> +             struct eventfd_ctx *efd;
>>> +             void (*fn)(void);
>> How does the callback access its private data?
>>
>>> +     };
>>>       enum vmpressure_levels level;
>>> +     bool kernel_event;
>>>       struct list_head node;
>>>  };
>>>
>>> @@ -147,12 +151,15 @@ static bool vmpressure_event(struct vmpressure *vmpr,
>>>       mutex_lock(&vmpr->events_lock);
>>>
>>>       list_for_each_entry(ev, &vmpr->events, node) {
>>> -             if (level >= ev->level) {
>>> +             if (ev->kernel_event) {
>>> +                     ev->fn();
>> I think it would be interesting to pass 'level' to the callback (I'll
>> probably use it myself), but we could wait for a in-tree user before
>> adding it.
>>
>>> +             } else if (vmpr->notify_userspace && level >= ev->level) {
>>>                       eventfd_signal(ev->efd, 1);
>>>                       signalled = true;
>>>               }
>>>       }
>>>
>>> +     vmpr->notify_userspace = false;
>>>       mutex_unlock(&vmpr->events_lock);
>>>
>>>       return signalled;
>>> @@ -222,7 +229,7 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
>>>        * we account it too.
>>>        */
>>>       if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
>>> -             return;
>>> +             goto schedule;
>>>
>>>       /*
>>>        * If we got here with no pages scanned, then that is an indicator
>>> @@ -239,8 +246,15 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
>>>       vmpr->scanned += scanned;
>>>       vmpr->reclaimed += reclaimed;
>>>       scanned = vmpr->scanned;
>>> +     /*
>>> +      * If we didn't reach this point, only kernel events will be triggered.
>>> +      * It is the job of the worker thread to clean this up once the
>>> +      * notifications are all delivered.
>>> +      */
>>> +     vmpr->notify_userspace = true;
>>>       spin_unlock(&vmpr->sr_lock);
>>>
>>> +schedule:
>>>       if (scanned < vmpressure_win)
>>>               return;
>>>       schedule_work(&vmpr->work);
>>> @@ -324,6 +338,39 @@ int vmpressure_register_event(struct cgroup_subsys_state *css,
>>>  }
>>>
>>>  /**
>>> + * vmpressure_register_kernel_event() - Register kernel-side notification
>>> + * @css:     css that is interested in vmpressure notifications
>>> + * @fn:              function to be called when pressure happens
>>> + *
>>> + * This function register in-kernel users interested in receiving notifications
>>> + * about pressure conditions. Pressure notifications will be triggered at the
>>> + * same time as userspace notifications (with no particular ordering relative
>>> + * to it).
>>> + *
>>> + * Pressure notifications are a alternative method to shrinkers and will serve
>>> + * well users that are interested in a one-shot notification, with a
>>> + * well-defined cgroup aware interface.
>>> + */
>>> +int vmpressure_register_kernel_event(struct cgroup_subsys_state *css,
>>> +                                   void (*fn)(void))
>>> +{
>>> +     struct vmpressure *vmpr = css_to_vmpressure(css);
>> This doesn't allow for css=NULL. What's the recommended way for a today's
>> shrinker (which is not related to cgroups) to register with this API?
>>
>> Also, you don't seem to provide a way to de-register from the event.
>>
> The answer for all of your questions above can be summarized by noting
> that for the lack of other users (at the time), this patch does the bare minimum
> for memcg needs. I agree, for instance, that it would be good to pass the level
> but since memcg won't do anything with thta, I didn't pass it.
>
> That should be extended if you need to.
>
>> I hacked a patch to be able to use this, seems to work but it's a ugly
>> hack:
>>
>> ---
>>  include/linux/vmpressure.h |  3 ++-
>>  mm/vmpressure.c            | 13 +++++++++----
>>  2 files changed, 11 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
>> index 9102e53..de416b6 100644
>> --- a/include/linux/vmpressure.h
>> +++ b/include/linux/vmpressure.h
>> @@ -42,7 +42,8 @@ extern int vmpressure_register_event(struct cgroup_subsys_state *css,
>>                                      struct eventfd_ctx *eventfd,
>>                                      const char *args);
>>  extern int vmpressure_register_kernel_event(struct cgroup_subsys_state *css,
>> -                                           void (*fn)(void));
>> +                                           void (*fn)(void *data, int level),
>> +                                           void *data);
>>  extern void vmpressure_unregister_event(struct cgroup_subsys_state *css,
>>                                         struct cftype *cft,
>>                                         struct eventfd_ctx *eventfd);
>> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
>> index 730e7c1..4ed0e85 100644
>> --- a/mm/vmpressure.c
>> +++ b/mm/vmpressure.c
>> @@ -132,9 +132,10 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
>>  struct vmpressure_event {
>>         union {
>>                 struct eventfd_ctx *efd;
>> -               void (*fn)(void);
>> +               void (*fn)(void *data, int level);
>>         };
>>         enum vmpressure_levels level;
>> +       void *data;
>>         bool kernel_event;
>>         struct list_head node;
>>  };
>> @@ -152,7 +153,7 @@ static bool vmpressure_event(struct vmpressure *vmpr,
>>
>>         list_for_each_entry(ev, &vmpr->events, node) {
>>                 if (ev->kernel_event) {
>> -                       ev->fn();
>> +                       ev->fn(ev->data, level);
>>                 } else if (vmpr->notify_userspace && level >= ev->level) {
>>                         eventfd_signal(ev->efd, 1);
>>                         signalled = true;
>> @@ -352,21 +353,25 @@ int vmpressure_register_event(struct cgroup_subsys_state *css,
>>   * well-defined cgroup aware interface.
>>   */
>>  int vmpressure_register_kernel_event(struct cgroup_subsys_state *css,
>> -                                     void (*fn)(void))
>> +                                    void (*fn)(void *data, int level), void *data)
>>  {
>> -       struct vmpressure *vmpr = css_to_vmpressure(css);
>> +       struct vmpressure *vmpr;
>>         struct vmpressure_event *ev;
>>
>> +       vmpr = css ? css_to_vmpressure(css) : memcg_to_vmpressure(NULL);
>> +
>>         ev = kzalloc(sizeof(*ev), GFP_KERNEL);
>>         if (!ev)
>>                 return -ENOMEM;
>>
>>         ev->kernel_event = true;
>> +       ev->data = data;
>>         ev->fn = fn;
>>
>>         mutex_lock(&vmpr->events_lock);
>>         list_add(&ev->node, &vmpr->events);
>>         mutex_unlock(&vmpr->events_lock);
>> +
> Your patch makes sense.

I guess I'll include this in the next iteration of this patch-set then.

Thanks.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v14 16/18] vmpressure: in-kernel notifications
@ 2013-12-20 14:36         ` Vladimir Davydov
  0 siblings, 0 replies; 64+ messages in thread
From: Vladimir Davydov @ 2013-12-20 14:36 UTC (permalink / raw)
  To: Glauber Costa, Luiz Capitulino
  Cc: dchinner, Michal Hocko, Johannes Weiner, Andrew Morton, LKML,
	linux-mm, cgroups, devel, Glauber Costa, John Stultz,
	Joonsoo Kim, Kamezawa Hiroyuki

On 12/20/2013 06:31 PM, Glauber Costa wrote:
>> I have the exact problem described above for a project I'm working on
>> and this solution seems to solve it well.
>>
>> However, I had a few issues while trying to use this interface. I'll
>> comment on them below, but please take this more as advice seeking
>> than patch review.
>>
>>> This patch extends that to also support in-kernel users. Events that
>>> should be generated for in-kernel consumption will be marked as such,
>>> and for those, we will call a registered function instead of triggering
>>> an eventfd notification.
>>>
>>> Please note that due to my lack of understanding of each shrinker user,
>>> I will stay away from converting the actual users, you are all welcome
>>> to do so.
>>>
>>> Signed-off-by: Glauber Costa <glommer@openvz.org>
>>> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
>>> Acked-by: Anton Vorontsov <anton@enomsg.org>
>>> Acked-by: Pekka Enberg <penberg@kernel.org>
>>> Reviewed-by: Greg Thelen <gthelen@google.com>
>>> Cc: Dave Chinner <dchinner@redhat.com>
>>> Cc: John Stultz <john.stultz@linaro.org>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>>> Cc: Michal Hocko <mhocko@suse.cz>
>>> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>>> ---
>>>  include/linux/vmpressure.h |    5 +++++
>>>  mm/vmpressure.c            |   53 +++++++++++++++++++++++++++++++++++++++++---
>>>  2 files changed, 55 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
>>> index 3f3788d..9102e53 100644
>>> --- a/include/linux/vmpressure.h
>>> +++ b/include/linux/vmpressure.h
>>> @@ -19,6 +19,9 @@ struct vmpressure {
>>>       /* Have to grab the lock on events traversal or modifications. */
>>>       struct mutex events_lock;
>>>
>>> +     /* False if only kernel users want to be notified, true otherwise. */
>>> +     bool notify_userspace;
>>> +
>>>       struct work_struct work;
>>>  };
>>>
>>> @@ -38,6 +41,8 @@ extern int vmpressure_register_event(struct cgroup_subsys_state *css,
>>>                                    struct cftype *cft,
>>>                                    struct eventfd_ctx *eventfd,
>>>                                    const char *args);
>>> +extern int vmpressure_register_kernel_event(struct cgroup_subsys_state *css,
>>> +                                         void (*fn)(void));
>>>  extern void vmpressure_unregister_event(struct cgroup_subsys_state *css,
>>>                                       struct cftype *cft,
>>>                                       struct eventfd_ctx *eventfd);
>>> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
>>> index e0f6283..730e7c1 100644
>>> --- a/mm/vmpressure.c
>>> +++ b/mm/vmpressure.c
>>> @@ -130,8 +130,12 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
>>>  }
>>>
>>>  struct vmpressure_event {
>>> -     struct eventfd_ctx *efd;
>>> +     union {
>>> +             struct eventfd_ctx *efd;
>>> +             void (*fn)(void);
>> How does the callback access its private data?
>>
>>> +     };
>>>       enum vmpressure_levels level;
>>> +     bool kernel_event;
>>>       struct list_head node;
>>>  };
>>>
>>> @@ -147,12 +151,15 @@ static bool vmpressure_event(struct vmpressure *vmpr,
>>>       mutex_lock(&vmpr->events_lock);
>>>
>>>       list_for_each_entry(ev, &vmpr->events, node) {
>>> -             if (level >= ev->level) {
>>> +             if (ev->kernel_event) {
>>> +                     ev->fn();
>> I think it would be interesting to pass 'level' to the callback (I'll
>> probably use it myself), but we could wait for a in-tree user before
>> adding it.
>>
>>> +             } else if (vmpr->notify_userspace && level >= ev->level) {
>>>                       eventfd_signal(ev->efd, 1);
>>>                       signalled = true;
>>>               }
>>>       }
>>>
>>> +     vmpr->notify_userspace = false;
>>>       mutex_unlock(&vmpr->events_lock);
>>>
>>>       return signalled;
>>> @@ -222,7 +229,7 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
>>>        * we account it too.
>>>        */
>>>       if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
>>> -             return;
>>> +             goto schedule;
>>>
>>>       /*
>>>        * If we got here with no pages scanned, then that is an indicator
>>> @@ -239,8 +246,15 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
>>>       vmpr->scanned += scanned;
>>>       vmpr->reclaimed += reclaimed;
>>>       scanned = vmpr->scanned;
>>> +     /*
>>> +      * If we didn't reach this point, only kernel events will be triggered.
>>> +      * It is the job of the worker thread to clean this up once the
>>> +      * notifications are all delivered.
>>> +      */
>>> +     vmpr->notify_userspace = true;
>>>       spin_unlock(&vmpr->sr_lock);
>>>
>>> +schedule:
>>>       if (scanned < vmpressure_win)
>>>               return;
>>>       schedule_work(&vmpr->work);
>>> @@ -324,6 +338,39 @@ int vmpressure_register_event(struct cgroup_subsys_state *css,
>>>  }
>>>
>>>  /**
>>> + * vmpressure_register_kernel_event() - Register kernel-side notification
>>> + * @css:     css that is interested in vmpressure notifications
>>> + * @fn:              function to be called when pressure happens
>>> + *
>>> + * This function register in-kernel users interested in receiving notifications
>>> + * about pressure conditions. Pressure notifications will be triggered at the
>>> + * same time as userspace notifications (with no particular ordering relative
>>> + * to it).
>>> + *
>>> + * Pressure notifications are a alternative method to shrinkers and will serve
>>> + * well users that are interested in a one-shot notification, with a
>>> + * well-defined cgroup aware interface.
>>> + */
>>> +int vmpressure_register_kernel_event(struct cgroup_subsys_state *css,
>>> +                                   void (*fn)(void))
>>> +{
>>> +     struct vmpressure *vmpr = css_to_vmpressure(css);
>> This doesn't allow for css=NULL. What's the recommended way for a today's
>> shrinker (which is not related to cgroups) to register with this API?
>>
>> Also, you don't seem to provide a way to de-register from the event.
>>
> The answer for all of your questions above can be summarized by noting
> that for the lack of other users (at the time), this patch does the bare minimum
> for memcg needs. I agree, for instance, that it would be good to pass the level
> but since memcg won't do anything with thta, I didn't pass it.
>
> That should be extended if you need to.
>
>> I hacked a patch to be able to use this, seems to work but it's a ugly
>> hack:
>>
>> ---
>>  include/linux/vmpressure.h |  3 ++-
>>  mm/vmpressure.c            | 13 +++++++++----
>>  2 files changed, 11 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
>> index 9102e53..de416b6 100644
>> --- a/include/linux/vmpressure.h
>> +++ b/include/linux/vmpressure.h
>> @@ -42,7 +42,8 @@ extern int vmpressure_register_event(struct cgroup_subsys_state *css,
>>                                      struct eventfd_ctx *eventfd,
>>                                      const char *args);
>>  extern int vmpressure_register_kernel_event(struct cgroup_subsys_state *css,
>> -                                           void (*fn)(void));
>> +                                           void (*fn)(void *data, int level),
>> +                                           void *data);
>>  extern void vmpressure_unregister_event(struct cgroup_subsys_state *css,
>>                                         struct cftype *cft,
>>                                         struct eventfd_ctx *eventfd);
>> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
>> index 730e7c1..4ed0e85 100644
>> --- a/mm/vmpressure.c
>> +++ b/mm/vmpressure.c
>> @@ -132,9 +132,10 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
>>  struct vmpressure_event {
>>         union {
>>                 struct eventfd_ctx *efd;
>> -               void (*fn)(void);
>> +               void (*fn)(void *data, int level);
>>         };
>>         enum vmpressure_levels level;
>> +       void *data;
>>         bool kernel_event;
>>         struct list_head node;
>>  };
>> @@ -152,7 +153,7 @@ static bool vmpressure_event(struct vmpressure *vmpr,
>>
>>         list_for_each_entry(ev, &vmpr->events, node) {
>>                 if (ev->kernel_event) {
>> -                       ev->fn();
>> +                       ev->fn(ev->data, level);
>>                 } else if (vmpr->notify_userspace && level >= ev->level) {
>>                         eventfd_signal(ev->efd, 1);
>>                         signalled = true;
>> @@ -352,21 +353,25 @@ int vmpressure_register_event(struct cgroup_subsys_state *css,
>>   * well-defined cgroup aware interface.
>>   */
>>  int vmpressure_register_kernel_event(struct cgroup_subsys_state *css,
>> -                                     void (*fn)(void))
>> +                                    void (*fn)(void *data, int level), void *data)
>>  {
>> -       struct vmpressure *vmpr = css_to_vmpressure(css);
>> +       struct vmpressure *vmpr;
>>         struct vmpressure_event *ev;
>>
>> +       vmpr = css ? css_to_vmpressure(css) : memcg_to_vmpressure(NULL);
>> +
>>         ev = kzalloc(sizeof(*ev), GFP_KERNEL);
>>         if (!ev)
>>                 return -ENOMEM;
>>
>>         ev->kernel_event = true;
>> +       ev->data = data;
>>         ev->fn = fn;
>>
>>         mutex_lock(&vmpr->events_lock);
>>         list_add(&ev->node, &vmpr->events);
>>         mutex_unlock(&vmpr->events_lock);
>> +
> Your patch makes sense.

I guess I'll include this in the next iteration of this patch-set, then.

Thanks.
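
For what it's worth, here is a minimal usage sketch of the interface once it is
extended as proposed above. It assumes the fn(void *data, int level) signature
and the NULL-css (global pressure) convention from the hack; the my_cache_*
names are made up for illustration:

#include <linux/printk.h>
#include <linux/vmpressure.h>

struct my_cache {
        unsigned long nr_objects;
        /* ... whatever state should shrink under memory pressure ... */
};

/* Called from the vmpressure work item (process context). */
static void my_cache_pressure(void *data, int level)
{
        struct my_cache *cache = data;

        /* Higher level means more severe pressure; a real user would trim here. */
        pr_debug("my_cache: vmpressure level %d, %lu objects cached\n",
                 level, cache->nr_objects);
}

static int my_cache_init(struct my_cache *cache)
{
        cache->nr_objects = 0;

        /* NULL css maps to the root memcg, i.e. global pressure, per the hack. */
        return vmpressure_register_kernel_event(NULL, my_cache_pressure, cache);
}

Note that unregistration is still missing, as pointed out above, so such a user
could not be unloaded.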

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v14 16/18] vmpressure: in-kernel notifications
  2013-12-20 14:31       ` Glauber Costa
@ 2013-12-20 15:03         ` Luiz Capitulino
  -1 siblings, 0 replies; 64+ messages in thread
From: Luiz Capitulino @ 2013-12-20 15:03 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Vladimir Davydov, dchinner, Michal Hocko, Johannes Weiner,
	Andrew Morton, LKML, linux-mm, cgroups, devel, Glauber Costa,
	John Stultz, Joonsoo Kim, Kamezawa Hiroyuki

On Fri, 20 Dec 2013 18:31:02 +0400
Glauber Costa <glommer@gmail.com> wrote:

> > I have the exact problem described above for a project I'm working on
> > and this solution seems to solve it well.
> >
> > However, I had a few issues while trying to use this interface. I'll
> > comment on them below, but please take this more as advice seeking
> > than patch review.
> >
> >> This patch extends that to also support in-kernel users. Events that
> >> should be generated for in-kernel consumption will be marked as such,
> >> and for those, we will call a registered function instead of triggering
> >> an eventfd notification.
> >>
> >> Please note that due to my lack of understanding of each shrinker user,
> >> I will stay away from converting the actual users; you are all welcome
> >> to do so.
> >>
> >> Signed-off-by: Glauber Costa <glommer@openvz.org>
> >> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
> >> Acked-by: Anton Vorontsov <anton@enomsg.org>
> >> Acked-by: Pekka Enberg <penberg@kernel.org>
> >> Reviewed-by: Greg Thelen <gthelen@google.com>
> >> Cc: Dave Chinner <dchinner@redhat.com>
> >> Cc: John Stultz <john.stultz@linaro.org>
> >> Cc: Andrew Morton <akpm@linux-foundation.org>
> >> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> >> Cc: Michal Hocko <mhocko@suse.cz>
> >> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >> Cc: Johannes Weiner <hannes@cmpxchg.org>
> >> ---
> >>  include/linux/vmpressure.h |    5 +++++
> >>  mm/vmpressure.c            |   53 +++++++++++++++++++++++++++++++++++++++++---
> >>  2 files changed, 55 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
> >> index 3f3788d..9102e53 100644
> >> --- a/include/linux/vmpressure.h
> >> +++ b/include/linux/vmpressure.h
> >> @@ -19,6 +19,9 @@ struct vmpressure {
> >>       /* Have to grab the lock on events traversal or modifications. */
> >>       struct mutex events_lock;
> >>
> >> +     /* False if only kernel users want to be notified, true otherwise. */
> >> +     bool notify_userspace;
> >> +
> >>       struct work_struct work;
> >>  };
> >>
> >> @@ -38,6 +41,8 @@ extern int vmpressure_register_event(struct cgroup_subsys_state *css,
> >>                                    struct cftype *cft,
> >>                                    struct eventfd_ctx *eventfd,
> >>                                    const char *args);
> >> +extern int vmpressure_register_kernel_event(struct cgroup_subsys_state *css,
> >> +                                         void (*fn)(void));
> >>  extern void vmpressure_unregister_event(struct cgroup_subsys_state *css,
> >>                                       struct cftype *cft,
> >>                                       struct eventfd_ctx *eventfd);
> >> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> >> index e0f6283..730e7c1 100644
> >> --- a/mm/vmpressure.c
> >> +++ b/mm/vmpressure.c
> >> @@ -130,8 +130,12 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
> >>  }
> >>
> >>  struct vmpressure_event {
> >> -     struct eventfd_ctx *efd;
> >> +     union {
> >> +             struct eventfd_ctx *efd;
> >> +             void (*fn)(void);
> >
> > How does the callback access its private data?
> >
> >> +     };
> >>       enum vmpressure_levels level;
> >> +     bool kernel_event;
> >>       struct list_head node;
> >>  };
> >>
> >> @@ -147,12 +151,15 @@ static bool vmpressure_event(struct vmpressure *vmpr,
> >>       mutex_lock(&vmpr->events_lock);
> >>
> >>       list_for_each_entry(ev, &vmpr->events, node) {
> >> -             if (level >= ev->level) {
> >> +             if (ev->kernel_event) {
> >> +                     ev->fn();
> >
> > I think it would be interesting to pass 'level' to the callback (I'll
> > probably use it myself), but we could wait for an in-tree user before
> > adding it.
> >
> >> +             } else if (vmpr->notify_userspace && level >= ev->level) {
> >>                       eventfd_signal(ev->efd, 1);
> >>                       signalled = true;
> >>               }
> >>       }
> >>
> >> +     vmpr->notify_userspace = false;
> >>       mutex_unlock(&vmpr->events_lock);
> >>
> >>       return signalled;
> >> @@ -222,7 +229,7 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
> >>        * we account it too.
> >>        */
> >>       if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
> >> -             return;
> >> +             goto schedule;
> >>
> >>       /*
> >>        * If we got here with no pages scanned, then that is an indicator
> >> @@ -239,8 +246,15 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
> >>       vmpr->scanned += scanned;
> >>       vmpr->reclaimed += reclaimed;
> >>       scanned = vmpr->scanned;
> >> +     /*
> >> +      * If we did not reach this point (we jumped to the schedule label),
> >> +      * only kernel events will be triggered. It is the job of the worker
> >> +      * thread to clear this flag once the notifications are all delivered.
> >> +      */
> >> +     vmpr->notify_userspace = true;
> >>       spin_unlock(&vmpr->sr_lock);
> >>
> >> +schedule:
> >>       if (scanned < vmpressure_win)
> >>               return;
> >>       schedule_work(&vmpr->work);
> >> @@ -324,6 +338,39 @@ int vmpressure_register_event(struct cgroup_subsys_state *css,
> >>  }
> >>
> >>  /**
> >> + * vmpressure_register_kernel_event() - Register kernel-side notification
> >> + * @css:     css that is interested in vmpressure notifications
> >> + * @fn:              function to be called when pressure happens
> >> + *
> >> + * This function registers in-kernel users interested in receiving notifications
> >> + * about pressure conditions. Pressure notifications will be triggered at the
> >> + * same time as userspace notifications (with no particular ordering relative
> >> + * to it).
> >> + *
> >> + * Pressure notifications are an alternative method to shrinkers and are a
> >> + * good fit for users that are interested in a one-shot notification, with a
> >> + * well-defined cgroup aware interface.
> >> + */
> >> +int vmpressure_register_kernel_event(struct cgroup_subsys_state *css,
> >> +                                   void (*fn)(void))
> >> +{
> >> +     struct vmpressure *vmpr = css_to_vmpressure(css);
> >
> > This doesn't allow for css=NULL. What's the recommended way for one of
> > today's shrinkers (which are not related to cgroups) to register with this API?
> >
> > Also, you don't seem to provide a way to de-register from the event.
> >
> 
> The answer to all of your questions above can be summarized by noting
> that, for lack of other users (at the time), this patch does the bare minimum
> for memcg needs. I agree, for instance, that it would be good to pass the level
> but since memcg won't do anything with that, I didn't pass it.
> 
> That should be extended if you need to.

That works for me. That is, including this minimal version first and
extending it when we get in-tree users.

> 
> > I hacked a patch to be able to use this; it seems to work, but it's an ugly
> > hack:
> >
> > ---
> >  include/linux/vmpressure.h |  3 ++-
> >  mm/vmpressure.c            | 13 +++++++++----
> >  2 files changed, 11 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
> > index 9102e53..de416b6 100644
> > --- a/include/linux/vmpressure.h
> > +++ b/include/linux/vmpressure.h
> > @@ -42,7 +42,8 @@ extern int vmpressure_register_event(struct cgroup_subsys_state *css,
> >                                      struct eventfd_ctx *eventfd,
> >                                      const char *args);
> >  extern int vmpressure_register_kernel_event(struct cgroup_subsys_state *css,
> > -                                           void (*fn)(void));
> > +                                           void (*fn)(void *data, int level),
> > +                                           void *data);
> >  extern void vmpressure_unregister_event(struct cgroup_subsys_state *css,
> >                                         struct cftype *cft,
> >                                         struct eventfd_ctx *eventfd);
> > diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> > index 730e7c1..4ed0e85 100644
> > --- a/mm/vmpressure.c
> > +++ b/mm/vmpressure.c
> > @@ -132,9 +132,10 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
> >  struct vmpressure_event {
> >         union {
> >                 struct eventfd_ctx *efd;
> > -               void (*fn)(void);
> > +               void (*fn)(void *data, int level);
> >         };
> >         enum vmpressure_levels level;
> > +       void *data;
> >         bool kernel_event;
> >         struct list_head node;
> >  };
> > @@ -152,7 +153,7 @@ static bool vmpressure_event(struct vmpressure *vmpr,
> >
> >         list_for_each_entry(ev, &vmpr->events, node) {
> >                 if (ev->kernel_event) {
> > -                       ev->fn();
> > +                       ev->fn(ev->data, level);
> >                 } else if (vmpr->notify_userspace && level >= ev->level) {
> >                         eventfd_signal(ev->efd, 1);
> >                         signalled = true;
> > @@ -352,21 +353,25 @@ int vmpressure_register_event(struct cgroup_subsys_state *css,
> >   * well-defined cgroup aware interface.
> >   */
> >  int vmpressure_register_kernel_event(struct cgroup_subsys_state *css,
> > -                                     void (*fn)(void))
> > +                                    void (*fn)(void *data, int level), void *data)
> >  {
> > -       struct vmpressure *vmpr = css_to_vmpressure(css);
> > +       struct vmpressure *vmpr;
> >         struct vmpressure_event *ev;
> >
> > +       vmpr = css ? css_to_vmpressure(css) : memcg_to_vmpressure(NULL);
> > +
> >         ev = kzalloc(sizeof(*ev), GFP_KERNEL);
> >         if (!ev)
> >                 return -ENOMEM;
> >
> >         ev->kernel_event = true;
> > +       ev->data = data;
> >         ev->fn = fn;
> >
> >         mutex_lock(&vmpr->events_lock);
> >         list_add(&ev->node, &vmpr->events);
> >         mutex_unlock(&vmpr->events_lock);
> > +
> 
> Your patch makes sense.
> 
> 
> 


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v14 16/18] vmpressure: in-kernel notifications
  2013-12-20 15:03         ` Luiz Capitulino
@ 2013-12-20 16:44           ` Luiz Capitulino
  -1 siblings, 0 replies; 64+ messages in thread
From: Luiz Capitulino @ 2013-12-20 16:44 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: Glauber Costa, Vladimir Davydov, dchinner, Michal Hocko,
	Johannes Weiner, Andrew Morton, LKML, linux-mm, cgroups, devel,
	Glauber Costa, John Stultz, Joonsoo Kim, Kamezawa Hiroyuki

On Fri, 20 Dec 2013 10:03:32 -0500
Luiz Capitulino <lcapitulino@redhat.com> wrote:

> > The answer to all of your questions above can be summarized by noting
> > that, for lack of other users (at the time), this patch does the bare minimum
> > for memcg needs. I agree, for instance, that it would be good to pass the level
> > but since memcg won't do anything with that, I didn't pass it.
> > 
> > That should be extended if you need to.
> 
> That works for me. That is, including this minimal version first and
> extending it when we get in-tree users.

Btw, there's something I was thinking just right now. If/when we
convert shrink functions to use this API, they will come to depend
on CONFIG_MEMCG=y. IOW, they won't work if CONFIG_MEMCG=n.

Is this acceptable (this is an honest question)? Because today, they
do work when CONFIG_MEMCG=n. Should those shrink functions use the
shrinker API as a fallback?

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v14 16/18] vmpressure: in-kernel notifications
  2013-12-20 16:44           ` Luiz Capitulino
@ 2013-12-20 16:46             ` Glauber Costa
  -1 siblings, 0 replies; 64+ messages in thread
From: Glauber Costa @ 2013-12-20 16:46 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: Vladimir Davydov, dchinner, Michal Hocko, Johannes Weiner,
	Andrew Morton, LKML, linux-mm, cgroups, devel, Glauber Costa,
	John Stultz, Joonsoo Kim, Kamezawa Hiroyuki

On Fri, Dec 20, 2013 at 8:44 PM, Luiz Capitulino <lcapitulino@redhat.com> wrote:
> On Fri, 20 Dec 2013 10:03:32 -0500
> Luiz Capitulino <lcapitulino@redhat.com> wrote:
>
>> > The answer to all of your questions above can be summarized by noting
>> > that, for lack of other users (at the time), this patch does the bare minimum
>> > for memcg needs. I agree, for instance, that it would be good to pass the level
>> > but since memcg won't do anything with that, I didn't pass it.
>> >
>> > That should be extended if you need to.
>>
>> That works for me. That is, including this minimal version first and
>> extending it when we get in-tree users.
>
> Btw, there's something I was thinking just right now. If/when we
> convert shrink functions to use this API, they will come to depend
> on CONFIG_MEMCG=y. IOW, they won't work if CONFIG_MEMCG=n.
>
> Is this acceptable (this is an honest question)? Because today, they
> do work when CONFIG_MEMCG=n. Should those shrink functions use the
> shrinker API as a fallback?

If you have a non-memcg user, that should obviously be available for
CONFIG_MEMCG=n
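
To illustrate the fallback being asked about here, a user could keep its
trimming logic common and only switch the notification mechanism on
CONFIG_MEMCG. This is just a sketch under the assumptions from earlier in the
thread: the my_cache_* names are hypothetical, the vmpressure call assumes the
extended (fn, data) registration interface, and the #else branch uses the
regular shrinker API.

#include <linux/kernel.h>
#include <linux/shrinker.h>
#include <linux/vmpressure.h>

struct my_cache {
        unsigned long nr_objects;
#ifndef CONFIG_MEMCG
        struct shrinker shrinker;
#endif
};

/* Hypothetical common helpers: report the cache size and free some objects. */
static unsigned long my_cache_nr(struct my_cache *cache)
{
        return cache->nr_objects;
}

static unsigned long my_cache_trim(struct my_cache *cache)
{
        /* A real implementation would free objects and return the count. */
        return 0;
}

#ifdef CONFIG_MEMCG
/* memcg available: use the (extended) vmpressure kernel-event interface. */
static void my_cache_pressure(void *data, int level)
{
        struct my_cache *cache = data;

        if (my_cache_nr(cache))
                my_cache_trim(cache);
}

static int my_cache_register_pressure(struct my_cache *cache)
{
        return vmpressure_register_kernel_event(NULL, my_cache_pressure, cache);
}
#else
/* No memcg: fall back to a plain shrinker. */
static unsigned long my_cache_count(struct shrinker *s, struct shrink_control *sc)
{
        return my_cache_nr(container_of(s, struct my_cache, shrinker));
}

static unsigned long my_cache_scan(struct shrinker *s, struct shrink_control *sc)
{
        return my_cache_trim(container_of(s, struct my_cache, shrinker));
}

static int my_cache_register_pressure(struct my_cache *cache)
{
        /* Assumes *cache came from kzalloc(), so the shrinker fields start at 0. */
        cache->shrinker.count_objects = my_cache_count;
        cache->shrinker.scan_objects = my_cache_scan;
        cache->shrinker.seeks = DEFAULT_SEEKS;
        return register_shrinker(&cache->shrinker);
}
#endif

Whether each user should carry such a fallback, or the vmpressure core should
instead be decoupled from memcg, is what the rest of the thread gets at.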

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v14 16/18] vmpressure: in-kernel notifications
  2013-12-20 16:46             ` Glauber Costa
@ 2013-12-20 16:53               ` Luiz Capitulino
  -1 siblings, 0 replies; 64+ messages in thread
From: Luiz Capitulino @ 2013-12-20 16:53 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Vladimir Davydov, dchinner, Michal Hocko, Johannes Weiner,
	Andrew Morton, LKML, linux-mm, cgroups, devel, Glauber Costa,
	John Stultz, Joonsoo Kim, Kamezawa Hiroyuki

On Fri, 20 Dec 2013 20:46:05 +0400
Glauber Costa <glommer@gmail.com> wrote:

> On Fri, Dec 20, 2013 at 8:44 PM, Luiz Capitulino <lcapitulino@redhat.com> wrote:
> > On Fri, 20 Dec 2013 10:03:32 -0500
> > Luiz Capitulino <lcapitulino@redhat.com> wrote:
> >
> >> > The answer to all of your questions above can be summarized by noting
> >> > that, for lack of other users (at the time), this patch does the bare minimum
> >> > for memcg needs. I agree, for instance, that it would be good to pass the level
> >> > but since memcg won't do anything with that, I didn't pass it.
> >> >
> >> > That should be extended if you need to.
> >>
> >> That works for me. That is, including this minimal version first and
> >> extending it when we get in-tree users.
> >
> > Btw, there's something I was thinking just right now. If/when we
> > convert shrink functions to use this API, they will come to depend
> > on CONFIG_MEMCG=y. IOW, they won't work if CONFIG_MEMCG=n.
> >
> > Is this acceptable (this is an honest question)? Because today, they
> > do work when CONFIG_MEMCG=n. Should those shrink functions use the
> > shrinker API as a fallback?
> 
> If you have a non-memcg user, that should obviously be available for
> CONFIG_MEMCG=n

OK, which means we'll have to change it, right? Because, if I'm not
missing something, today vmpressure does depend on CONFIG_MEMCG=y.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v14 16/18] vmpressure: in-kernel notifications
  2013-12-20 16:53               ` Luiz Capitulino
@ 2013-12-20 16:58                 ` Glauber Costa
  -1 siblings, 0 replies; 64+ messages in thread
From: Glauber Costa @ 2013-12-20 16:58 UTC (permalink / raw)
  To: Luiz Capitulino
  Cc: Vladimir Davydov, dchinner, Michal Hocko, Johannes Weiner,
	Andrew Morton, LKML, linux-mm, cgroups, devel, Glauber Costa,
	John Stultz, Joonsoo Kim, Kamezawa Hiroyuki

On Fri, Dec 20, 2013 at 8:53 PM, Luiz Capitulino <lcapitulino@redhat.com> wrote:
> On Fri, 20 Dec 2013 20:46:05 +0400
> Glauber Costa <glommer@gmail.com> wrote:
>
>> On Fri, Dec 20, 2013 at 8:44 PM, Luiz Capitulino <lcapitulino@redhat.com> wrote:
>> > On Fri, 20 Dec 2013 10:03:32 -0500
>> > Luiz Capitulino <lcapitulino@redhat.com> wrote:
>> >
>> >> > The answer to all of your questions above can be summarized by noting
>> >> > that, for lack of other users (at the time), this patch does the bare minimum
>> >> > for memcg needs. I agree, for instance, that it would be good to pass the level
>> >> > but since memcg won't do anything with that, I didn't pass it.
>> >> >
>> >> > That should be extended if you need to.
>> >>
>> >> That works for me. That is, including this minimal version first and
>> >> extending it when we get in-tree users.
>> >
>> > Btw, there's something I was thinking just right now. If/when we
>> > convert shrink functions to use this API, they will come to depend
>> > on CONFIG_MEMCG=y. IOW, they won't work if CONFIG_MEMCG=n.
>> >
>> > Is this acceptable (this is an honest question)? Because today, they
>> > do work when CONFIG_MEMCG=n. Should those shrink functions use the
>> > shrinker API as a fallback?
>>
>> If you have a non-memcg user, that should obviously be available for
>> CONFIG_MEMCG=n
>
> OK, which means we'll have to change it, right? Because, if I'm not
> missing something, today vmpressure does depend on CONFIG_MEMCG=y.

You mean the main vmpressure mechanism?
Sorry, this was out of my mental cachelines. Yes, vmpressure depends on
MEMCG, because the pressure interface is memcg-specific (global == root memcg).

You might want to change that so you can reuse the mechanism and let only
the user interface depend on memcg.


-- 
E Mare, Libertas

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v14 16/18] vmpressure: in-kernel notifications
  2013-12-20 16:58                 ` Glauber Costa
@ 2013-12-20 17:00                   ` Luiz Capitulino
  -1 siblings, 0 replies; 64+ messages in thread
From: Luiz Capitulino @ 2013-12-20 17:00 UTC (permalink / raw)
  To: Glauber Costa
  Cc: Vladimir Davydov, dchinner, Michal Hocko, Johannes Weiner,
	Andrew Morton, LKML, linux-mm, cgroups, devel, Glauber Costa,
	John Stultz, Joonsoo Kim, Kamezawa Hiroyuki

On Fri, 20 Dec 2013 20:58:52 +0400
Glauber Costa <glommer@gmail.com> wrote:

> On Fri, Dec 20, 2013 at 8:53 PM, Luiz Capitulino <lcapitulino@redhat.com> wrote:
> > On Fri, 20 Dec 2013 20:46:05 +0400
> > Glauber Costa <glommer@gmail.com> wrote:
> >
> >> On Fri, Dec 20, 2013 at 8:44 PM, Luiz Capitulino <lcapitulino@redhat.com> wrote:
> >> > On Fri, 20 Dec 2013 10:03:32 -0500
> >> > Luiz Capitulino <lcapitulino@redhat.com> wrote:
> >> >
> >> >> > The answer to all of your questions above can be summarized by noting
> >> >> > that, for lack of other users (at the time), this patch does the bare minimum
> >> >> > for memcg needs. I agree, for instance, that it would be good to pass the level
> >> >> > but since memcg won't do anything with that, I didn't pass it.
> >> >> >
> >> >> > That should be extended if you need to.
> >> >>
> >> >> That works for me. That is, including this minimal version first and
> >> >> extending it when we get in-tree users.
> >> >
> >> > Btw, there's something I was thinking just right now. If/when we
> >> > convert shrink functions to use this API, they will come to depend
> >> > on CONFIG_MEMCG=y. IOW, they won't work if CONFIG_MEMCG=n.
> >> >
> >> > Is this acceptable (this is an honest question)? Because today, they
> >> > do work when CONFIG_MEMCG=n. Should those shrink functions use the
> >> > shrinker API as a fallback?
> >>
> >> If you have a non-memcg user, that should obviously be available for
> >> CONFIG_MEMCG=n
> >
> > OK, which means we'll have to change it, right? Because, if I'm not
> > missing something, today vmpressure does depend on CONFIG_MEMCG=y.
> 
> You mean the main vmpressure mechanism?
> Sorry, this was out of my mental cachelines. Yes, vmpressure depends on
> MEMCG, because the pressure interface is memcg-specific (global == root memcg).
>
> You might want to change that so you can reuse the mechanism and let only
> the user interface depend on memcg.

OK, that makes sense. Thanks Glauber.

^ permalink raw reply	[flat|nested] 64+ messages in thread

end of thread, other threads:[~2013-12-20 17:01 UTC | newest]

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-12-16 12:16 [PATCH v14 00/18] kmemcg shrinkers Vladimir Davydov
2013-12-16 12:16 ` Vladimir Davydov
2013-12-16 12:16 ` Vladimir Davydov
2013-12-16 12:16 ` [PATCH v14 01/18] memcg: make cache index determination more robust Vladimir Davydov
2013-12-16 12:16   ` Vladimir Davydov
2013-12-16 12:16 ` [PATCH v14 02/18] memcg: consolidate callers of memcg_cache_id Vladimir Davydov
2013-12-16 12:16   ` Vladimir Davydov
2013-12-16 12:16 ` [PATCH v14 03/18] memcg: move initialization to memcg creation Vladimir Davydov
2013-12-16 12:16   ` Vladimir Davydov
2013-12-16 12:16 ` [PATCH v14 04/18] memcg: make for_each_mem_cgroup macros public Vladimir Davydov
2013-12-16 12:16   ` Vladimir Davydov
2013-12-16 12:16   ` Vladimir Davydov
2013-12-16 12:16 ` [PATCH v14 05/18] memcg: remove KMEM_ACCOUNTED_ACTIVATED flag Vladimir Davydov
2013-12-16 12:16   ` Vladimir Davydov
2013-12-16 12:16 ` [PATCH v14 06/18] memcg: rework memcg_update_kmem_limit synchronization Vladimir Davydov
2013-12-16 12:16   ` Vladimir Davydov
2013-12-16 12:16 ` [PATCH v14 07/18] list_lru, shrinkers: introduce list_lru_shrink_{count,walk} Vladimir Davydov
2013-12-16 12:16   ` Vladimir Davydov
2013-12-16 12:16 ` [PATCH v14 08/18] fs: consolidate {nr,free}_cached_objects args in shrink_control Vladimir Davydov
2013-12-16 12:16   ` Vladimir Davydov
2013-12-16 12:16 ` [PATCH v14 09/18] vmscan: move call to shrink_slab() to shrink_zones() Vladimir Davydov
2013-12-16 12:16   ` Vladimir Davydov
2013-12-16 12:16 ` [PATCH v14 10/18] vmscan: remove shrink_control arg from do_try_to_free_pages() Vladimir Davydov
2013-12-16 12:16   ` Vladimir Davydov
2013-12-16 12:17 ` [PATCH v14 11/18] vmscan: call NUMA-unaware shrinkers irrespective of nodemask Vladimir Davydov
2013-12-16 12:17   ` Vladimir Davydov
2013-12-16 12:17 ` [PATCH v14 12/18] vmscan: shrink slab on memcg pressure Vladimir Davydov
2013-12-16 12:17   ` Vladimir Davydov
2013-12-16 12:17   ` Vladimir Davydov
2013-12-16 12:17 ` [PATCH v14 13/18] vmscan: take at least one pass with shrinkers Vladimir Davydov
2013-12-16 12:17   ` Vladimir Davydov
2013-12-16 12:17 ` [PATCH v14 14/18] list_lru: add per-memcg lists Vladimir Davydov
2013-12-16 12:17   ` Vladimir Davydov
2013-12-16 12:17   ` Vladimir Davydov
2013-12-16 12:17 ` [PATCH v14 15/18] fs: make shrinker memcg aware Vladimir Davydov
2013-12-16 12:17   ` Vladimir Davydov
2013-12-16 12:17 ` [PATCH v14 16/18] vmpressure: in-kernel notifications Vladimir Davydov
2013-12-16 12:17   ` Vladimir Davydov
2013-12-20 14:26   ` Luiz Capitulino
2013-12-20 14:26     ` Luiz Capitulino
2013-12-20 14:31     ` Glauber Costa
2013-12-20 14:31       ` Glauber Costa
2013-12-20 14:32       ` Glauber Costa
2013-12-20 14:32         ` Glauber Costa
2013-12-20 14:36       ` Vladimir Davydov
2013-12-20 14:36         ` Vladimir Davydov
2013-12-20 15:03       ` Luiz Capitulino
2013-12-20 15:03         ` Luiz Capitulino
2013-12-20 16:44         ` Luiz Capitulino
2013-12-20 16:44           ` Luiz Capitulino
2013-12-20 16:46           ` Glauber Costa
2013-12-20 16:46             ` Glauber Costa
2013-12-20 16:46             ` Glauber Costa
2013-12-20 16:53             ` Luiz Capitulino
2013-12-20 16:53               ` Luiz Capitulino
2013-12-20 16:58               ` Glauber Costa
2013-12-20 16:58                 ` Glauber Costa
2013-12-20 17:00                 ` Luiz Capitulino
2013-12-20 17:00                   ` Luiz Capitulino
2013-12-16 12:17 ` [PATCH v14 17/18] memcg: reap dead memcgs upon global memory pressure Vladimir Davydov
2013-12-16 12:17   ` Vladimir Davydov
2013-12-16 12:17   ` Vladimir Davydov
2013-12-16 12:17 ` [PATCH v14 18/18] memcg: flush memcg items upon memcg destruction Vladimir Davydov
2013-12-16 12:17   ` Vladimir Davydov
