linux-mm.kvack.org archive mirror
* [PATCH v12 00/18] kmemcg shrinkers
@ 2013-12-02 11:19 Vladimir Davydov
  2013-12-02 11:19 ` [PATCH v12 01/18] memcg: make cache index determination more robust Vladimir Davydov
                   ` (18 more replies)
  0 siblings, 19 replies; 38+ messages in thread
From: Vladimir Davydov @ 2013-12-02 11:19 UTC (permalink / raw)
  To: hannes, mhocko, dchinner, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, vdavydov

Hi,

This is the 12th iteration of Glauber Costa's patchset implementing targeted
shrinking for memory cgroups when kmem limits are present. So far, we have been
accounting kernel objects but failing allocations when short of memory, because
our only option would be to call the global shrinker, which would deplete
objects from all caches and break isolation.

The main idea is to make the LRU lists used by the FS slab shrinkers per-memcg.
When adding or removing an element from the LRU, we use the page information to
figure out which memory cgroup it belongs to and route the element to the
appropriate list. This allows scanning kmem objects accounted to different
memory cgroups independently.
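
A minimal sketch of what the add path becomes with this in place (the helpers
come from the list_lru patches later in the series; 'cache_lru', 'object' and
its lru_node field are placeholders):

	struct list_lru *lru;

	/*
	 * Resolve the list_lru of the memory cgroup the object's page is
	 * accounted to ...
	 */
	lru = mem_cgroup_kmem_list_lru(&cache_lru, object);
	/* ... and put the object on that memcg's list. */
	list_lru_add(lru, &object->lru_node);

On memcg pressure, only the lists of the cgroup under pressure and its
descendants are then scanned.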

The patchset is based on top of Linux 3.13-rc2 and organized as follows:

 * patches 1-8 are for cleanup/preparation;
 * patch 9 introduces infrastructure for memcg-aware shrinkers;
 * patches 10 and 11 implement the per-memcg LRU list structure;
 * patch 12 uses per-memcg LRU lists to make dcache and icache shrinkers
   memcg-aware;
 * patch 13 implements kmem-only shrinking;
 * patches 14-18 issue kmem shrinking on limit resize and on global pressure.

Known issues:

 * Since FS shrinkers can't be run for allocations that lack __GFP_FS, such
   allocations will fail if the memcg kmem limit is less than the user limit
   and the memcg kmem usage is close to its limit. Glauber proposed to schedule
   a worker which would shrink kmem in the background on such allocations.
   However, this approach does not eliminate failures completely, it just makes
   them rarer. I'm thinking of implementing soft limits for memcg kmem so that
   hitting the soft limit would trigger the reclaimer but would not fail the
   allocation. I would appreciate any other proposals on how this can be fixed.

 * Only the dcache and icache are reclaimed on memcg pressure; other FS objects
   are left to global pressure only. However, it should not be a serious
   problem to make them reclaimable too by passing the memcg on to the FS layer
   and letting each FS decide whether its internal objects are shrinkable on
   memcg pressure.

Changelog:

Changes in v12:
 * Do not prune all slabs on kmem-only pressure.
 * Count all on-LRU pages eligible for reclaim to pass to shrink_slab().
 * Fix isolation issue due to using shrinker->nr_deferred on memcg pressure.
 * Add comments to memcg_list_lru functions.
 * Code cleanup/refactoring.

Changes in v11:
 * Rework per-memcg list_lru infrastructure.

Glauber Costa (7):
  memcg: make cache index determination more robust
  memcg: consolidate callers of memcg_cache_id
  memcg: move initialization to memcg creation
  memcg: allow kmem limit to be resized down
  vmpressure: in-kernel notifications
  memcg: reap dead memcgs upon global memory pressure
  memcg: flush memcg items upon memcg destruction

Vladimir Davydov (11):
  memcg: move several kmemcg functions upper
  fs: do not use destroy_super() in alloc_super() fail path
  vmscan: rename shrink_slab() args to make it more generic
  vmscan: move call to shrink_slab() to shrink_zones()
  vmscan: do_try_to_free_pages(): remove shrink_control argument
  vmscan: shrink slab on memcg pressure
  memcg,list_lru: add per-memcg LRU list infrastructure
  memcg,list_lru: add function walking over all lists of a per-memcg
    LRU
  fs: make icache, dcache shrinkers memcg-aware
  memcg: per-memcg kmem shrinking
  vmscan: take at least one pass with shrinkers

 fs/dcache.c                   |   25 +-
 fs/inode.c                    |   16 +-
 fs/internal.h                 |    9 +-
 fs/super.c                    |   48 ++-
 include/linux/fs.h            |    4 +-
 include/linux/list_lru.h      |   83 +++++
 include/linux/memcontrol.h    |   22 ++
 include/linux/mm.h            |    3 +-
 include/linux/shrinker.h      |   10 +-
 include/linux/swap.h          |    2 +
 include/linux/vmpressure.h    |    5 +
 include/trace/events/vmscan.h |   20 +-
 mm/memcontrol.c               |  728 ++++++++++++++++++++++++++++++++++++-----
 mm/vmpressure.c               |   53 ++-
 mm/vmscan.c                   |  249 +++++++++-----
 15 files changed, 1054 insertions(+), 223 deletions(-)

-- 
1.7.10.4


* [PATCH v12 01/18] memcg: make cache index determination more robust
  2013-12-02 11:19 [PATCH v12 00/18] kmemcg shrinkers Vladimir Davydov
@ 2013-12-02 11:19 ` Vladimir Davydov
  2013-12-02 11:19 ` [PATCH v12 02/18] memcg: consolidate callers of memcg_cache_id Vladimir Davydov
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 38+ messages in thread
From: Vladimir Davydov @ 2013-12-02 11:19 UTC (permalink / raw)
  To: hannes, mhocko, dchinner, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, vdavydov,
	Balbir Singh, KAMEZAWA Hiroyuki

From: Glauber Costa <glommer@openvz.org>

I caught myself doing something like the following outside memcg core:

	memcg_id = -1;
	if (memcg && memcg_kmem_is_active(memcg))
		memcg_id = memcg_cache_id(memcg);

to be able to handle all possible memcgs in a sane manner. In particular, the
root memcg will have kmemcg_id = -1 (simply because we don't call
memcg_kmem_init for it, since it is not limitable). We have always coped with
that by making sure we sanitize which memcg is passed to memcg_cache_id.
Although this example is given for root, what we really need to know is whether
or not a memcg is kmem-active.

But testing for root, for instance, is not trivial outside the memcg core since
we don't export mem_cgroup_is_root. I ended up realizing that these tests
really belong inside memcg_cache_id. This patch moves a similar but stronger
test inside memcg_cache_id and makes sure it always returns a meaningful value.
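
For illustration, with this change the snippet above can be reduced to

	memcg_id = memcg_cache_id(memcg);

since memcg_cache_id() itself now returns -1 for a NULL memcg and for any memcg
that is not kmem-limited.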

Signed-off-by: Glauber Costa <glommer@openvz.org>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f1a0ae6..02b5176 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3073,7 +3073,9 @@ void memcg_cache_list_add(struct mem_cgroup *memcg, struct kmem_cache *cachep)
  */
 int memcg_cache_id(struct mem_cgroup *memcg)
 {
-	return memcg ? memcg->kmemcg_id : -1;
+	if (!memcg || !memcg_can_account_kmem(memcg))
+		return -1;
+	return memcg->kmemcg_id;
 }
 
 /*
-- 
1.7.10.4


* [PATCH v12 02/18] memcg: consolidate callers of memcg_cache_id
  2013-12-02 11:19 [PATCH v12 00/18] kmemcg shrinkers Vladimir Davydov
  2013-12-02 11:19 ` [PATCH v12 01/18] memcg: make cache index determination more robust Vladimir Davydov
@ 2013-12-02 11:19 ` Vladimir Davydov
  2013-12-02 11:19 ` [PATCH v12 03/18] memcg: move initialization to memcg creation Vladimir Davydov
                   ` (16 subsequent siblings)
  18 siblings, 0 replies; 38+ messages in thread
From: Vladimir Davydov @ 2013-12-02 11:19 UTC (permalink / raw)
  To: hannes, mhocko, dchinner, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, vdavydov,
	Balbir Singh, KAMEZAWA Hiroyuki

From: Glauber Costa <glommer@openvz.org>

Each caller of memcg_cache_id ends up sanitizing its parameters in its own way.
Now that memcg_cache_id itself is more robust, we can consolidate this.

Also, as suggested by Michal, a special helper, memcg_cache_idx, is used when
the result is expected to be used directly as an array index, to make sure we
never access a negative index.

Signed-off-by: Glauber Costa <glommer@openvz.org>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   49 +++++++++++++++++++++++++++++--------------------
 1 file changed, 29 insertions(+), 20 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 02b5176..144cb4c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2960,6 +2960,30 @@ static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
 }
 
 /*
+ * helper for acessing a memcg's index. It will be used as an index in the
+ * child cache array in kmem_cache, and also to derive its name. This function
+ * will return -1 when this is not a kmem-limited memcg.
+ */
+int memcg_cache_id(struct mem_cgroup *memcg)
+{
+	if (!memcg || !memcg_can_account_kmem(memcg))
+		return -1;
+	return memcg->kmemcg_id;
+}
+
+/*
+ * This helper around memcg_cache_id is not intented for use outside memcg
+ * core. It is meant for places where the cache id is used directly as an array
+ * index
+ */
+static int memcg_cache_idx(struct mem_cgroup *memcg)
+{
+	int ret = memcg_cache_id(memcg);
+	BUG_ON(ret < 0);
+	return ret;
+}
+
+/*
  * This is a bit cumbersome, but it is rarely used and avoids a backpointer
  * in the memcg_cache_params struct.
  */
@@ -2969,7 +2993,7 @@ static struct kmem_cache *memcg_params_to_cache(struct memcg_cache_params *p)
 
 	VM_BUG_ON(p->is_root_cache);
 	cachep = p->root_cache;
-	return cache_from_memcg_idx(cachep, memcg_cache_id(p->memcg));
+	return cache_from_memcg_idx(cachep, memcg_cache_idx(p->memcg));
 }
 
 #ifdef CONFIG_SLABINFO
@@ -3067,18 +3091,6 @@ void memcg_cache_list_add(struct mem_cgroup *memcg, struct kmem_cache *cachep)
 }
 
 /*
- * helper for acessing a memcg's index. It will be used as an index in the
- * child cache array in kmem_cache, and also to derive its name. This function
- * will return -1 when this is not a kmem-limited memcg.
- */
-int memcg_cache_id(struct mem_cgroup *memcg)
-{
-	if (!memcg || !memcg_can_account_kmem(memcg))
-		return -1;
-	return memcg->kmemcg_id;
-}
-
-/*
  * This ends up being protected by the set_limit mutex, during normal
  * operation, because that is its main call site.
  *
@@ -3240,7 +3252,7 @@ void memcg_release_cache(struct kmem_cache *s)
 		goto out;
 
 	memcg = s->memcg_params->memcg;
-	id  = memcg_cache_id(memcg);
+	id = memcg_cache_idx(memcg);
 
 	root = s->memcg_params->root_cache;
 	root->memcg_params->memcg_caches[id] = NULL;
@@ -3403,9 +3415,7 @@ static struct kmem_cache *memcg_create_kmem_cache(struct mem_cgroup *memcg,
 	struct kmem_cache *new_cachep;
 	int idx;
 
-	BUG_ON(!memcg_can_account_kmem(memcg));
-
-	idx = memcg_cache_id(memcg);
+	idx = memcg_cache_idx(memcg);
 
 	mutex_lock(&memcg_cache_mutex);
 	new_cachep = cache_from_memcg_idx(cachep, idx);
@@ -3578,10 +3588,9 @@ struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep,
 	rcu_read_lock();
 	memcg = mem_cgroup_from_task(rcu_dereference(current->mm->owner));
 
-	if (!memcg_can_account_kmem(memcg))
-		goto out;
-
 	idx = memcg_cache_id(memcg);
+	if (idx < 0)
+		goto out;
 
 	/*
 	 * barrier to mare sure we're always seeing the up to date value.  The
-- 
1.7.10.4


* [PATCH v12 03/18] memcg: move initialization to memcg creation
  2013-12-02 11:19 [PATCH v12 00/18] kmemcg shrinkers Vladimir Davydov
  2013-12-02 11:19 ` [PATCH v12 01/18] memcg: make cache index determination more robust Vladimir Davydov
  2013-12-02 11:19 ` [PATCH v12 02/18] memcg: consolidate callers of memcg_cache_id Vladimir Davydov
@ 2013-12-02 11:19 ` Vladimir Davydov
  2013-12-02 11:19 ` [PATCH v12 04/18] memcg: move several kmemcg functions upper Vladimir Davydov
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 38+ messages in thread
From: Vladimir Davydov @ 2013-12-02 11:19 UTC (permalink / raw)
  To: hannes, mhocko, dchinner, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, vdavydov,
	Balbir Singh, KAMEZAWA Hiroyuki

From: Glauber Costa <glommer@openvz.org>

Those structures are only used by memcgs that are actually using kmemcg.
However, in a later patch I intend to scan that list unconditionally (an empty
list meaning no kmem caches are present), which simplifies the code a lot.

So move the initialization to memcg_init_kmem(), i.e. to memcg creation.

Signed-off-by: Glauber Costa <glommer@openvz.org>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 144cb4c..3a4e2f8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3122,8 +3122,6 @@ int memcg_update_cache_sizes(struct mem_cgroup *memcg)
 	}
 
 	memcg->kmemcg_id = num;
-	INIT_LIST_HEAD(&memcg->memcg_slab_caches);
-	mutex_init(&memcg->slab_caches_mutex);
 	return 0;
 }
 
@@ -5909,6 +5907,8 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 {
 	int ret;
 
+	INIT_LIST_HEAD(&memcg->memcg_slab_caches);
+	mutex_init(&memcg->slab_caches_mutex);
 	memcg->kmemcg_id = -1;
 	ret = memcg_propagate_kmem(memcg);
 	if (ret)
-- 
1.7.10.4


* [PATCH v12 04/18] memcg: move several kmemcg functions upper
  2013-12-02 11:19 [PATCH v12 00/18] kmemcg shrinkers Vladimir Davydov
                   ` (2 preceding siblings ...)
  2013-12-02 11:19 ` [PATCH v12 03/18] memcg: move initialization to memcg creation Vladimir Davydov
@ 2013-12-02 11:19 ` Vladimir Davydov
  2013-12-02 11:19 ` [PATCH v12 05/18] fs: do not use destroy_super() in alloc_super() fail path Vladimir Davydov
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 38+ messages in thread
From: Vladimir Davydov @ 2013-12-02 11:19 UTC (permalink / raw)
  To: hannes, mhocko, dchinner, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, vdavydov,
	Balbir Singh, KAMEZAWA Hiroyuki

I need to move memcg_{stop,resume}_kmem_account() and
memcg_caches_array_size() higher up in the file since I am going to use them in
the per-memcg LRU implementation introduced by the following patches. These
functions are very simple and do not depend on other kmemcg bits, so it is
better to keep them near the top anyway.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   92 +++++++++++++++++++++++++++----------------------------
 1 file changed, 46 insertions(+), 46 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3a4e2f8..3a92ab3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2983,6 +2983,52 @@ static int memcg_cache_idx(struct mem_cgroup *memcg)
 	return ret;
 }
 
+static size_t memcg_caches_array_size(int num_groups)
+{
+	ssize_t size;
+	if (num_groups <= 0)
+		return 0;
+
+	size = 2 * num_groups;
+	if (size < MEMCG_CACHES_MIN_SIZE)
+		size = MEMCG_CACHES_MIN_SIZE;
+	else if (size > MEMCG_CACHES_MAX_SIZE)
+		size = MEMCG_CACHES_MAX_SIZE;
+
+	return size;
+}
+
+/*
+ * During the creation a new cache, we need to disable our accounting mechanism
+ * altogether. This is true even if we are not creating, but rather just
+ * enqueing new caches to be created.
+ *
+ * This is because that process will trigger allocations; some visible, like
+ * explicit kmallocs to auxiliary data structures, name strings and internal
+ * cache structures; some well concealed, like INIT_WORK() that can allocate
+ * objects during debug.
+ *
+ * If any allocation happens during memcg_kmem_get_cache, we will recurse back
+ * to it. This may not be a bounded recursion: since the first cache creation
+ * failed to complete (waiting on the allocation), we'll just try to create the
+ * cache again, failing at the same point.
+ *
+ * memcg_kmem_get_cache is prepared to abort after seeing a positive count of
+ * memcg_kmem_skip_account. So we enclose anything that might allocate memory
+ * inside the following two functions.
+ */
+static inline void memcg_stop_kmem_account(void)
+{
+	VM_BUG_ON(!current->mm);
+	current->memcg_kmem_skip_account++;
+}
+
+static inline void memcg_resume_kmem_account(void)
+{
+	VM_BUG_ON(!current->mm);
+	current->memcg_kmem_skip_account--;
+}
+
 /*
  * This is a bit cumbersome, but it is rarely used and avoids a backpointer
  * in the memcg_cache_params struct.
@@ -3125,21 +3171,6 @@ int memcg_update_cache_sizes(struct mem_cgroup *memcg)
 	return 0;
 }
 
-static size_t memcg_caches_array_size(int num_groups)
-{
-	ssize_t size;
-	if (num_groups <= 0)
-		return 0;
-
-	size = 2 * num_groups;
-	if (size < MEMCG_CACHES_MIN_SIZE)
-		size = MEMCG_CACHES_MIN_SIZE;
-	else if (size > MEMCG_CACHES_MAX_SIZE)
-		size = MEMCG_CACHES_MAX_SIZE;
-
-	return size;
-}
-
 /*
  * We should update the current array size iff all caches updates succeed. This
  * can only be done from the slab side. The slab mutex needs to be held when
@@ -3264,37 +3295,6 @@ out:
 	kfree(s->memcg_params);
 }
 
-/*
- * During the creation a new cache, we need to disable our accounting mechanism
- * altogether. This is true even if we are not creating, but rather just
- * enqueing new caches to be created.
- *
- * This is because that process will trigger allocations; some visible, like
- * explicit kmallocs to auxiliary data structures, name strings and internal
- * cache structures; some well concealed, like INIT_WORK() that can allocate
- * objects during debug.
- *
- * If any allocation happens during memcg_kmem_get_cache, we will recurse back
- * to it. This may not be a bounded recursion: since the first cache creation
- * failed to complete (waiting on the allocation), we'll just try to create the
- * cache again, failing at the same point.
- *
- * memcg_kmem_get_cache is prepared to abort after seeing a positive count of
- * memcg_kmem_skip_account. So we enclose anything that might allocate memory
- * inside the following two functions.
- */
-static inline void memcg_stop_kmem_account(void)
-{
-	VM_BUG_ON(!current->mm);
-	current->memcg_kmem_skip_account++;
-}
-
-static inline void memcg_resume_kmem_account(void)
-{
-	VM_BUG_ON(!current->mm);
-	current->memcg_kmem_skip_account--;
-}
-
 static void kmem_cache_destroy_work_func(struct work_struct *w)
 {
 	struct kmem_cache *cachep;
-- 
1.7.10.4


* [PATCH v12 05/18] fs: do not use destroy_super() in alloc_super() fail path
  2013-12-02 11:19 [PATCH v12 00/18] kmemcg shrinkers Vladimir Davydov
                   ` (3 preceding siblings ...)
  2013-12-02 11:19 ` [PATCH v12 04/18] memcg: move several kmemcg functions upper Vladimir Davydov
@ 2013-12-02 11:19 ` Vladimir Davydov
  2013-12-03  9:00   ` Dave Chinner
  2013-12-02 11:19 ` [PATCH v12 06/18] vmscan: rename shrink_slab() args to make it more generic Vladimir Davydov
                   ` (13 subsequent siblings)
  18 siblings, 1 reply; 38+ messages in thread
From: Vladimir Davydov @ 2013-12-02 11:19 UTC (permalink / raw)
  To: hannes, mhocko, dchinner, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, vdavydov, Al Viro

Using destroy_super() in the alloc_super() fail path is bad, because:

* It will trigger WARN_ON(!list_empty(&s->s_mounts)) since s_mounts is
  initialized after several 'goto fail's.
* It will call kfree_rcu() to free the super block although kfree() is
  obviously enough there.
* The list_lru structure was not designed with destruction of an
  uninitialized object in mind.

I'm going to replace the conventional list_lru with a per-memcg LRU to
implement per-memcg slab reclaim. The new structure will not cope with
destruction of objects that haven't been properly initialized, so let's inline
the appropriate snippets from destroy_super() into the alloc_super() fail path
instead of calling the whole function there.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
---
 fs/super.c |    9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index e5f6c2c..cece164 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -185,8 +185,10 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 
 	if (list_lru_init(&s->s_dentry_lru))
 		goto fail;
-	if (list_lru_init(&s->s_inode_lru))
+	if (list_lru_init(&s->s_inode_lru)) {
+		list_lru_destroy(&s->s_dentry_lru);
 		goto fail;
+	}
 
 	INIT_LIST_HEAD(&s->s_mounts);
 	init_rwsem(&s->s_umount);
@@ -227,7 +229,10 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 	return s;
 
 fail:
-	destroy_super(s);
+	for (i = 0; i < SB_FREEZE_LEVELS; i++)
+		percpu_counter_destroy(&s->s_writers.counter[i]);
+	security_sb_free(s);
+	kfree(s);
 	return NULL;
 }
 
-- 
1.7.10.4


* [PATCH v12 06/18] vmscan: rename shrink_slab() args to make it more generic
  2013-12-02 11:19 [PATCH v12 00/18] kmemcg shrinkers Vladimir Davydov
                   ` (4 preceding siblings ...)
  2013-12-02 11:19 ` [PATCH v12 05/18] fs: do not use destroy_super() in alloc_super() fail path Vladimir Davydov
@ 2013-12-02 11:19 ` Vladimir Davydov
  2013-12-03  9:33   ` Dave Chinner
  2013-12-02 11:19 ` [PATCH v12 07/18] vmscan: move call to shrink_slab() to shrink_zones() Vladimir Davydov
                   ` (12 subsequent siblings)
  18 siblings, 1 reply; 38+ messages in thread
From: Vladimir Davydov @ 2013-12-02 11:19 UTC (permalink / raw)
  To: hannes, mhocko, dchinner, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, vdavydov,
	Mel Gorman, Rik van Riel

Currently, in addition to a shrink_control struct, shrink_slab() takes two
arguments, nr_pages_scanned and lru_pages, which are used for balancing slab
reclaim versus page reclaim - roughly speaking, shrink_slab() will try to scan
an nr_pages_scanned/lru_pages fraction of all slab objects. However,
shrink_slab() is not always called after page cache reclaim. For example,
drop_slab() uses shrink_slab() to drop as many slab objects as possible and
thus has to pass it the phony values 1000/1000, which make no sense as
nr_pages_scanned/lru_pages. Moreover, as soon as kmemcg reclaim is introduced,
we will have to make up phony values for nr_pages_scanned and lru_pages again
when doing kmem-only reclaim for a memory cgroup, which is possible if the
cgroup's kmem limit is lower than its total memory limit.
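
To illustrate the balancing (the numbers below are made up), shrink_slab_node()
computes

	delta = (4 * nr_pages_scanned) / shrinker->seeks;
	delta *= max_pass;
	do_div(delta, lru_pages + 1);

so scanning 100 of 10000 LRU pages against a cache holding max_pass = 1000
objects with DEFAULT_SEEKS (2) yields delta = (4 * 100 / 2) * 1000 / 10001,
i.e. about 20 objects - 2% of the cache for 1% of pages scanned. Renaming the
arguments to fraction/denominator does not change this math; it only stops
pretending that every caller has page reclaim statistics at hand.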

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
---
 include/linux/mm.h            |    3 +--
 include/trace/events/vmscan.h |   20 ++++++++++----------
 mm/vmscan.c                   |   26 +++++++++++++-------------
 3 files changed, 24 insertions(+), 25 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1cedd00..71c7f50 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1926,8 +1926,7 @@ int drop_caches_sysctl_handler(struct ctl_table *, int,
 #endif
 
 unsigned long shrink_slab(struct shrink_control *shrink,
-			  unsigned long nr_pages_scanned,
-			  unsigned long lru_pages);
+			  unsigned long fraction, unsigned long denominator);
 
 #ifndef CONFIG_MMU
 #define randomize_va_space 0
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 132a985..6bed4ab 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -181,11 +181,11 @@ DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_memcg_softlimit_re
 
 TRACE_EVENT(mm_shrink_slab_start,
 	TP_PROTO(struct shrinker *shr, struct shrink_control *sc,
-		long nr_objects_to_shrink, unsigned long pgs_scanned,
-		unsigned long lru_pgs, unsigned long cache_items,
+		long nr_objects_to_shrink, unsigned long frac,
+		unsigned long denom, unsigned long cache_items,
 		unsigned long long delta, unsigned long total_scan),
 
-	TP_ARGS(shr, sc, nr_objects_to_shrink, pgs_scanned, lru_pgs,
+	TP_ARGS(shr, sc, nr_objects_to_shrink, frac, denom,
 		cache_items, delta, total_scan),
 
 	TP_STRUCT__entry(
@@ -193,8 +193,8 @@ TRACE_EVENT(mm_shrink_slab_start,
 		__field(void *, shrink)
 		__field(long, nr_objects_to_shrink)
 		__field(gfp_t, gfp_flags)
-		__field(unsigned long, pgs_scanned)
-		__field(unsigned long, lru_pgs)
+		__field(unsigned long, frac)
+		__field(unsigned long, denom)
 		__field(unsigned long, cache_items)
 		__field(unsigned long long, delta)
 		__field(unsigned long, total_scan)
@@ -205,20 +205,20 @@ TRACE_EVENT(mm_shrink_slab_start,
 		__entry->shrink = shr->scan_objects;
 		__entry->nr_objects_to_shrink = nr_objects_to_shrink;
 		__entry->gfp_flags = sc->gfp_mask;
-		__entry->pgs_scanned = pgs_scanned;
-		__entry->lru_pgs = lru_pgs;
+		__entry->frac = frac;
+		__entry->denom = denom;
 		__entry->cache_items = cache_items;
 		__entry->delta = delta;
 		__entry->total_scan = total_scan;
 	),
 
-	TP_printk("%pF %p: objects to shrink %ld gfp_flags %s pgs_scanned %ld lru_pgs %ld cache items %ld delta %lld total_scan %ld",
+	TP_printk("%pF %p: objects to shrink %ld gfp_flags %s frac %ld denom %ld cache items %ld delta %lld total_scan %ld",
 		__entry->shrink,
 		__entry->shr,
 		__entry->nr_objects_to_shrink,
 		show_gfp_flags(__entry->gfp_flags),
-		__entry->pgs_scanned,
-		__entry->lru_pgs,
+		__entry->frac,
+		__entry->denom,
 		__entry->cache_items,
 		__entry->delta,
 		__entry->total_scan)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eea668d..6946997 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -219,7 +219,7 @@ EXPORT_SYMBOL(unregister_shrinker);
 
 static unsigned long
 shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
-		 unsigned long nr_pages_scanned, unsigned long lru_pages)
+		 unsigned long fraction, unsigned long denominator)
 {
 	unsigned long freed = 0;
 	unsigned long long delta;
@@ -243,9 +243,9 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
 	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
 
 	total_scan = nr;
-	delta = (4 * nr_pages_scanned) / shrinker->seeks;
+	delta = (4 * fraction) / shrinker->seeks;
 	delta *= max_pass;
-	do_div(delta, lru_pages + 1);
+	do_div(delta, denominator + 1);
 	total_scan += delta;
 	if (total_scan < 0) {
 		printk(KERN_ERR
@@ -278,7 +278,7 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
 		total_scan = max_pass * 2;
 
 	trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
-				nr_pages_scanned, lru_pages,
+				fraction, denominator,
 				max_pass, delta, total_scan);
 
 	while (total_scan >= batch_size) {
@@ -322,23 +322,23 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
  * If the vm encountered mapped pages on the LRU it increase the pressure on
  * slab to avoid swapping.
  *
- * We do weird things to avoid (scanned*seeks*entries) overflowing 32 bits.
+ * We do weird things to avoid (fraction*seeks*entries) overflowing 32 bits.
  *
- * `lru_pages' represents the number of on-LRU pages in all the zones which
- * are eligible for the caller's allocation attempt.  It is used for balancing
- * slab reclaim versus page reclaim.
+ * `fraction' and `denominator' are used for balancing slab reclaim versus page
+ * reclaim. To scan slab objects proportionally to page cache, pass the number
+ * of pages scanned and the total number of on-LRU pages in all the zones which
+ * are eligible for the caller's allocation attempt respectively.
  *
  * Returns the number of slab objects which we shrunk.
  */
 unsigned long shrink_slab(struct shrink_control *shrinkctl,
-			  unsigned long nr_pages_scanned,
-			  unsigned long lru_pages)
+			  unsigned long fraction, unsigned long denominator)
 {
 	struct shrinker *shrinker;
 	unsigned long freed = 0;
 
-	if (nr_pages_scanned == 0)
-		nr_pages_scanned = SWAP_CLUSTER_MAX;
+	if (fraction == 0)
+		fraction = SWAP_CLUSTER_MAX;
 
 	if (!down_read_trylock(&shrinker_rwsem)) {
 		/*
@@ -361,7 +361,7 @@ unsigned long shrink_slab(struct shrink_control *shrinkctl,
 				break;
 
 			freed += shrink_slab_node(shrinkctl, shrinker,
-				 nr_pages_scanned, lru_pages);
+						  fraction, denominator);
 
 		}
 	}
-- 
1.7.10.4


* [PATCH v12 07/18] vmscan: move call to shrink_slab() to shrink_zones()
  2013-12-02 11:19 [PATCH v12 00/18] kmemcg shrinkers Vladimir Davydov
                   ` (5 preceding siblings ...)
  2013-12-02 11:19 ` [PATCH v12 06/18] vmscan: rename shrink_slab() args to make it more generic Vladimir Davydov
@ 2013-12-02 11:19 ` Vladimir Davydov
  2013-12-02 11:19 ` [PATCH v12 08/18] vmscan: do_try_to_free_pages(): remove shrink_control argument Vladimir Davydov
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 38+ messages in thread
From: Vladimir Davydov @ 2013-12-02 11:19 UTC (permalink / raw)
  To: hannes, mhocko, dchinner, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, vdavydov,
	Mel Gorman, Rik van Riel

This reduces the indentation level of do_try_to_free_pages() and removes the
extra loop over all eligible zones that counts the number of on-LRU pages.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c |   57 ++++++++++++++++++++++++++-------------------------------
 1 file changed, 26 insertions(+), 31 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6946997..ba1ad6e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2273,13 +2273,17 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
  * the caller that it should consider retrying the allocation instead of
  * further reclaim.
  */
-static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
+static bool shrink_zones(struct zonelist *zonelist,
+			 struct scan_control *sc,
+			 struct shrink_control *shrink)
 {
 	struct zoneref *z;
 	struct zone *zone;
 	unsigned long nr_soft_reclaimed;
 	unsigned long nr_soft_scanned;
+	unsigned long lru_pages = 0;
 	bool aborted_reclaim = false;
+	struct reclaim_state *reclaim_state = current->reclaim_state;
 
 	/*
 	 * If the number of buffer_heads in the machine exceeds the maximum
@@ -2289,6 +2293,8 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	if (buffer_heads_over_limit)
 		sc->gfp_mask |= __GFP_HIGHMEM;
 
+	nodes_clear(shrink->nodes_to_scan);
+
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 					gfp_zone(sc->gfp_mask), sc->nodemask) {
 		if (!populated_zone(zone))
@@ -2300,6 +2306,10 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 		if (global_reclaim(sc)) {
 			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 				continue;
+
+			lru_pages += zone_reclaimable_pages(zone);
+			node_set(zone_to_nid(zone), shrink->nodes_to_scan);
+
 			if (sc->priority != DEF_PRIORITY &&
 			    !zone_reclaimable(zone))
 				continue;	/* Let kswapd poll it */
@@ -2336,6 +2346,20 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 		shrink_zone(zone, sc);
 	}
 
+	/*
+	 * Don't shrink slabs when reclaiming memory from over limit
+	 * cgroups but do shrink slab at least once when aborting
+	 * reclaim for compaction to avoid unevenly scanning file/anon
+	 * LRU pages over slab pages.
+	 */
+	if (global_reclaim(sc)) {
+		shrink_slab(shrink, sc->nr_scanned, lru_pages);
+		if (reclaim_state) {
+			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
+			reclaim_state->reclaimed_slab = 0;
+		}
+	}
+
 	return aborted_reclaim;
 }
 
@@ -2380,9 +2404,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 					struct shrink_control *shrink)
 {
 	unsigned long total_scanned = 0;
-	struct reclaim_state *reclaim_state = current->reclaim_state;
-	struct zoneref *z;
-	struct zone *zone;
 	unsigned long writeback_threshold;
 	bool aborted_reclaim;
 
@@ -2395,34 +2416,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
 				sc->priority);
 		sc->nr_scanned = 0;
-		aborted_reclaim = shrink_zones(zonelist, sc);
-
-		/*
-		 * Don't shrink slabs when reclaiming memory from over limit
-		 * cgroups but do shrink slab at least once when aborting
-		 * reclaim for compaction to avoid unevenly scanning file/anon
-		 * LRU pages over slab pages.
-		 */
-		if (global_reclaim(sc)) {
-			unsigned long lru_pages = 0;
+		aborted_reclaim = shrink_zones(zonelist, sc, shrink);
 
-			nodes_clear(shrink->nodes_to_scan);
-			for_each_zone_zonelist(zone, z, zonelist,
-					gfp_zone(sc->gfp_mask)) {
-				if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
-					continue;
-
-				lru_pages += zone_reclaimable_pages(zone);
-				node_set(zone_to_nid(zone),
-					 shrink->nodes_to_scan);
-			}
-
-			shrink_slab(shrink, sc->nr_scanned, lru_pages);
-			if (reclaim_state) {
-				sc->nr_reclaimed += reclaim_state->reclaimed_slab;
-				reclaim_state->reclaimed_slab = 0;
-			}
-		}
 		total_scanned += sc->nr_scanned;
 		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
 			goto out;
-- 
1.7.10.4


* [PATCH v12 08/18] vmscan: do_try_to_free_pages(): remove shrink_control argument
  2013-12-02 11:19 [PATCH v12 00/18] kmemcg shrinkers Vladimir Davydov
                   ` (6 preceding siblings ...)
  2013-12-02 11:19 ` [PATCH v12 07/18] vmscan: move call to shrink_slab() to shrink_zones() Vladimir Davydov
@ 2013-12-02 11:19 ` Vladimir Davydov
  2013-12-02 11:19 ` [PATCH v12 09/18] vmscan: shrink slab on memcg pressure Vladimir Davydov
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 38+ messages in thread
From: Vladimir Davydov @ 2013-12-02 11:19 UTC (permalink / raw)
  To: hannes, mhocko, dchinner, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, vdavydov,
	Mel Gorman, Rik van Riel

There is no need to pass a shrink_control struct from try_to_free_pages() and
friends down to do_try_to_free_pages() and then to shrink_zones(): it is only
used in shrink_zones(), and the only field initialized at the top level is
gfp_mask, which always equals scan_control.gfp_mask, already known to
shrink_zones(). So let's move the shrink_control initialization into
shrink_zones().

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c |   32 ++++++++++++--------------------
 1 file changed, 12 insertions(+), 20 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index ba1ad6e..7601b95 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2274,8 +2274,7 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
  * further reclaim.
  */
 static bool shrink_zones(struct zonelist *zonelist,
-			 struct scan_control *sc,
-			 struct shrink_control *shrink)
+			 struct scan_control *sc)
 {
 	struct zoneref *z;
 	struct zone *zone;
@@ -2284,6 +2283,9 @@ static bool shrink_zones(struct zonelist *zonelist,
 	unsigned long lru_pages = 0;
 	bool aborted_reclaim = false;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
+	struct shrink_control shrink = {
+		.gfp_mask = sc->gfp_mask,
+	};
 
 	/*
 	 * If the number of buffer_heads in the machine exceeds the maximum
@@ -2293,7 +2295,7 @@ static bool shrink_zones(struct zonelist *zonelist,
 	if (buffer_heads_over_limit)
 		sc->gfp_mask |= __GFP_HIGHMEM;
 
-	nodes_clear(shrink->nodes_to_scan);
+	nodes_clear(shrink.nodes_to_scan);
 
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 					gfp_zone(sc->gfp_mask), sc->nodemask) {
@@ -2308,7 +2310,7 @@ static bool shrink_zones(struct zonelist *zonelist,
 				continue;
 
 			lru_pages += zone_reclaimable_pages(zone);
-			node_set(zone_to_nid(zone), shrink->nodes_to_scan);
+			node_set(zone_to_nid(zone), shrink.nodes_to_scan);
 
 			if (sc->priority != DEF_PRIORITY &&
 			    !zone_reclaimable(zone))
@@ -2353,7 +2355,7 @@ static bool shrink_zones(struct zonelist *zonelist,
 	 * LRU pages over slab pages.
 	 */
 	if (global_reclaim(sc)) {
-		shrink_slab(shrink, sc->nr_scanned, lru_pages);
+		shrink_slab(&shrink, sc->nr_scanned, lru_pages);
 		if (reclaim_state) {
 			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
 			reclaim_state->reclaimed_slab = 0;
@@ -2400,8 +2402,7 @@ static bool all_unreclaimable(struct zonelist *zonelist,
  * 		else, the number of pages reclaimed
  */
 static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
-					struct scan_control *sc,
-					struct shrink_control *shrink)
+					  struct scan_control *sc)
 {
 	unsigned long total_scanned = 0;
 	unsigned long writeback_threshold;
@@ -2416,7 +2417,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
 				sc->priority);
 		sc->nr_scanned = 0;
-		aborted_reclaim = shrink_zones(zonelist, sc, shrink);
+		aborted_reclaim = shrink_zones(zonelist, sc);
 
 		total_scanned += sc->nr_scanned;
 		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
@@ -2579,9 +2580,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 		.target_mem_cgroup = NULL,
 		.nodemask = nodemask,
 	};
-	struct shrink_control shrink = {
-		.gfp_mask = sc.gfp_mask,
-	};
 
 	/*
 	 * Do not enter reclaim if fatal signal was delivered while throttled.
@@ -2595,7 +2593,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 				sc.may_writepage,
 				gfp_mask);
 
-	nr_reclaimed = do_try_to_free_pages(zonelist, &sc, &shrink);
+	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
 
 	trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
 
@@ -2662,9 +2660,6 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 		.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
 				(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
 	};
-	struct shrink_control shrink = {
-		.gfp_mask = sc.gfp_mask,
-	};
 
 	/*
 	 * Unlike direct reclaim via alloc_pages(), memcg's reclaim doesn't
@@ -2679,7 +2674,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 					    sc.may_writepage,
 					    sc.gfp_mask);
 
-	nr_reclaimed = do_try_to_free_pages(zonelist, &sc, &shrink);
+	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
 
 	trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
 
@@ -3335,9 +3330,6 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
 		.order = 0,
 		.priority = DEF_PRIORITY,
 	};
-	struct shrink_control shrink = {
-		.gfp_mask = sc.gfp_mask,
-	};
 	struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask);
 	struct task_struct *p = current;
 	unsigned long nr_reclaimed;
@@ -3347,7 +3339,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
-	nr_reclaimed = do_try_to_free_pages(zonelist, &sc, &shrink);
+	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
 
 	p->reclaim_state = NULL;
 	lockdep_clear_current_reclaim_state();
-- 
1.7.10.4


* [PATCH v12 09/18] vmscan: shrink slab on memcg pressure
  2013-12-02 11:19 [PATCH v12 00/18] kmemcg shrinkers Vladimir Davydov
                   ` (7 preceding siblings ...)
  2013-12-02 11:19 ` [PATCH v12 08/18] vmscan: do_try_to_free_pages(): remove shrink_control argument Vladimir Davydov
@ 2013-12-02 11:19 ` Vladimir Davydov
  2013-12-03 10:48   ` Dave Chinner
  2013-12-02 11:19 ` [PATCH v12 10/18] memcg,list_lru: add per-memcg LRU list infrastructure Vladimir Davydov
                   ` (9 subsequent siblings)
  18 siblings, 1 reply; 38+ messages in thread
From: Vladimir Davydov @ 2013-12-02 11:19 UTC (permalink / raw)
  To: hannes, mhocko, dchinner, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, vdavydov,
	Mel Gorman, Rik van Riel, Al Viro, Balbir Singh,
	KAMEZAWA Hiroyuki

This patch makes the direct reclaim path shrink slabs not only on global
memory pressure, but also when we hit a memory cgroup limit. To achieve that,
it introduces a new per-shrinker flag, SHRINKER_MEMCG_AWARE, which should be
set if the shrinker can handle per-memcg reclaim. For such shrinkers,
shrink_slab() will iterate over all eligible memory cgroups (i.e. the cgroup
that triggered the reclaim and all its descendants) and pass the current
memory cgroup to the shrinker in shrink_control.memcg, just like it passes the
current NUMA node to NUMA-aware shrinkers. It is completely up to memcg-aware
shrinkers how to organize their objects in order to provide the required
functionality. Currently none of the existing shrinkers is memcg-aware, but
the following patches will introduce a per-memcg list_lru, which will make it
easy to turn shrinkers that use list_lru into memcg-aware ones.
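
For illustration only, a memcg-aware shrinker is expected to end up looking
roughly like this (a sketch, not code from this series: my_lru is a struct
memcg_list_lru from the next patches, and my_count/my_scan are made-up names):

	static unsigned long
	my_count(struct shrinker *shrink, struct shrink_control *sc)
	{
		/* count only objects accounted to sc->memcg on node sc->nid */
		return list_lru_count_node(mem_cgroup_list_lru(&my_lru, sc->memcg),
					   sc->nid);
	}

	static struct shrinker my_shrinker = {
		.count_objects	= my_count,
		.scan_objects	= my_scan,
		.seeks		= DEFAULT_SEEKS,
		.flags		= SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE,
	};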

The number of slab objects scanned on memcg pressure is calculated in
the same way as on global pressure - it is proportional to the number of
pages scanned over the number of pages eligible for reclaim (i.e. the
number of on-LRU pages in the target memcg and all its descendants) -
except we do not employ the nr_deferred per-shrinker counter to avoid
memory cgroup isolation issues. Ideally, this counter should be made
per-memcg.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h |   22 +++++++++
 include/linux/shrinker.h   |   10 +++-
 mm/memcontrol.c            |   37 +++++++++++++-
 mm/vmscan.c                |  117 +++++++++++++++++++++++++++++++-------------
 4 files changed, 150 insertions(+), 36 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b3e7a66..c0f24a9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -80,6 +80,9 @@ extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
 
+unsigned long mem_cgroup_zone_reclaimable_pages(struct zone *,
+						struct mem_cgroup *);
+
 /* For coalescing uncharge for reducing memcg' overhead*/
 extern void mem_cgroup_uncharge_start(void);
 extern void mem_cgroup_uncharge_end(void);
@@ -289,6 +292,12 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
 	return &zone->lruvec;
 }
 
+static inline unsigned long mem_cgroup_zone_reclaimable_pages(struct zone *zone,
+							struct mem_cgroup *)
+{
+	return 0;
+}
+
 static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 {
 	return NULL;
@@ -479,6 +488,9 @@ static inline bool memcg_kmem_enabled(void)
 	return static_key_false(&memcg_kmem_enabled_key);
 }
 
+bool memcg_kmem_is_active(struct mem_cgroup *memcg);
+bool memcg_kmem_should_reclaim(struct mem_cgroup *memcg);
+
 /*
  * In general, we'll do everything in our power to not incur in any overhead
  * for non-memcg users for the kmem functions. Not even a function call, if we
@@ -620,6 +632,16 @@ static inline bool memcg_kmem_enabled(void)
 	return false;
 }
 
+static inline bool memcg_kmem_is_active(struct mem_cgroup *memcg)
+{
+	return false;
+}
+
+static inline bool memcg_kmem_should_reclaim(struct mem_cgroup *memcg)
+{
+	return false;
+}
+
 static inline bool
 memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
 {
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 68c0970..ab79b17 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -20,8 +20,15 @@ struct shrink_control {
 
 	/* shrink from these nodes */
 	nodemask_t nodes_to_scan;
+
+	/* shrink from this memory cgroup hierarchy (if not NULL) */
+	struct mem_cgroup *target_mem_cgroup;
+
 	/* current node being shrunk (for NUMA aware shrinkers) */
 	int nid;
+
+	/* current memcg being shrunk (for memcg aware shrinkers) */
+	struct mem_cgroup *memcg;
 };
 
 #define SHRINK_STOP (~0UL)
@@ -63,7 +70,8 @@ struct shrinker {
 #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
 
 /* Flags */
-#define SHRINKER_NUMA_AWARE (1 << 0)
+#define SHRINKER_NUMA_AWARE	(1 << 0)
+#define SHRINKER_MEMCG_AWARE	(1 << 1)
 
 extern int register_shrinker(struct shrinker *);
 extern void unregister_shrinker(struct shrinker *);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3a92ab3..3f12cec 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -358,7 +358,7 @@ static inline void memcg_kmem_set_active(struct mem_cgroup *memcg)
 	set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
 }
 
-static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
+bool memcg_kmem_is_active(struct mem_cgroup *memcg)
 {
 	return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
 }
@@ -1333,6 +1333,26 @@ out:
 	return lruvec;
 }
 
+unsigned long mem_cgroup_zone_reclaimable_pages(struct zone *zone,
+						struct mem_cgroup *memcg)
+{
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+	unsigned long nr = 0;
+	struct mem_cgroup *iter;
+
+	iter = memcg;
+	do {
+		nr += mem_cgroup_zone_nr_lru_pages(iter, nid, zid,
+						   LRU_ALL_FILE);
+		if (do_swap_account)
+			nr += mem_cgroup_zone_nr_lru_pages(iter, nid, zid,
+							   LRU_ALL_ANON);
+		iter = mem_cgroup_iter(memcg, iter, NULL);
+	} while (iter);
+	return nr;
+}
+
 /*
  * Following LRU functions are allowed to be used without PCG_LOCK.
  * Operations are called by routine of global LRU independently from memcg.
@@ -2959,6 +2979,21 @@ static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
 		(memcg->kmem_account_flags & KMEM_ACCOUNTED_MASK);
 }
 
+bool memcg_kmem_should_reclaim(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *iter;
+
+	iter = memcg;
+	do {
+		if (memcg_kmem_is_active(iter)) {
+			mem_cgroup_iter_break(memcg, iter);
+			return true;
+		}
+		iter = mem_cgroup_iter(memcg, iter, NULL);
+	} while (iter);
+	return false;
+}
+
 /*
  * helper for acessing a memcg's index. It will be used as an index in the
  * child cache array in kmem_cache, and also to derive its name. This function
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7601b95..04df967 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -225,7 +225,7 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
 	unsigned long long delta;
 	long total_scan;
 	long max_pass;
-	long nr;
+	long nr = 0;
 	long new_nr;
 	int nid = shrinkctl->nid;
 	long batch_size = shrinker->batch ? shrinker->batch
@@ -236,11 +236,17 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
 		return 0;
 
 	/*
-	 * copy the current shrinker scan count into a local variable
-	 * and zero it so that other concurrent shrinker invocations
-	 * don't also do this scanning work.
+	 * Do not touch global counter of deferred objects on memcg pressure to
+	 * avoid isolation issues. Ideally the counter should be per-memcg.
 	 */
-	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
+	if (!shrinkctl->target_mem_cgroup) {
+		/*
+		 * copy the current shrinker scan count into a local variable
+		 * and zero it so that other concurrent shrinker invocations
+		 * don't also do this scanning work.
+		 */
+		nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
+	}
 
 	total_scan = nr;
 	delta = (4 * fraction) / shrinker->seeks;
@@ -296,21 +302,46 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
 		cond_resched();
 	}
 
-	/*
-	 * move the unused scan count back into the shrinker in a
-	 * manner that handles concurrent updates. If we exhausted the
-	 * scan, there is no need to do an update.
-	 */
-	if (total_scan > 0)
-		new_nr = atomic_long_add_return(total_scan,
+	if (!shrinkctl->target_mem_cgroup) {
+		/*
+		 * move the unused scan count back into the shrinker in a
+		 * manner that handles concurrent updates. If we exhausted the
+		 * scan, there is no need to do an update.
+		 */
+		if (total_scan > 0)
+			new_nr = atomic_long_add_return(total_scan,
 						&shrinker->nr_deferred[nid]);
-	else
-		new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
+		else
+			new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
+	}
 
 	trace_mm_shrink_slab_end(shrinker, freed, nr, new_nr);
 	return freed;
 }
 
+static unsigned long
+shrink_slab_memcg(struct shrink_control *shrinkctl, struct shrinker *shrinker,
+		  unsigned long fraction, unsigned long denominator)
+{
+	unsigned long freed = 0;
+
+	if (shrinkctl->memcg && !memcg_kmem_is_active(shrinkctl->memcg))
+		return 0;
+
+	for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
+		if (!node_online(shrinkctl->nid))
+			continue;
+
+		if (!(shrinker->flags & SHRINKER_NUMA_AWARE) &&
+		    (shrinkctl->nid != 0))
+			break;
+
+		freed += shrink_slab_node(shrinkctl, shrinker,
+					  fraction, denominator);
+	}
+	return freed;
+}
+
 /*
  * Call the shrink functions to age shrinkable caches
  *
@@ -352,18 +383,23 @@ unsigned long shrink_slab(struct shrink_control *shrinkctl,
 	}
 
 	list_for_each_entry(shrinker, &shrinker_list, list) {
-		for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
-			if (!node_online(shrinkctl->nid))
-				continue;
-
-			if (!(shrinker->flags & SHRINKER_NUMA_AWARE) &&
-			    (shrinkctl->nid != 0))
+		shrinkctl->memcg = shrinkctl->target_mem_cgroup;
+		do {
+			if (!(shrinker->flags & SHRINKER_MEMCG_AWARE) &&
+			    (shrinkctl->memcg != NULL)) {
+				mem_cgroup_iter_break(
+						shrinkctl->target_mem_cgroup,
+						shrinkctl->memcg);
 				break;
+			}
 
-			freed += shrink_slab_node(shrinkctl, shrinker,
-						  fraction, denominator);
+			freed += shrink_slab_memcg(shrinkctl, shrinker,
+						   fraction, denominator);
 
-		}
+			shrinkctl->memcg = mem_cgroup_iter(
+						shrinkctl->target_mem_cgroup,
+						shrinkctl->memcg, NULL);
+		} while (shrinkctl->memcg);
 	}
 	up_read(&shrinker_rwsem);
 out:
@@ -2285,6 +2321,7 @@ static bool shrink_zones(struct zonelist *zonelist,
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	struct shrink_control shrink = {
 		.gfp_mask = sc->gfp_mask,
+		.target_mem_cgroup = sc->target_mem_cgroup,
 	};
 
 	/*
@@ -2301,17 +2338,22 @@ static bool shrink_zones(struct zonelist *zonelist,
 					gfp_zone(sc->gfp_mask), sc->nodemask) {
 		if (!populated_zone(zone))
 			continue;
+
+		if (global_reclaim(sc) &&
+		    !cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
+			continue;
+
+		lru_pages += global_reclaim(sc) ?
+				zone_reclaimable_pages(zone) :
+				mem_cgroup_zone_reclaimable_pages(zone,
+						sc->target_mem_cgroup);
+		node_set(zone_to_nid(zone), shrink.nodes_to_scan);
+
 		/*
 		 * Take care memory controller reclaiming has small influence
 		 * to global LRU.
 		 */
 		if (global_reclaim(sc)) {
-			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
-				continue;
-
-			lru_pages += zone_reclaimable_pages(zone);
-			node_set(zone_to_nid(zone), shrink.nodes_to_scan);
-
 			if (sc->priority != DEF_PRIORITY &&
 			    !zone_reclaimable(zone))
 				continue;	/* Let kswapd poll it */
@@ -2349,12 +2391,11 @@ static bool shrink_zones(struct zonelist *zonelist,
 	}
 
 	/*
-	 * Don't shrink slabs when reclaiming memory from over limit
-	 * cgroups but do shrink slab at least once when aborting
-	 * reclaim for compaction to avoid unevenly scanning file/anon
-	 * LRU pages over slab pages.
+	 * Shrink slabs at least once when aborting reclaim for compaction
+	 * to avoid unevenly scanning file/anon LRU pages over slab pages.
 	 */
-	if (global_reclaim(sc)) {
+	if (global_reclaim(sc) ||
+	    memcg_kmem_should_reclaim(sc->target_mem_cgroup)) {
 		shrink_slab(&shrink, sc->nr_scanned, lru_pages);
 		if (reclaim_state) {
 			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
@@ -2648,6 +2689,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 	struct zonelist *zonelist;
 	unsigned long nr_reclaimed;
 	int nid;
+	struct reclaim_state reclaim_state;
 	struct scan_control sc = {
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
@@ -2670,6 +2712,10 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 
 	zonelist = NODE_DATA(nid)->node_zonelists;
 
+	lockdep_set_current_reclaim_state(sc.gfp_mask);
+	reclaim_state.reclaimed_slab = 0;
+	current->reclaim_state = &reclaim_state;
+
 	trace_mm_vmscan_memcg_reclaim_begin(0,
 					    sc.may_writepage,
 					    sc.gfp_mask);
@@ -2678,6 +2724,9 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 
 	trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
 
+	current->reclaim_state = NULL;
+	lockdep_clear_current_reclaim_state();
+
 	return nr_reclaimed;
 }
 #endif
-- 
1.7.10.4


* [PATCH v12 10/18] memcg,list_lru: add per-memcg LRU list infrastructure
  2013-12-02 11:19 [PATCH v12 00/18] kmemcg shrinkers Vladimir Davydov
                   ` (8 preceding siblings ...)
  2013-12-02 11:19 ` [PATCH v12 09/18] vmscan: shrink slab on memcg pressure Vladimir Davydov
@ 2013-12-02 11:19 ` Vladimir Davydov
  2013-12-03 11:18   ` Dave Chinner
  2013-12-02 11:19 ` [PATCH v12 11/18] memcg,list_lru: add function walking over all lists of a per-memcg LRU Vladimir Davydov
                   ` (8 subsequent siblings)
  18 siblings, 1 reply; 38+ messages in thread
From: Vladimir Davydov @ 2013-12-02 11:19 UTC (permalink / raw)
  To: hannes, mhocko, dchinner, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, vdavydov,
	Al Viro, Balbir Singh, KAMEZAWA Hiroyuki

FS shrinkers, which shrink dcache and icache, keep dentries and inodes
on list_lru structures in order to evict least recently used objects.
With the per-memcg kmem shrinking infrastructure introduced, we have to
make those LRU lists per-memcg so that FS caches belonging to different
memory cgroups can be shrunk independently.

This patch addresses the issue by introducing struct memcg_list_lru.
This struct aggregates a list_lru object for each kmem-active memcg and
keeps the set up to date whenever a memcg is created or destroyed. Its
interface is very simple: it only allows getting a pointer to the
appropriate list_lru object from a memcg or a kmem pointer, which can
then be operated on with the conventional list_lru methods.
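
For illustration, a minimal usage sketch of the interface might look as
follows (this is not part of the patch; the cache structure and the
my_* names are made up):

/* hypothetical cache whose objects sit on a memcg-aware LRU */
struct my_cache {
	struct memcg_list_lru lru;
};

struct my_object {
	struct list_head lru_node;
	/* ... payload ... */
};

/* add an object to the LRU of the memcg it is accounted to */
static void my_object_lru_add(struct my_cache *cache, struct my_object *obj)
{
	struct list_lru *lru;

	lru = mem_cgroup_kmem_list_lru(&cache->lru, obj);
	list_lru_add(lru, &obj->lru_node);
}

/* count objects accounted to @memcg on node @nid */
static unsigned long my_cache_count(struct my_cache *cache,
				    struct mem_cgroup *memcg, int nid)
{
	struct list_lru *lru;

	lru = mem_cgroup_list_lru(&cache->lru, memcg);
	return list_lru_count_node(lru, nid);
}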

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/list_lru.h |   62 ++++++++++
 mm/memcontrol.c          |  299 +++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 356 insertions(+), 5 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 3ce5417..2ad0bc6 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -10,6 +10,8 @@
 #include <linux/list.h>
 #include <linux/nodemask.h>
 
+struct mem_cgroup;
+
 /* list_lru_walk_cb has to always return one of those */
 enum lru_status {
 	LRU_REMOVED,		/* item removed from list */
@@ -31,6 +33,33 @@ struct list_lru {
 	nodemask_t		active_nodes;
 };
 
+/*
+ * The following structure can be used to reclaim kmem objects accounted to
+ * different memory cgroups independently. It aggregates a set of list_lru
+ * objects, one for each kmem-enabled memcg, and provides the method to get
+ * the lru corresponding to a memcg.
+ */
+struct memcg_list_lru {
+	struct list_lru global_lru;
+
+#ifdef CONFIG_MEMCG_KMEM
+	struct list_lru **memcg_lrus;	/* rcu-protected array of per-memcg
+					   lrus, indexed by memcg_cache_id() */
+
+	struct list_head list;		/* list of all memcg-aware lrus */
+
+	/*
+	 * The memcg_lrus array is rcu protected, so we can only free it after
+	 * a call to synchronize_rcu(). To avoid multiple calls to
+	 * synchronize_rcu() when many lrus get updated at the same time, which
+	 * is a typical scenario, we will store the pointer to the previous
+	 * version of the array in the old_lrus variable for each lru, and then
+	 * free them all at once after a single call to synchronize_rcu().
+	 */
+	void *old_lrus;
+#endif
+};
+
 void list_lru_destroy(struct list_lru *lru);
 int list_lru_init(struct list_lru *lru);
 
@@ -128,4 +157,37 @@ list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
 	}
 	return isolated;
 }
+
+#ifdef CONFIG_MEMCG_KMEM
+int memcg_list_lru_init(struct memcg_list_lru *lru);
+void memcg_list_lru_destroy(struct memcg_list_lru *lru);
+
+struct list_lru *mem_cgroup_list_lru(struct memcg_list_lru *lru,
+				     struct mem_cgroup *memcg);
+struct list_lru *mem_cgroup_kmem_list_lru(struct memcg_list_lru *lru,
+					  void *ptr);
+#else
+static inline int memcg_list_lru_init(struct memcg_list_lru *lru)
+{
+	return list_lru_init(&lru->global_lru);
+}
+
+static inline void memcg_list_lru_destroy(struct memcg_list_lru *lru)
+{
+	list_lru_destroy(&lru->global_lru);
+}
+
+static inline struct list_lru *
+mem_cgroup_list_lru(struct memcg_list_lru *lru, struct mem_cgroup *memcg)
+{
+	return &lru->global_lru;
+}
+
+static inline struct list_lru *
+mem_cgroup_kmem_list_lru(struct memcg_list_lru *lru, void *ptr)
+{
+	return &lru->global_lru;
+}
+#endif /* CONFIG_MEMCG_KMEM */
+
 #endif /* _LRU_LIST_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3f12cec..253e01e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -55,6 +55,7 @@
 #include <linux/cpu.h>
 #include <linux/oom.h>
 #include <linux/lockdep.h>
+#include <linux/list_lru.h>
 #include "internal.h"
 #include <net/sock.h>
 #include <net/ip.h>
@@ -3065,6 +3066,280 @@ static inline void memcg_resume_kmem_account(void)
 }
 
 /*
+ * The list of all memcg_list_lru objects. Protected by memcg_create_mutex.
+ * Needed for updating all per-memcg lrus whenever a new kmem-enabled memcg
+ * is created or destroyed.
+ */
+static LIST_HEAD(all_per_memcg_lrus);
+
+/* helper to allocate a list_lru for a memcg in a per-memcg lru */
+static int alloc_memcg_lru(struct memcg_list_lru *lru, int memcg_id)
+{
+	int err;
+	struct list_lru *memcg_lru;
+
+	memcg_lru = kmalloc(sizeof(*memcg_lru), GFP_KERNEL);
+	if (!memcg_lru)
+		return -ENOMEM;
+
+	err = list_lru_init(memcg_lru);
+	if (err) {
+		kfree(memcg_lru);
+		return err;
+	}
+
+	VM_BUG_ON(lru->memcg_lrus[memcg_id]);
+	lru->memcg_lrus[memcg_id] = memcg_lru;
+	return 0;
+}
+
+/* helper to free the memcg list_lru in a per-memcg lru */
+static void free_memcg_lru(struct memcg_list_lru *lru, int memcg_id)
+{
+	struct list_lru *memcg_lru;
+
+	memcg_lru = lru->memcg_lrus[memcg_id];
+	if (memcg_lru) {
+		list_lru_destroy(memcg_lru);
+		kfree(memcg_lru);
+		lru->memcg_lrus[memcg_id] = NULL;
+	}
+}
+
+/*
+ * Grows a per-memcg lru to accommodate list_lrus for new_num_memcgs memory
+ * cgroups. Is called for each per-memcg lru whenever a new kmem-enabled memcg
+ * is added and we need to update the caches array. It will keep the old array
+ * of pointers to per-memcg lrus in old_lrus to be freed after a call to
+ * synchronize_rcu().
+ */
+static int memcg_list_lru_grow(struct memcg_list_lru *lru, int new_num_memcgs)
+{
+	struct list_lru **new_lrus;
+
+	new_lrus = kcalloc(memcg_caches_array_size(new_num_memcgs),
+			   sizeof(*new_lrus), GFP_KERNEL);
+	if (!new_lrus)
+		return -ENOMEM;
+
+	if (lru->memcg_lrus) {
+		int i;
+
+		for_each_memcg_cache_index(i) {
+			if (lru->memcg_lrus[i])
+				new_lrus[i] = lru->memcg_lrus[i];
+		}
+	}
+
+	VM_BUG_ON(lru->old_lrus);
+	lru->old_lrus = lru->memcg_lrus;
+	rcu_assign_pointer(lru->memcg_lrus, new_lrus);
+	return 0;
+}
+
+static void __memcg_destroy_all_lrus(int memcg_id)
+{
+	struct memcg_list_lru *lru;
+
+	list_for_each_entry(lru, &all_per_memcg_lrus, list)
+		free_memcg_lru(lru, memcg_id);
+}
+
+/*
+ * Frees list_lrus corresponding to a memcg in all per-memcg lrus. Is called
+ * when a memcg is destroyed.
+ */
+static void memcg_destroy_all_lrus(struct mem_cgroup *memcg)
+{
+	int memcg_id;
+
+	memcg_id = memcg_cache_id(memcg);
+	if (memcg_id >= 0) {
+		mutex_lock(&memcg_create_mutex);
+		__memcg_destroy_all_lrus(memcg_id);
+		mutex_unlock(&memcg_create_mutex);
+	}
+}
+
+/*
+ * This function is called when a new kmem-enabled memcg is added. It
+ * initializes list_lrus corresponding to the memcg in all per-memcg lrus
+ * growing them if necessary.
+ */
+static int memcg_init_all_lrus(int new_memcg_id)
+{
+	int err = 0;
+	struct memcg_list_lru *lru;
+	int num_memcgs = new_memcg_id + 1;
+	int grow = (num_memcgs > memcg_limited_groups_array_size);
+
+	memcg_stop_kmem_account();
+	if (grow) {
+		list_for_each_entry(lru, &all_per_memcg_lrus, list) {
+			err = memcg_list_lru_grow(lru, num_memcgs);
+			if (err)
+				goto free_old_lrus;
+		}
+	}
+	list_for_each_entry(lru, &all_per_memcg_lrus, list) {
+		err = alloc_memcg_lru(lru, new_memcg_id);
+		if (err) {
+			__memcg_destroy_all_lrus(new_memcg_id);
+			break;
+		}
+	}
+free_old_lrus:
+	if (grow) {
+		/* free previous versions of memcg_lrus arrays */
+		synchronize_rcu();
+		list_for_each_entry(lru, &all_per_memcg_lrus, list) {
+			kfree(lru->old_lrus);
+			lru->old_lrus = NULL;
+		}
+	}
+	memcg_resume_kmem_account();
+	return err;
+}
+
+static void __memcg_list_lru_destroy(struct memcg_list_lru *lru)
+{
+	int i;
+
+	if (lru->memcg_lrus) {
+		for_each_memcg_cache_index(i)
+			free_memcg_lru(lru, i);
+	}
+}
+
+static int __memcg_list_lru_init(struct memcg_list_lru *lru)
+{
+	int err = 0;
+	struct mem_cgroup *memcg;
+
+	lru->memcg_lrus = NULL;
+	lru->old_lrus = NULL;
+
+	if (!memcg_kmem_enabled())
+		return 0;
+
+	memcg_stop_kmem_account();
+	lru->memcg_lrus = kcalloc(memcg_limited_groups_array_size,
+				  sizeof(*lru->memcg_lrus), GFP_KERNEL);
+	if (!lru->memcg_lrus) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	for_each_mem_cgroup(memcg) {
+		int memcg_id;
+
+		memcg_id = memcg_cache_id(memcg);
+		if (memcg_id < 0)
+			continue;
+
+		err = alloc_memcg_lru(lru, memcg_id);
+		if (err) {
+			mem_cgroup_iter_break(NULL, memcg);
+			break;
+		}
+	}
+out:
+	if (err)
+		__memcg_list_lru_destroy(lru);
+	memcg_resume_kmem_account();
+	return err;
+}
+
+int memcg_list_lru_init(struct memcg_list_lru *lru)
+{
+	int err;
+
+	err = list_lru_init(&lru->global_lru);
+	if (err)
+		return err;
+
+	mutex_lock(&memcg_create_mutex);
+	err = __memcg_list_lru_init(lru);
+	if (!err)
+		list_add(&lru->list, &all_per_memcg_lrus);
+	mutex_unlock(&memcg_create_mutex);
+
+	if (err)
+		list_lru_destroy(&lru->global_lru);
+	return err;
+}
+
+void memcg_list_lru_destroy(struct memcg_list_lru *lru)
+{
+	mutex_lock(&memcg_create_mutex);
+	__memcg_list_lru_destroy(lru);
+	list_del(&lru->list);
+	mutex_unlock(&memcg_create_mutex);
+
+	list_lru_destroy(&lru->global_lru);
+}
+
+/**
+ * mem_cgroup_list_lru - get the lru list for a memcg
+ * @lru: per-memcg lru
+ * @memcg: memcg of the wanted lru list
+ *
+ * Returns the lru list corresponding to the given @memcg in the per-memcg
+ * @lru. If @memcg is NULL, the lru corresponding to the root is returned.
+ *
+ * For each kmem-active memcg, there is a dedicated lru list while all
+ * kmem-inactive memcgs (including the root cgroup) share the same lru list
+ * in each per-memcg lru.
+ */
+struct list_lru *mem_cgroup_list_lru(struct memcg_list_lru *lru,
+				     struct mem_cgroup *memcg)
+{
+	struct list_lru **memcg_lrus;
+	struct list_lru *memcg_lru;
+	int memcg_id;
+
+	memcg_id = memcg_cache_id(memcg);
+	if (memcg_id < 0)
+		return &lru->global_lru;
+
+	rcu_read_lock();
+	memcg_lrus = rcu_dereference(lru->memcg_lrus);
+	memcg_lru = memcg_lrus[memcg_id];
+	rcu_read_unlock();
+
+	return memcg_lru;
+}
+
+/**
+ * mem_cgroup_kmem_list_lru - get the lru list to store a kmem object
+ * @lru: per-memcg lru
+ * @ptr: pointer to the kmem object
+ *
+ * Returns the lru list corresponding to the memcg the given kmem object @ptr
+ * is accounted to in the per-memcg @lru.
+ *
+ * For each kmem-active memcg, there is a dedicated lru list while all
+ * kmem-inactive memcgs (including the root cgroup) share the same lru list
+ * in each per-memcg lru.
+ */
+struct list_lru *mem_cgroup_kmem_list_lru(struct memcg_list_lru *lru,
+					  void *ptr)
+{
+	struct page *page = virt_to_page(ptr);
+	struct mem_cgroup *memcg = NULL;
+	struct page_cgroup *pc;
+
+	pc = lookup_page_cgroup(compound_head(page));
+	if (PageCgroupUsed(pc)) {
+		lock_page_cgroup(pc);
+		if (PageCgroupUsed(pc))
+			memcg = pc->mem_cgroup;
+		unlock_page_cgroup(pc);
+	}
+	return mem_cgroup_list_lru(lru, memcg);
+}
+
+/*
  * This is a bit cumbersome, but it is rarely used and avoids a backpointer
  * in the memcg_cache_params struct.
  */
@@ -3195,15 +3470,28 @@ int memcg_update_cache_sizes(struct mem_cgroup *memcg)
 	 */
 	memcg_kmem_set_activated(memcg);
 
+	/*
+	 * We need to init the memcg lru lists before we update the caches.
+	 * Once the caches are updated, they will be able to start hosting
+	 * objects. If a cache is created very quickly and an element is used
+	 * and disposed to the lru quickly as well, we can end up with a NULL
+	 * pointer dereference while trying to add a new element to a memcg
+	 * lru.
+	 */
+	ret = memcg_init_all_lrus(num);
+	if (ret)
+		goto out;
+
 	ret = memcg_update_all_caches(num+1);
-	if (ret) {
-		ida_simple_remove(&kmem_limited_groups, num);
-		memcg_kmem_clear_activated(memcg);
-		return ret;
-	}
+	if (ret)
+		goto out;
 
 	memcg->kmemcg_id = num;
 	return 0;
+out:
+	ida_simple_remove(&kmem_limited_groups, num);
+	memcg_kmem_clear_activated(memcg);
+	return ret;
 }
 
 /*
@@ -5955,6 +6243,7 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 static void memcg_destroy_kmem(struct mem_cgroup *memcg)
 {
 	mem_cgroup_sockets_destroy(memcg);
+	memcg_destroy_all_lrus(memcg);
 }
 
 static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v12 11/18] memcg,list_lru: add function walking over all lists of a per-memcg LRU
  2013-12-02 11:19 [PATCH v12 00/18] kmemcg shrinkers Vladimir Davydov
                   ` (9 preceding siblings ...)
  2013-12-02 11:19 ` [PATCH v12 10/18] memcg,list_lru: add per-memcg LRU list infrastructure Vladimir Davydov
@ 2013-12-02 11:19 ` Vladimir Davydov
  2013-12-02 11:19 ` [PATCH v12 12/18] fs: make icache, dcache shrinkers memcg-aware Vladimir Davydov
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 38+ messages in thread
From: Vladimir Davydov @ 2013-12-02 11:19 UTC (permalink / raw)
  To: hannes, mhocko, dchinner, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, vdavydov,
	Al Viro, Balbir Singh, KAMEZAWA Hiroyuki

Sometimes it can be necessary to iterate over all memcgs' lists of the
same memcg-aware LRU. For example, shrink_dcache_sb() should prune all
dentries no matter which memory cgroup they belong to. The current
interface to struct memcg_list_lru, however, only allows per-memcg LRU
walks. This patch adds the special method memcg_list_lru_walk_all(),
which provides the required functionality. Note that this function does
not guarantee that the elements will be processed in true
least-recently-used order; in fact, it simply enumerates all
kmem-active memcgs and calls list_lru_walk() for each of them. That is
sufficient for shrink_dcache_sb(), which is going to be the only user
of this function.
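
For example, dropping every object on such an LRU - roughly what the
shrink_dcache_sb() conversion in the next patch does - could be sketched
as follows (the my_* callback and the dispose handling are simplified
stand-ins, not part of the patch):

static enum lru_status my_isolate(struct list_head *item,
				  spinlock_t *lru_lock, void *cb_arg)
{
	struct list_head *dispose = cb_arg;

	/* move the item to a private list for later disposal */
	list_move(item, dispose);
	return LRU_REMOVED;
}

static void my_drop_all(struct memcg_list_lru *lru)
{
	LIST_HEAD(dispose);

	/* walks the global list plus the list of every kmem-active memcg */
	memcg_list_lru_walk_all(lru, my_isolate, &dispose, UINT_MAX);

	/* ... free everything left on the dispose list ... */
}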

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/list_lru.h |   21 ++++++++++++++++
 mm/memcontrol.c          |   60 +++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 80 insertions(+), 1 deletion(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 2ad0bc6..a9e078a 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -46,6 +46,16 @@ struct memcg_list_lru {
 	struct list_lru **memcg_lrus;	/* rcu-protected array of per-memcg
 					   lrus, indexed by memcg_cache_id() */
 
+	/*
+	 * When a memory cgroup is removed, all pointers to its list_lru
+	 * objects stored in memcg_lrus arrays are first marked as dead by
+	 * setting the lowest bit of the address while the actual data free
+	 * happens only after an rcu grace period. If a memcg_lrus reader,
+	 * which should be rcu-protected, faces a dead pointer, it won't
+	 * dereference it. This ensures there will be no use-after-free.
+	 */
+#define MEMCG_LIST_LRU_DEAD		1
+
 	struct list_head list;		/* list of all memcg-aware lrus */
 
 	/*
@@ -166,6 +176,10 @@ struct list_lru *mem_cgroup_list_lru(struct memcg_list_lru *lru,
 				     struct mem_cgroup *memcg);
 struct list_lru *mem_cgroup_kmem_list_lru(struct memcg_list_lru *lru,
 					  void *ptr);
+
+unsigned long
+memcg_list_lru_walk_all(struct memcg_list_lru *lru, list_lru_walk_cb isolate,
+			void *cb_arg, unsigned long nr_to_walk);
 #else
 static inline int memcg_list_lru_init(struct memcg_list_lru *lru)
 {
@@ -188,6 +202,13 @@ mem_cgroup_kmem_list_lru(struct memcg_list_lru *lru, void *ptr)
 {
 	return &lru->global_lru;
 }
+
+static inline unsigned long
+memcg_list_lru_walk_all(struct memcg_list_lru *lru, list_lru_walk_cb isolate,
+			void *cb_arg, unsigned long nr_to_walk)
+{
+	return list_lru_walk(&lru->global_lru, isolate, cb_arg, nr_to_walk);
+}
 #endif /* CONFIG_MEMCG_KMEM */
 
 #endif /* _LRU_LIST_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 253e01e..da06f91 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3088,6 +3088,7 @@ static int alloc_memcg_lru(struct memcg_list_lru *lru, int memcg_id)
 		return err;
 	}
 
+	smp_wmb();
 	VM_BUG_ON(lru->memcg_lrus[memcg_id]);
 	lru->memcg_lrus[memcg_id] = memcg_lru;
 	return 0;
@@ -3098,7 +3099,8 @@ static void free_memcg_lru(struct memcg_list_lru *lru, int memcg_id)
 {
 	struct list_lru *memcg_lru;
 
-	memcg_lru = lru->memcg_lrus[memcg_id];
+	memcg_lru = (void *)((unsigned long)lru->memcg_lrus[memcg_id] &
+			     ~MEMCG_LIST_LRU_DEAD);
 	if (memcg_lru) {
 		list_lru_destroy(memcg_lru);
 		kfree(memcg_lru);
@@ -3106,6 +3108,16 @@ static void free_memcg_lru(struct memcg_list_lru *lru, int memcg_id)
 	}
 }
 
+static void memcg_lru_mark_dead(struct memcg_list_lru *lru, int memcg_id)
+{
+	struct list_lru *memcg_lru;
+
+	memcg_lru = lru->memcg_lrus[memcg_id];
+	if (memcg_lru)
+		lru->memcg_lrus[memcg_id] = (void *)((unsigned long)memcg_lru |
+						     MEMCG_LIST_LRU_DEAD);
+}
+
 /*
  * Grows a per-memcg lru to accommodate list_lrus for new_num_memcgs memory
  * cgroups. Is called for each per-memcg lru whenever a new kmem-enabled memcg
@@ -3141,6 +3153,17 @@ static void __memcg_destroy_all_lrus(int memcg_id)
 {
 	struct memcg_list_lru *lru;
 
+	/*
+	 * Mark all lru lists of this memcg as dead and free them only after a
+	 * grace period. This is to prevent functions iterating over memcg_lrus
+	 * arrays from dereferencing pointers pointing to already freed data
+	 * (see memcg_list_lru_walk_all()).
+	 */
+	list_for_each_entry(lru, &all_per_memcg_lrus, list)
+		memcg_lru_mark_dead(lru, memcg_id);
+
+	synchronize_rcu();
+
 	list_for_each_entry(lru, &all_per_memcg_lrus, list)
 		free_memcg_lru(lru, memcg_id);
 }
@@ -3340,6 +3363,41 @@ struct list_lru *mem_cgroup_kmem_list_lru(struct memcg_list_lru *lru,
 }
 
 /*
+ * This function calls the list_lru_walk() function for each list_lru
+ * comprising a per-memcg lru. It may be useful if one wants to scan all
+ * elements of a per-memcg lru, no matter in which order.
+ */
+unsigned long
+memcg_list_lru_walk_all(struct memcg_list_lru *lru, list_lru_walk_cb isolate,
+			void *cb_arg, unsigned long nr_to_walk)
+{
+	int i;
+	unsigned long isolated;
+	struct list_lru *memcg_lru;
+	struct list_lru **memcg_lrus;
+
+	isolated = list_lru_walk(&lru->global_lru, isolate, cb_arg, nr_to_walk);
+
+	rcu_read_lock();
+	memcg_lrus = rcu_dereference(lru->memcg_lrus);
+	for_each_memcg_cache_index(i) {
+		memcg_lru = memcg_lrus[i];
+		if (!memcg_lru)
+			continue;
+
+		if ((unsigned long)memcg_lru & MEMCG_LIST_LRU_DEAD)
+			continue;
+
+		smp_read_barrier_depends();
+		isolated += list_lru_walk(memcg_lru,
+					  isolate, cb_arg, nr_to_walk);
+	}
+	rcu_read_unlock();
+
+	return isolated;
+}
+
+/*
  * This is a bit cumbersome, but it is rarely used and avoids a backpointer
  * in the memcg_cache_params struct.
  */
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v12 12/18] fs: make icache, dcache shrinkers memcg-aware
  2013-12-02 11:19 [PATCH v12 00/18] kmemcg shrinkers Vladimir Davydov
                   ` (10 preceding siblings ...)
  2013-12-02 11:19 ` [PATCH v12 11/18] memcg,list_lru: add function walking over all lists of a per-memcg LRU Vladimir Davydov
@ 2013-12-02 11:19 ` Vladimir Davydov
  2013-12-03 11:45   ` Dave Chinner
  2013-12-02 11:19 ` [PATCH v12 13/18] memcg: per-memcg kmem shrinking Vladimir Davydov
                   ` (6 subsequent siblings)
  18 siblings, 1 reply; 38+ messages in thread
From: Vladimir Davydov @ 2013-12-02 11:19 UTC (permalink / raw)
  To: hannes, mhocko, dchinner, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, vdavydov,
	Al Viro, Balbir Singh, KAMEZAWA Hiroyuki

Using the per-memcg LRU infrastructure introduced by the previous
patches, this patch makes the dcache and icache shrinkers memcg-aware.
To achieve that, it converts s_dentry_lru and s_inode_lru from list_lru
to memcg_list_lru and restricts reclaim to the per-memcg parts of the
lists under memcg pressure.

Other FS objects are currently ignored and only reclaimed on global
pressure, because their shrinkers are heavily FS-specific and cannot be
made memcg-aware so easily. However, we could pass the target memcg on
to the FS layer and let it decide whether its per-memcg objects should
be reclaimed.

Note that with this patch applied we lose global LRU order, but that
does not appear to be a critical drawback, because global pressure
should try to balance the amount reclaimed from all memcgs. On the
other hand, preserving global LRU order would require an extra
list_head added to each dentry and inode, which seems too costly.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 fs/dcache.c        |   25 +++++++++++++++----------
 fs/inode.c         |   16 ++++++++++------
 fs/internal.h      |    9 +++++----
 fs/super.c         |   43 +++++++++++++++++++++++++------------------
 include/linux/fs.h |    4 ++--
 5 files changed, 57 insertions(+), 40 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 4bdb300..e8499db 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -343,18 +343,24 @@ static void dentry_unlink_inode(struct dentry * dentry)
 #define D_FLAG_VERIFY(dentry,x) WARN_ON_ONCE(((dentry)->d_flags & (DCACHE_LRU_LIST | DCACHE_SHRINK_LIST)) != (x))
 static void d_lru_add(struct dentry *dentry)
 {
+	struct list_lru *lru =
+		mem_cgroup_kmem_list_lru(&dentry->d_sb->s_dentry_lru, dentry);
+
 	D_FLAG_VERIFY(dentry, 0);
 	dentry->d_flags |= DCACHE_LRU_LIST;
 	this_cpu_inc(nr_dentry_unused);
-	WARN_ON_ONCE(!list_lru_add(&dentry->d_sb->s_dentry_lru, &dentry->d_lru));
+	WARN_ON_ONCE(!list_lru_add(lru, &dentry->d_lru));
 }
 
 static void d_lru_del(struct dentry *dentry)
 {
+	struct list_lru *lru =
+		mem_cgroup_kmem_list_lru(&dentry->d_sb->s_dentry_lru, dentry);
+
 	D_FLAG_VERIFY(dentry, DCACHE_LRU_LIST);
 	dentry->d_flags &= ~DCACHE_LRU_LIST;
 	this_cpu_dec(nr_dentry_unused);
-	WARN_ON_ONCE(!list_lru_del(&dentry->d_sb->s_dentry_lru, &dentry->d_lru));
+	WARN_ON_ONCE(!list_lru_del(lru, &dentry->d_lru));
 }
 
 static void d_shrink_del(struct dentry *dentry)
@@ -970,9 +976,9 @@ dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg)
 }
 
 /**
- * prune_dcache_sb - shrink the dcache
- * @sb: superblock
- * @nr_to_scan : number of entries to try to free
+ * prune_dcache_lru - shrink the dcache
+ * @lru: dentry lru list
+ * @nr_to_scan: number of entries to try to free
  * @nid: which node to scan for freeable entities
  *
  * Attempt to shrink the superblock dcache LRU by @nr_to_scan entries. This is
@@ -982,14 +988,13 @@ dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg)
  * This function may fail to free any resources if all the dentries are in
  * use.
  */
-long prune_dcache_sb(struct super_block *sb, unsigned long nr_to_scan,
-		     int nid)
+long prune_dcache_lru(struct list_lru *lru, unsigned long nr_to_scan, int nid)
 {
 	LIST_HEAD(dispose);
 	long freed;
 
-	freed = list_lru_walk_node(&sb->s_dentry_lru, nid, dentry_lru_isolate,
-				       &dispose, &nr_to_scan);
+	freed = list_lru_walk_node(lru, nid, dentry_lru_isolate,
+				   &dispose, &nr_to_scan);
 	shrink_dentry_list(&dispose);
 	return freed;
 }
@@ -1029,7 +1034,7 @@ void shrink_dcache_sb(struct super_block *sb)
 	do {
 		LIST_HEAD(dispose);
 
-		freed = list_lru_walk(&sb->s_dentry_lru,
+		freed = memcg_list_lru_walk_all(&sb->s_dentry_lru,
 			dentry_lru_isolate_shrink, &dispose, UINT_MAX);
 
 		this_cpu_sub(nr_dentry_unused, freed);
diff --git a/fs/inode.c b/fs/inode.c
index 4bcdad3..f06a963 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -402,7 +402,10 @@ EXPORT_SYMBOL(ihold);
 
 static void inode_lru_list_add(struct inode *inode)
 {
-	if (list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru))
+	struct list_lru *lru =
+		mem_cgroup_kmem_list_lru(&inode->i_sb->s_inode_lru, inode);
+
+	if (list_lru_add(lru, &inode->i_lru))
 		this_cpu_inc(nr_unused);
 }
 
@@ -421,8 +424,10 @@ void inode_add_lru(struct inode *inode)
 
 static void inode_lru_list_del(struct inode *inode)
 {
+	struct list_lru *lru =
+		mem_cgroup_kmem_list_lru(&inode->i_sb->s_inode_lru, inode);
 
-	if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru))
+	if (list_lru_del(lru, &inode->i_lru))
 		this_cpu_dec(nr_unused);
 }
 
@@ -748,14 +753,13 @@ inode_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg)
  * to trim from the LRU. Inodes to be freed are moved to a temporary list and
  * then are freed outside inode_lock by dispose_list().
  */
-long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan,
-		     int nid)
+long prune_icache_lru(struct list_lru *lru, unsigned long nr_to_scan, int nid)
 {
 	LIST_HEAD(freeable);
 	long freed;
 
-	freed = list_lru_walk_node(&sb->s_inode_lru, nid, inode_lru_isolate,
-				       &freeable, &nr_to_scan);
+	freed = list_lru_walk_node(lru, nid, inode_lru_isolate,
+				   &freeable, &nr_to_scan);
 	dispose_list(&freeable);
 	return freed;
 }
diff --git a/fs/internal.h b/fs/internal.h
index 4657424..5d977f3 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -14,6 +14,7 @@ struct file_system_type;
 struct linux_binprm;
 struct path;
 struct mount;
+struct list_lru;
 
 /*
  * block_dev.c
@@ -107,8 +108,8 @@ extern int open_check_o_direct(struct file *f);
  * inode.c
  */
 extern spinlock_t inode_sb_list_lock;
-extern long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan,
-			    int nid);
+extern long prune_icache_lru(struct list_lru *lru,
+			     unsigned long nr_to_scan, int nid);
 extern void inode_add_lru(struct inode *inode);
 
 /*
@@ -125,8 +126,8 @@ extern int invalidate_inodes(struct super_block *, bool);
  */
 extern struct dentry *__d_alloc(struct super_block *, const struct qstr *);
 extern int d_set_mounted(struct dentry *dentry);
-extern long prune_dcache_sb(struct super_block *sb, unsigned long nr_to_scan,
-			    int nid);
+extern long prune_dcache_lru(struct list_lru *lru,
+			     unsigned long nr_to_scan, int nid);
 
 /*
  * read_write.c
diff --git a/fs/super.c b/fs/super.c
index cece164..b198da4 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -57,6 +57,8 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 				      struct shrink_control *sc)
 {
 	struct super_block *sb;
+	struct list_lru *inode_lru;
+	struct list_lru *dentry_lru;
 	long	fs_objects = 0;
 	long	total_objects;
 	long	freed = 0;
@@ -75,11 +77,14 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 	if (!grab_super_passive(sb))
 		return SHRINK_STOP;
 
-	if (sb->s_op->nr_cached_objects)
+	if (sb->s_op->nr_cached_objects && !sc->memcg)
 		fs_objects = sb->s_op->nr_cached_objects(sb, sc->nid);
 
-	inodes = list_lru_count_node(&sb->s_inode_lru, sc->nid);
-	dentries = list_lru_count_node(&sb->s_dentry_lru, sc->nid);
+	inode_lru = mem_cgroup_list_lru(&sb->s_inode_lru, sc->memcg);
+	dentry_lru = mem_cgroup_list_lru(&sb->s_dentry_lru, sc->memcg);
+
+	inodes = list_lru_count_node(inode_lru, sc->nid);
+	dentries = list_lru_count_node(dentry_lru, sc->nid);
 	total_objects = dentries + inodes + fs_objects + 1;
 
 	/* proportion the scan between the caches */
@@ -90,8 +95,8 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 	 * prune the dcache first as the icache is pinned by it, then
 	 * prune the icache, followed by the filesystem specific caches
 	 */
-	freed = prune_dcache_sb(sb, dentries, sc->nid);
-	freed += prune_icache_sb(sb, inodes, sc->nid);
+	freed = prune_dcache_lru(dentry_lru, dentries, sc->nid);
+	freed += prune_icache_lru(inode_lru, inodes, sc->nid);
 
 	if (fs_objects) {
 		fs_objects = mult_frac(sc->nr_to_scan, fs_objects,
@@ -108,6 +113,8 @@ static unsigned long super_cache_count(struct shrinker *shrink,
 				       struct shrink_control *sc)
 {
 	struct super_block *sb;
+	struct list_lru *inode_lru;
+	struct list_lru *dentry_lru;
 	long	total_objects = 0;
 
 	sb = container_of(shrink, struct super_block, s_shrink);
@@ -115,14 +122,14 @@ static unsigned long super_cache_count(struct shrinker *shrink,
 	if (!grab_super_passive(sb))
 		return 0;
 
-	if (sb->s_op && sb->s_op->nr_cached_objects)
-		total_objects = sb->s_op->nr_cached_objects(sb,
-						 sc->nid);
+	if (sb->s_op && sb->s_op->nr_cached_objects && !sc->memcg)
+		total_objects = sb->s_op->nr_cached_objects(sb, sc->nid);
+
+	inode_lru = mem_cgroup_list_lru(&sb->s_inode_lru, sc->memcg);
+	dentry_lru = mem_cgroup_list_lru(&sb->s_dentry_lru, sc->memcg);
 
-	total_objects += list_lru_count_node(&sb->s_dentry_lru,
-						 sc->nid);
-	total_objects += list_lru_count_node(&sb->s_inode_lru,
-						 sc->nid);
+	total_objects += list_lru_count_node(dentry_lru, sc->nid);
+	total_objects += list_lru_count_node(inode_lru, sc->nid);
 
 	total_objects = vfs_pressure_ratio(total_objects);
 	drop_super(sb);
@@ -138,8 +145,8 @@ static unsigned long super_cache_count(struct shrinker *shrink,
 static void destroy_super(struct super_block *s)
 {
 	int i;
-	list_lru_destroy(&s->s_dentry_lru);
-	list_lru_destroy(&s->s_inode_lru);
+	memcg_list_lru_destroy(&s->s_dentry_lru);
+	memcg_list_lru_destroy(&s->s_inode_lru);
 	for (i = 0; i < SB_FREEZE_LEVELS; i++)
 		percpu_counter_destroy(&s->s_writers.counter[i]);
 	security_sb_free(s);
@@ -183,10 +190,10 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 	INIT_HLIST_BL_HEAD(&s->s_anon);
 	INIT_LIST_HEAD(&s->s_inodes);
 
-	if (list_lru_init(&s->s_dentry_lru))
+	if (memcg_list_lru_init(&s->s_dentry_lru))
 		goto fail;
-	if (list_lru_init(&s->s_inode_lru)) {
-		list_lru_destroy(&s->s_dentry_lru);
+	if (memcg_list_lru_init(&s->s_inode_lru)) {
+		memcg_list_lru_destroy(&s->s_dentry_lru);
 		goto fail;
 	}
 
@@ -225,7 +232,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 	s->s_shrink.scan_objects = super_cache_scan;
 	s->s_shrink.count_objects = super_cache_count;
 	s->s_shrink.batch = 1024;
-	s->s_shrink.flags = SHRINKER_NUMA_AWARE;
+	s->s_shrink.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE;
 	return s;
 
 fail:
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 121f11f..8256a7e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1322,8 +1322,8 @@ struct super_block {
 	 * Keep the lru lists last in the structure so they always sit on their
 	 * own individual cachelines.
 	 */
-	struct list_lru		s_dentry_lru ____cacheline_aligned_in_smp;
-	struct list_lru		s_inode_lru ____cacheline_aligned_in_smp;
+	struct memcg_list_lru s_dentry_lru ____cacheline_aligned_in_smp;
+	struct memcg_list_lru s_inode_lru ____cacheline_aligned_in_smp;
 	struct rcu_head		rcu;
 };
 
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v12 13/18] memcg: per-memcg kmem shrinking
  2013-12-02 11:19 [PATCH v12 00/18] kmemcg shrinkers Vladimir Davydov
                   ` (11 preceding siblings ...)
  2013-12-02 11:19 ` [PATCH v12 12/18] fs: make icache, dcache shrinkers memcg-aware Vladimir Davydov
@ 2013-12-02 11:19 ` Vladimir Davydov
  2013-12-02 11:19 ` [PATCH v12 14/18] vmscan: take at least one pass with shrinkers Vladimir Davydov
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 38+ messages in thread
From: Vladimir Davydov @ 2013-12-02 11:19 UTC (permalink / raw)
  To: hannes, mhocko, dchinner, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, vdavydov,
	Mel Gorman, Rik van Riel, Balbir Singh, KAMEZAWA Hiroyuki

If a memory cgroup's kmem limit is less than its user memory limit, we
can run into a situation where kmem allocations fail, but freeing user
pages will buy us nothing. In such scenarios we would like to call a
specialized reclaimer that only frees kernel memory. This patch adds
it. All the magic lies in the previous patches, which made slab
shrinkers memcg-aware; this patch merely employs the shrink_slab()
facility to scan slab objects accounted to a specific memory cgroup.
There is, however, one thing worth noting.

The point is that since all memcg-aware shrinkers are FS-dependent, we
have no option other than to fail all GFP_NOFS allocations when we are
close to the kmem limit. The best we can do in such a situation is to
spawn the reclaimer in a background worker and hope that subsequent
allocations will succeed.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/swap.h |    2 ++
 mm/memcontrol.c      |   51 +++++++++++++++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c          |   38 ++++++++++++++++++++++++++++++++++++-
 3 files changed, 89 insertions(+), 2 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 46ba0c6..367a773 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -309,6 +309,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
 						  gfp_t gfp_mask, bool noswap);
+extern unsigned long try_to_free_mem_cgroup_kmem(struct mem_cgroup *mem,
+						 gfp_t gfp_mask);
 extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
 						gfp_t gfp_mask, bool noswap,
 						struct zone *zone,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index da06f91..3679acb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -323,6 +323,8 @@ struct mem_cgroup {
 	struct mutex slab_caches_mutex;
         /* Index in the kmem_cache->memcg_params->memcg_caches array */
 	int kmemcg_id;
+	/* when kmem shrinkers cannot proceed due to context */
+	struct work_struct kmem_shrink_work;
 #endif
 
 	int last_scanned_node;
@@ -3431,13 +3433,57 @@ static int mem_cgroup_slabinfo_read(struct cgroup_subsys_state *css,
 }
 #endif
 
+static void kmem_shrink_work_func(struct work_struct *w)
+{
+	struct mem_cgroup *memcg;
+
+	memcg = container_of(w, struct mem_cgroup, kmem_shrink_work);
+	try_to_free_mem_cgroup_kmem(memcg, GFP_KERNEL);
+}
+
+static int memcg_try_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
+{
+	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
+	struct mem_cgroup *mem_over_limit;
+	struct res_counter *fail_res;
+	int ret;
+
+	do {
+		ret = res_counter_charge(&memcg->kmem, size, &fail_res);
+		if (!ret)
+			break;
+
+		mem_over_limit = mem_cgroup_from_res_counter(fail_res, kmem);
+
+		/*
+		 * If we cannot wait, we have no option other than to fail
+		 * the current allocation and make room in the background,
+		 * hoping the next one will succeed.
+		 * hoping the next one will succeed.
+		 *
+		 * If we are in FS context, then although we can wait, we
+		 * cannot call the shrinkers, because most FS shrinkers will
+		 * not run without __GFP_FS to avoid deadlock.
+		 */
+		if (!(gfp & __GFP_WAIT) || !(gfp & __GFP_FS)) {
+			schedule_work(&mem_over_limit->kmem_shrink_work);
+			break;
+		}
+
+		if (!try_to_free_mem_cgroup_kmem(mem_over_limit, gfp))
+			break;
+	} while (--nr_retries > 0);
+
+	return ret;
+}
+
 static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
 {
 	struct res_counter *fail_res;
 	struct mem_cgroup *_memcg;
 	int ret = 0;
 
-	ret = res_counter_charge(&memcg->kmem, size, &fail_res);
+	ret = memcg_try_charge_kmem(memcg, gfp, size);
 	if (ret)
 		return ret;
 
@@ -6289,6 +6335,7 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 	int ret;
 
 	INIT_LIST_HEAD(&memcg->memcg_slab_caches);
+	INIT_WORK(&memcg->kmem_shrink_work, kmem_shrink_work_func);
 	mutex_init(&memcg->slab_caches_mutex);
 	memcg->kmemcg_id = -1;
 	ret = memcg_propagate_kmem(memcg);
@@ -6309,6 +6356,8 @@ static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
 	if (!memcg_kmem_is_active(memcg))
 		return;
 
+	cancel_work_sync(&memcg->kmem_shrink_work);
+
 	/*
 	 * kmem charges can outlive the cgroup. In the case of slab
 	 * pages, for instance, a page contain objects from various
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 04df967..10e6b2f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2729,7 +2729,43 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 
 	return nr_reclaimed;
 }
-#endif
+
+#ifdef CONFIG_MEMCG_KMEM
+unsigned long try_to_free_mem_cgroup_kmem(struct mem_cgroup *memcg,
+					  gfp_t gfp_mask)
+{
+	struct reclaim_state reclaim_state;
+	struct shrink_control shrink = {
+		.gfp_mask = gfp_mask,
+		.target_mem_cgroup = memcg,
+	};
+	int priority = DEF_PRIORITY;
+	unsigned long nr_to_reclaim = SWAP_CLUSTER_MAX;
+	unsigned long freed, total_freed = 0;
+
+	nodes_setall(shrink.nodes_to_scan);
+
+	lockdep_set_current_reclaim_state(gfp_mask);
+	reclaim_state.reclaimed_slab = 0;
+	current->reclaim_state = &reclaim_state;
+
+	do {
+		freed = shrink_slab(&shrink, 1000, 1000 << priority);
+		if (!freed)
+			congestion_wait(BLK_RW_ASYNC, HZ/10);
+		total_freed += freed;
+		if (current->reclaim_state->reclaimed_slab >= nr_to_reclaim)
+			break;
+	} while (--priority >= 0);
+
+	current->reclaim_state = NULL;
+	lockdep_clear_current_reclaim_state();
+
+	return total_freed;
+}
+#endif /* CONFIG_MEMCG_KMEM */
+
+#endif /* CONFIG_MEMCG */
 
 static void age_active_anon(struct zone *zone, struct scan_control *sc)
 {
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v12 14/18] vmscan: take at least one pass with shrinkers
  2013-12-02 11:19 [PATCH v12 00/18] kmemcg shrinkers Vladimir Davydov
                   ` (12 preceding siblings ...)
  2013-12-02 11:19 ` [PATCH v12 13/18] memcg: per-memcg kmem shrinking Vladimir Davydov
@ 2013-12-02 11:19 ` Vladimir Davydov
  2013-12-02 11:19 ` [PATCH v12 15/18] memcg: allow kmem limit to be resized down Vladimir Davydov
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 38+ messages in thread
From: Vladimir Davydov @ 2013-12-02 11:19 UTC (permalink / raw)
  To: hannes, mhocko, dchinner, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, vdavydov,
	Mel Gorman, Rik van Riel

In very low free kernel memory situations, we may have fewer objects to
free than our initial batch size. If that is the case, it is better to
shrink those and make room for the new workload than to keep them and
fail the new allocations.

In particular, we are concerned with the direct reclaim case for memcg.
Although the same technique could be applied to other situations just
as well, we start conservatively and apply it only to that case, which
is the one that matters the most.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c |   13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 10e6b2f..749e492 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -287,17 +287,22 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
 				fraction, denominator,
 				max_pass, delta, total_scan);
 
-	while (total_scan >= batch_size) {
+	while (total_scan > 0) {
 		unsigned long ret;
+		unsigned long nr_to_scan = min(batch_size, total_scan);
 
-		shrinkctl->nr_to_scan = batch_size;
+		if (!shrinkctl->target_mem_cgroup &&
+		    total_scan < batch_size)
+			break;
+
+		shrinkctl->nr_to_scan = nr_to_scan;
 		ret = shrinker->scan_objects(shrinker, shrinkctl);
 		if (ret == SHRINK_STOP)
 			break;
 		freed += ret;
 
-		count_vm_events(SLABS_SCANNED, batch_size);
-		total_scan -= batch_size;
+		count_vm_events(SLABS_SCANNED, nr_to_scan);
+		total_scan -= nr_to_scan;
 
 		cond_resched();
 	}
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v12 15/18] memcg: allow kmem limit to be resized down
  2013-12-02 11:19 [PATCH v12 00/18] kmemcg shrinkers Vladimir Davydov
                   ` (13 preceding siblings ...)
  2013-12-02 11:19 ` [PATCH v12 14/18] vmscan: take at least one pass with shrinkers Vladimir Davydov
@ 2013-12-02 11:19 ` Vladimir Davydov
  2013-12-02 11:19 ` [PATCH v12 16/18] vmpressure: in-kernel notifications Vladimir Davydov
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 38+ messages in thread
From: Vladimir Davydov @ 2013-12-02 11:19 UTC (permalink / raw)
  To: hannes, mhocko, dchinner, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, vdavydov,
	Balbir Singh, KAMEZAWA Hiroyuki

From: Glauber Costa <glommer@openvz.org>

The userspace memory limit can be freely resized down. Upon such an
attempt, reclaim is invoked to flush pages away until we either reach
the limit we want or give up.

So far this has not been possible with the kmem limit, since we had no
way to shrink the kmem buffers other than the big hammer of
shrink_slab(), which effectively frees data across the whole system.

The situation changes now that we have a per-memcg shrinker
infrastructure. We proceed analogously to the user memory counterpart
and try to shrink our buffers until we either reach the limit we want
or give up.

Signed-off-by: Glauber Costa <glommer@openvz.org>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   43 ++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 38 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3679acb..a0b22d7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5582,10 +5582,39 @@ static ssize_t mem_cgroup_read(struct cgroup_subsys_state *css,
 	return simple_read_from_buffer(buf, nbytes, ppos, str, len);
 }
 
+#ifdef CONFIG_MEMCG_KMEM
+/*
+ * This is slightly different from res or memsw reclaim.  We already have
+ * vmscan behind us to drive the reclaim, so we can basically keep trying until
+ * all buffers that can be flushed are flushed. We have a very clear signal
+ * about it in the form of the return value of try_to_free_mem_cgroup_kmem.
+ */
+static int mem_cgroup_resize_kmem_limit(struct mem_cgroup *memcg,
+					unsigned long long val)
+{
+	int ret = -EBUSY;
+
+	for (;;) {
+		if (signal_pending(current)) {
+			ret = -EINTR;
+			break;
+		}
+
+		ret = res_counter_set_limit(&memcg->kmem, val);
+		if (!ret)
+			break;
+
+		/* Can't free anything, pointless to continue */
+		if (!try_to_free_mem_cgroup_kmem(memcg, GFP_KERNEL))
+			break;
+	}
+
+	return ret;
+}
+
 static int memcg_update_kmem_limit(struct cgroup_subsys_state *css, u64 val)
 {
 	int ret = -EINVAL;
-#ifdef CONFIG_MEMCG_KMEM
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 	/*
 	 * For simplicity, we won't allow this to be disabled.  It also can't
@@ -5620,16 +5649,15 @@ static int memcg_update_kmem_limit(struct cgroup_subsys_state *css, u64 val)
 		 * starts accounting before all call sites are patched
 		 */
 		memcg_kmem_set_active(memcg);
-	} else
-		ret = res_counter_set_limit(&memcg->kmem, val);
+	} else {
+		ret = mem_cgroup_resize_kmem_limit(memcg, val);
+	}
 out:
 	mutex_unlock(&set_limit_mutex);
 	mutex_unlock(&memcg_create_mutex);
-#endif
 	return ret;
 }
 
-#ifdef CONFIG_MEMCG_KMEM
 static int memcg_propagate_kmem(struct mem_cgroup *memcg)
 {
 	int ret = 0;
@@ -5666,6 +5694,11 @@ static int memcg_propagate_kmem(struct mem_cgroup *memcg)
 out:
 	return ret;
 }
+#else
+static int memcg_update_kmem_limit(struct cgroup_subsys_state *css, u64 val)
+{
+	return -EINVAL;
+}
 #endif /* CONFIG_MEMCG_KMEM */
 
 /*
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v12 16/18] vmpressure: in-kernel notifications
  2013-12-02 11:19 [PATCH v12 00/18] kmemcg shrinkers Vladimir Davydov
                   ` (14 preceding siblings ...)
  2013-12-02 11:19 ` [PATCH v12 15/18] memcg: allow kmem limit to be resized down Vladimir Davydov
@ 2013-12-02 11:19 ` Vladimir Davydov
  2013-12-02 11:19 ` [PATCH v12 17/18] memcg: reap dead memcgs upon global memory pressure Vladimir Davydov
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 38+ messages in thread
From: Vladimir Davydov @ 2013-12-02 11:19 UTC (permalink / raw)
  To: hannes, mhocko, dchinner, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, vdavydov,
	John Stultz, Joonsoo Kim, Kamezawa Hiroyuki

From: Glauber Costa <glommer@openvz.org>

During the past weeks, it became clear to us that the shrinker
interface we have right now works very well for some particular types
of users, but not that well for others. The latter are usually people
interested in one-shot notifications, who were forced to adapt
themselves to the count+scan behavior of shrinkers. To do so, they had
no choice but to greatly abuse the shrinker interface, producing little
monsters all over.

During LSF/MM, one of the proposals that popped out during our session
was to reuse Anton Vorontsov's vmpressure for this. vmpressure
notifications are designed for userspace consumption, but also provide
a well-established, cgroup-aware entry point for notifications.

This patch extends that to also support in-kernel users. Events that
should be generated for in-kernel consumption will be marked as such,
and for those we will call a registered function instead of triggering
an eventfd notification.

Please note that, due to my lack of understanding of each shrinker
user, I will stay away from converting the actual users; you are all
welcome to do so.
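
For reference, registering an in-kernel consumer then boils down to
something like the sketch below (the handler and its body are made up;
only vmpressure_register_kernel_event() is introduced by this patch):

/* illustrative handler: called on memory pressure in the css's memcg */
static void my_pressure_handler(void)
{
	/* trim private caches, drop pooled objects, etc. */
}

static int my_register(struct cgroup_subsys_state *css)
{
	/* returns -ENOMEM if the event descriptor cannot be allocated */
	return vmpressure_register_kernel_event(css, my_pressure_handler);
}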

Signed-off-by: Glauber Costa <glommer@openvz.org>
Acked-by: Anton Vorontsov <anton@enomsg.org>
Acked-by: Pekka Enberg <penberg@kernel.org>
Reviewed-by: Greg Thelen <gthelen@google.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/vmpressure.h |    5 +++++
 mm/vmpressure.c            |   53 +++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 55 insertions(+), 3 deletions(-)

diff --git a/include/linux/vmpressure.h b/include/linux/vmpressure.h
index 3f3788d..9102e53 100644
--- a/include/linux/vmpressure.h
+++ b/include/linux/vmpressure.h
@@ -19,6 +19,9 @@ struct vmpressure {
 	/* Have to grab the lock on events traversal or modifications. */
 	struct mutex events_lock;
 
+	/* False if only kernel users want to be notified, true otherwise. */
+	bool notify_userspace;
+
 	struct work_struct work;
 };
 
@@ -38,6 +41,8 @@ extern int vmpressure_register_event(struct cgroup_subsys_state *css,
 				     struct cftype *cft,
 				     struct eventfd_ctx *eventfd,
 				     const char *args);
+extern int vmpressure_register_kernel_event(struct cgroup_subsys_state *css,
+					    void (*fn)(void));
 extern void vmpressure_unregister_event(struct cgroup_subsys_state *css,
 					struct cftype *cft,
 					struct eventfd_ctx *eventfd);
diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index e0f6283..730e7c1 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -130,8 +130,12 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned,
 }
 
 struct vmpressure_event {
-	struct eventfd_ctx *efd;
+	union {
+		struct eventfd_ctx *efd;
+		void (*fn)(void);
+	};
 	enum vmpressure_levels level;
+	bool kernel_event;
 	struct list_head node;
 };
 
@@ -147,12 +151,15 @@ static bool vmpressure_event(struct vmpressure *vmpr,
 	mutex_lock(&vmpr->events_lock);
 
 	list_for_each_entry(ev, &vmpr->events, node) {
-		if (level >= ev->level) {
+		if (ev->kernel_event) {
+			ev->fn();
+		} else if (vmpr->notify_userspace && level >= ev->level) {
 			eventfd_signal(ev->efd, 1);
 			signalled = true;
 		}
 	}
 
+	vmpr->notify_userspace = false;
 	mutex_unlock(&vmpr->events_lock);
 
 	return signalled;
@@ -222,7 +229,7 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
 	 * we account it too.
 	 */
 	if (!(gfp & (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_IO | __GFP_FS)))
-		return;
+		goto schedule;
 
 	/*
 	 * If we got here with no pages scanned, then that is an indicator
@@ -239,8 +246,15 @@ void vmpressure(gfp_t gfp, struct mem_cgroup *memcg,
 	vmpr->scanned += scanned;
 	vmpr->reclaimed += reclaimed;
 	scanned = vmpr->scanned;
+	/*
+	 * If we didn't reach this point, only kernel events will be triggered.
+	 * It is the job of the worker thread to clean this up once the
+	 * notifications are all delivered.
+	 */
+	vmpr->notify_userspace = true;
 	spin_unlock(&vmpr->sr_lock);
 
+schedule:
 	if (scanned < vmpressure_win)
 		return;
 	schedule_work(&vmpr->work);
@@ -324,6 +338,39 @@ int vmpressure_register_event(struct cgroup_subsys_state *css,
 }
 
 /**
+ * vmpressure_register_kernel_event() - Register kernel-side notification
+ * @css:	css that is interested in vmpressure notifications
+ * @fn:		function to be called when pressure happens
+ *
+ * This function registers in-kernel users interested in receiving
+ * notifications about pressure conditions. Pressure notifications will be
+ * triggered at the same time as userspace notifications (with no particular
+ * ordering relative to them).
+ *
+ * Pressure notifications are an alternative to shrinkers and serve users
+ * interested in a one-shot notification, with a well-defined, cgroup-aware
+ * interface.
+ */
+int vmpressure_register_kernel_event(struct cgroup_subsys_state *css,
+				      void (*fn)(void))
+{
+	struct vmpressure *vmpr = css_to_vmpressure(css);
+	struct vmpressure_event *ev;
+
+	ev = kzalloc(sizeof(*ev), GFP_KERNEL);
+	if (!ev)
+		return -ENOMEM;
+
+	ev->kernel_event = true;
+	ev->fn = fn;
+
+	mutex_lock(&vmpr->events_lock);
+	list_add(&ev->node, &vmpr->events);
+	mutex_unlock(&vmpr->events_lock);
+	return 0;
+}
+
+/**
  * vmpressure_unregister_event() - Unbind eventfd from vmpressure
  * @css:	css handle
  * @cft:	cgroup control files handle
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v12 17/18] memcg: reap dead memcgs upon global memory pressure
  2013-12-02 11:19 [PATCH v12 00/18] kmemcg shrinkers Vladimir Davydov
                   ` (15 preceding siblings ...)
  2013-12-02 11:19 ` [PATCH v12 16/18] vmpressure: in-kernel notifications Vladimir Davydov
@ 2013-12-02 11:19 ` Vladimir Davydov
  2013-12-02 11:19 ` [PATCH v12 18/18] memcg: flush memcg items upon memcg destruction Vladimir Davydov
  2013-12-02 11:22 ` [PATCH v12 00/18] kmemcg shrinkers Vladimir Davydov
  18 siblings, 0 replies; 38+ messages in thread
From: Vladimir Davydov @ 2013-12-02 11:19 UTC (permalink / raw)
  To: hannes, mhocko, dchinner, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, vdavydov,
	Anton Vorontsov, John Stultz, Kamezawa Hiroyuki

From: Glauber Costa <glommer@openvz.org>

When we delete kmem-enabled memcgs, they can linger around as zombies
for a while. The reason is that their objects may still be alive, so we
cannot free the memcgs at destruction time.

The only entry point for reclaiming those objects, though, is the
shrinkers. The shrinker interface, however, is not exactly tailored to
our needs. It could be a little better with the API Dave Chinner
proposed, but it is still not ideal, since what we want is not really a
count-and-scan event but a one-off flush-all-you-can event, which would
have to abuse that interface somehow.

Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c |   80 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 77 insertions(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a0b22d7..72db892 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -287,8 +287,16 @@ struct mem_cgroup {
 	/* thresholds for mem+swap usage. RCU-protected */
 	struct mem_cgroup_thresholds memsw_thresholds;
 
-	/* For oom notifier event fd */
-	struct list_head oom_notify;
+	union {
+		/* For oom notifier event fd */
+		struct list_head oom_notify;
+		/*
+		 * we can only trigger an oom event if the memcg is alive.
+		 * so we will reuse this field to hook the memcg in the list
+		 * of dead memcgs.
+		 */
+		struct list_head dead;
+	};
 
 	/*
 	 * Should we move charges of a task when a task is moved into this
@@ -338,6 +346,29 @@ struct mem_cgroup {
 	/* WARNING: nodeinfo must be the last member here */
 };
 
+#if defined(CONFIG_MEMCG_KMEM) || defined(CONFIG_MEMCG_SWAP)
+static LIST_HEAD(dangling_memcgs);
+static DEFINE_MUTEX(dangling_memcgs_mutex);
+
+static inline void memcg_dangling_del(struct mem_cgroup *memcg)
+{
+	mutex_lock(&dangling_memcgs_mutex);
+	list_del(&memcg->dead);
+	mutex_unlock(&dangling_memcgs_mutex);
+}
+
+static inline void memcg_dangling_add(struct mem_cgroup *memcg)
+{
+	INIT_LIST_HEAD(&memcg->dead);
+	mutex_lock(&dangling_memcgs_mutex);
+	list_add(&memcg->dead, &dangling_memcgs);
+	mutex_unlock(&dangling_memcgs_mutex);
+}
+#else
+static inline void memcg_dangling_del(struct mem_cgroup *memcg) {}
+static inline void memcg_dangling_add(struct mem_cgroup *memcg) {}
+#endif
+
 static size_t memcg_size(void)
 {
 	return sizeof(struct mem_cgroup) +
@@ -6363,6 +6394,41 @@ static int mem_cgroup_oom_control_write(struct cgroup_subsys_state *css,
 }
 
 #ifdef CONFIG_MEMCG_KMEM
+static void memcg_vmpressure_shrink_dead(void)
+{
+	struct memcg_cache_params *params, *tmp;
+	struct kmem_cache *cachep;
+	struct mem_cgroup *memcg;
+
+	mutex_lock(&dangling_memcgs_mutex);
+	list_for_each_entry(memcg, &dangling_memcgs, dead) {
+		mutex_lock(&memcg->slab_caches_mutex);
+		/* The element may go away as an indirect result of shrink */
+		list_for_each_entry_safe(params, tmp,
+					 &memcg->memcg_slab_caches, list) {
+			cachep = memcg_params_to_cache(params);
+			/*
+			 * the cpu_hotplug lock is taken in kmem_cache_create
+			 * outside the slab_caches_mutex manipulation. It will
+			 * be taken by kmem_cache_shrink to flush the cache.
+			 * So we need to drop the lock. It is all right because
+			 * the lock only protects elements moving in and out the
+			 * list.
+			 */
+			mutex_unlock(&memcg->slab_caches_mutex);
+			kmem_cache_shrink(cachep);
+			mutex_lock(&memcg->slab_caches_mutex);
+		}
+		mutex_unlock(&memcg->slab_caches_mutex);
+	}
+	mutex_unlock(&dangling_memcgs_mutex);
+}
+
+static void memcg_register_kmem_events(struct cgroup_subsys_state *css)
+{
+	vmpressure_register_kernel_event(css, memcg_vmpressure_shrink_dead);
+}
+
 static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 {
 	int ret;
@@ -6420,6 +6486,10 @@ static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
 		css_put(&memcg->css);
 }
 #else
+static inline void memcg_register_kmem_events(struct cgroup *cont)
+{
+}
+
 static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 {
 	return 0;
@@ -6758,8 +6828,10 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
 	if (css->cgroup->id > MEM_CGROUP_ID_MAX)
 		return -ENOSPC;
 
-	if (!parent)
+	if (!parent) {
+		memcg_register_kmem_events(css);
 		return 0;
+	}
 
 	mutex_lock(&memcg_create_mutex);
 
@@ -6821,6 +6893,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	mem_cgroup_invalidate_reclaim_iterators(memcg);
 	mem_cgroup_reparent_charges(memcg);
 	mem_cgroup_destroy_all_caches(memcg);
+	memcg_dangling_add(memcg);
 	vmpressure_cleanup(&memcg->vmpressure);
 }
 
@@ -6829,6 +6902,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 
 	memcg_destroy_kmem(memcg);
+	memcg_dangling_del(memcg);
 	__mem_cgroup_free(memcg);
 }
 
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH v12 18/18] memcg: flush memcg items upon memcg destruction
  2013-12-02 11:19 [PATCH v12 00/18] kmemcg shrinkers Vladimir Davydov
                   ` (16 preceding siblings ...)
  2013-12-02 11:19 ` [PATCH v12 17/18] memcg: reap dead memcgs upon global memory pressure Vladimir Davydov
@ 2013-12-02 11:19 ` Vladimir Davydov
  2013-12-02 11:22 ` [PATCH v12 00/18] kmemcg shrinkers Vladimir Davydov
  18 siblings, 0 replies; 38+ messages in thread
From: Vladimir Davydov @ 2013-12-02 11:19 UTC (permalink / raw)
  To: hannes, mhocko, dchinner, akpm
  Cc: linux-kernel, linux-mm, cgroups, devel, glommer, vdavydov,
	Balbir Singh, KAMEZAWA Hiroyuki

From: Glauber Costa <glommer@openvz.org>

When a memcg is destroyed, it is not released immediately: it hangs
around until all objects charged to it are gone. This means that if a
memcg is restarted with the very same workload - a very common case -
the objects already cached won't be billed to the new memcg. This is
mostly undesirable, since a container can exploit it by restarting
itself every time it reaches its limit, and then coming up again with
a fresh new limit.

Since we now have targeted reclaim, I maintain that a memcg that is
destroyed should be flushed away. It makes perfect sense if we assume
that a memcg that goes away most likely indicates an isolated workload
that has been terminated.

Signed-off-by: Glauber Costa <glommer@openvz.org>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 72db892..e780511 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6452,12 +6452,29 @@ static void memcg_destroy_kmem(struct mem_cgroup *memcg)
 
 static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
 {
+	int ret;
 	if (!memcg_kmem_is_active(memcg))
 		return;
 
 	cancel_work_sync(&memcg->kmem_shrink_work);
 
 	/*
+	 * When a memcg is destroyed, it won't be imediately released until all
+	 * objects are gone. This means that if a memcg is restarted with the
+	 * very same workload - a very common case, the objects already cached
+	 * won't be billed to the new memcg. This is mostly undesirable since a
+	 * container can exploit this by restarting itself every time it
+	 * reaches its limit, and then coming up again with a fresh new limit.
+	 *
+	 * Therefore a memcg that is destroyed should be flushed away. It makes
+	 * perfect sense if we assume that a memcg that goes away indicates an
+	 * isolated workload that is terminated.
+	 */
+	do {
+		ret = try_to_free_mem_cgroup_kmem(memcg, GFP_KERNEL);
+	} while (ret);
+
+	/*
 	 * kmem charges can outlive the cgroup. In the case of slab
 	 * pages, for instance, a page contain objects from various
 	 * processes. As we prevent from taking a reference for every
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v12 00/18] kmemcg shrinkers
  2013-12-02 11:19 [PATCH v12 00/18] kmemcg shrinkers Vladimir Davydov
                   ` (17 preceding siblings ...)
  2013-12-02 11:19 ` [PATCH v12 18/18] memcg: flush memcg items upon memcg destruction Vladimir Davydov
@ 2013-12-02 11:22 ` Vladimir Davydov
  18 siblings, 0 replies; 38+ messages in thread
From: Vladimir Davydov @ 2013-12-02 11:22 UTC (permalink / raw)
  To: hannes
  Cc: Vladimir Davydov, mhocko, dchinner, akpm, linux-kernel, linux-mm,
	cgroups, devel, glommer

Hi, Johannes

I tried to fix the patchset according to your comments, but there were
a couple of places that, after a bit of thinking, I found impossible to
amend exactly the way you proposed. Here they are:

>> +static unsigned long
>> +zone_nr_reclaimable_pages(struct scan_control *sc, struct zone *zone)
>> +{
>> +	if (global_reclaim(sc))
>> +		return zone_reclaimable_pages(zone);
>> +	return memcg_zone_reclaimable_pages(sc->target_mem_cgroup, zone);
>> +}
> So we have zone_reclaimable_pages() and zone_nr_reclaimable_pages()
> with completely different signatures and usecases.  Not good.
>
> The intersection between a zone and a memcg is called an lruvec,
> please use that.  Look up an lruvec as early as possible, then
> implement lruvec_reclaimable_pages() etc. for use during reclaim.

We iterate over lruvecs in shrink_zone(), so AFAIU I should have put
the lru_pages counting there. However, we're not guaranteed to iterate
over all lruvecs eligible for the current allocation if the zone is
being shrunk concurrently. One way to fix this would be to rewrite the
memcg iteration interface so that it always iterates over all memcgs
and returns a flag in the cookie saying whether the current memcg
should be reclaimed, but I found that somewhat obfuscating and
preferred to simply fix the function name.

> -		for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
> -			if (!node_online(shrinkctl->nid))
> -				continue;
> -
> -			if (!(shrinker->flags & SHRINKER_NUMA_AWARE) &&
> -			    (shrinkctl->nid != 0))
> +			if (!memcg || memcg_kmem_is_active(memcg))
> +				freed += shrink_slab_one(shrinkctl, shrinker,
> +					 nr_pages_scanned, lru_pages);
> +			/*
> +			 * For non-memcg aware shrinkers, we will arrive here
> +			 * at first pass because we need to scan the root
> +			 * memcg.  We need to bail out, since exactly because
> +			 * they are not memcg aware, instead of noticing they
> +			 * have nothing to shrink, they will just shrink again,
> +			 * and deplete too many objects.
> +			 */
> I actually found the code easier to understand without this comment.
>
>> +			if (!(shrinker->flags & SHRINKER_MEMCG_AWARE))
>>   				break;
>> +			shrinkctl->target_mem_cgroup =
>> +				mem_cgroup_iter(root, memcg, NULL);
> The target memcg is always the same, don't change this.  Look at the
> lru scan code for reference.  Iterate zones (nodes in this case)
> first, then iterate the memcgs in each zone (node), look up the lruvec
> and then call shrink_slab_lruvec(lruvec, ...).

They are somewhat different: shrink_zone() calls shrink_lruvec() for
each memcg's lruvec, but for shrink_slab() we don't have lruvecs. We do
have memcg_list_lru, but it's up to the shrinker to use it. In other
words, in contrast to page cache reclaim, kmem shrinkers are opaque to
vmscan.

Anyway, I can't help agreeing with you that changing target_mem_cgroup
while iterating over memcgs looks ugly, so I rewrote it a bit to
resemble the way per-node shrinking is implemented. For per-node
shrinking we have shrink_control::nodes_to_scan, which is set by the
shrink_slab() caller, and shrink_control::nid, which is initialized by
shrink_slab() while iterating over online NUMA nodes and actually used
by the shrinker. Similarly, for memory cgroups I added two fields to
the shrink_control struct, target_mem_cgroup and memcg: the former is
set by the shrink_slab() caller and specifies the memory cgroup tree to
scan, while the latter is used as the iterator by shrink_slab() and as
the target memcg by a particular memcg-aware shrinker.
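
To make that concrete, the resulting shrink_control would look roughly
like this (a sketch based on the description above, not the exact hunk
from the patch):

struct shrink_control {
	gfp_t gfp_mask;
	unsigned long nr_to_scan;

	/* set by the shrink_slab() caller */
	nodemask_t nodes_to_scan;		/* nodes to shrink from */
	struct mem_cgroup *target_mem_cgroup;	/* memcg tree to shrink from */

	/* set by shrink_slab() while iterating, consumed by the shrinker */
	int nid;				/* current node */
	struct mem_cgroup *memcg;		/* current memcg */
};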

I would appreciate it if you could look at the new version and share
your thoughts on it.

Thank you.

On 12/02/2013 03:19 PM, Vladimir Davydov wrote:
> Hi,
>
> This is the 12th iteration of Glauber Costa's patchset implementing targeted
> shrinking for memory cgroups when kmem limits are present. So far, we've been
> accounting kernel objects but failing allocations when short of memory. This is
> because our only option would be to call the global shrinker, depleting objects
> from all caches and breaking isolation.
>
> The main idea is to make LRU lists used by FS slab shrinkers per-memcg. When
> adding or removing an element from from the LRU, we use the page information to
> figure out which memory cgroup it belongs to and relay it to the appropriate
> list. This allows scanning kmem objects accounted to different memory cgroups
> independently.
>
> The patchset is based on top of Linux 3.13-rc2 and organized as follows:
>
>   * patches 1-8 are for cleanup/preparation;
>   * patch 9 introduces infrastructure for memcg-aware shrinkers;
>   * patches 10 and 11 implement the per-memcg LRU list structure;
>   * patch 12 uses per-memcg LRU lists to make dcache and icache shrinkers
>     memcg-aware;
>   * patch 13 implements kmem-only shrinking;
>   * patches 14-18 issue kmem shrinking on limit resize, global pressure.
>
> Known issues:
>
>   * Since FS shrinkers can't be executed on __GFP_FS allocations, such
>     allocations will fail if memcg kmem limit is less than the user limit and
>     the memcg kmem usage is close to its limit. Glauber proposed to schedule a
>     worker which would shrink kmem in the background on such allocations.
>     However, this approach does not eliminate failures completely, it just makes
>     them rarer. I'm thinking on implementing soft limits for memcg kmem so that
>     striking the soft limit will trigger the reclaimer, but won't fail the
>     allocation. I would appreciate any other proposals on how this can be fixed.
>
>   * Only dcache and icache are reclaimed on memcg pressure. Other FS objects are
>     left for global pressure only. However, it should not be a serious problem
>     to make them reclaimable too by passing on memcg to the FS-layer and letting
>     each FS decide if its internal objects are shrinkable on memcg pressure.
>
> Changelog:
>
> Changes in v12:
>   * Do not prune all slabs on kmem-only pressure.
>   * Count all on-LRU pages eligible for reclaim to pass to shrink_slab().
>   * Fix isolation issue due to using shrinker->nr_deferred on memcg pressure.
>   * Add comments to memcg_list_lru functions.
>   * Code cleanup/refactoring.
>
> Changes in v11:
>   * Rework per-memcg list_lru infrastructure.
>
> Glauber Costa (7):
>    memcg: make cache index determination more robust
>    memcg: consolidate callers of memcg_cache_id
>    memcg: move initialization to memcg creation
>    memcg: allow kmem limit to be resized down
>    vmpressure: in-kernel notifications
>    memcg: reap dead memcgs upon global memory pressure
>    memcg: flush memcg items upon memcg destruction
>
> Vladimir Davydov (11):
>    memcg: move several kmemcg functions upper
>    fs: do not use destroy_super() in alloc_super() fail path
>    vmscan: rename shrink_slab() args to make it more generic
>    vmscan: move call to shrink_slab() to shrink_zones()
>    vmscan: do_try_to_free_pages(): remove shrink_control argument
>    vmscan: shrink slab on memcg pressure
>    memcg,list_lru: add per-memcg LRU list infrastructure
>    memcg,list_lru: add function walking over all lists of a per-memcg
>      LRU
>    fs: make icache, dcache shrinkers memcg-aware
>    memcg: per-memcg kmem shrinking
>    vmscan: take at least one pass with shrinkers
>
>   fs/dcache.c                   |   25 +-
>   fs/inode.c                    |   16 +-
>   fs/internal.h                 |    9 +-
>   fs/super.c                    |   48 ++-
>   include/linux/fs.h            |    4 +-
>   include/linux/list_lru.h      |   83 +++++
>   include/linux/memcontrol.h    |   22 ++
>   include/linux/mm.h            |    3 +-
>   include/linux/shrinker.h      |   10 +-
>   include/linux/swap.h          |    2 +
>   include/linux/vmpressure.h    |    5 +
>   include/trace/events/vmscan.h |   20 +-
>   mm/memcontrol.c               |  728 ++++++++++++++++++++++++++++++++++++-----
>   mm/vmpressure.c               |   53 ++-
>   mm/vmscan.c                   |  249 +++++++++-----
>   15 files changed, 1054 insertions(+), 223 deletions(-)
>


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v12 05/18] fs: do not use destroy_super() in alloc_super() fail path
  2013-12-02 11:19 ` [PATCH v12 05/18] fs: do not use destroy_super() in alloc_super() fail path Vladimir Davydov
@ 2013-12-03  9:00   ` Dave Chinner
  2013-12-03  9:23     ` Vladimir Davydov
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2013-12-03  9:00 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: hannes, mhocko, dchinner, akpm, linux-kernel, linux-mm, cgroups,
	devel, glommer, Al Viro

On Mon, Dec 02, 2013 at 03:19:40PM +0400, Vladimir Davydov wrote:
> Using destroy_super() in alloc_super() fail path is bad, because:
> 
> * It will trigger WARN_ON(!list_empty(&s->s_mounts)) since s_mounts is
>   initialized after several 'goto fail's.

So let's fix that.

> * It will call kfree_rcu() to free the super block although kfree() is
>   obviously enough there.
> * The list_lru structure was initially implemented without the ability
>   to destroy an uninitialized object in mind.
> 
> I'm going to replace the conventional list_lru with per-memcg lru to
> implement per-memcg slab reclaim. This new structure will fail
> destruction of objects that haven't been properly initialized so let's
> inline appropriate snippets from destroy_super() to alloc_super() fail
> path instead of using the whole function there.

You're basically undoing the change made in commit 7eb5e88 ("uninline
destroy_super(), consolidate alloc_super()") which was done less
than a month ago. :/

The code as it stands works just fine - the list-lru structures in
the superblock are actually initialised (to zeros) - and so calling
list_lru_destroy() on them in that state works correctly, as the
pointers that get freed are NULL. Yes, unexpected, but perfectly
valid code.
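
For reference, list_lru_destroy() at this point is essentially just:

void list_lru_destroy(struct list_lru *lru)
{
	kfree(lru->node);	/* kfree(NULL) is a no-op */
}

so a zero-filled list_lru does indeed survive it.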

I haven't looked at the internals of the list_lru changes you've
made yet, but it surprises me that we can't handle this case
internally to list_lru_destroy().

Al, your call on inlining destroy_super() in alloc_super() again....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v12 05/18] fs: do not use destroy_super() in alloc_super() fail path
  2013-12-03  9:00   ` Dave Chinner
@ 2013-12-03  9:23     ` Vladimir Davydov
  2013-12-03 13:37       ` Al Viro
  0 siblings, 1 reply; 38+ messages in thread
From: Vladimir Davydov @ 2013-12-03  9:23 UTC (permalink / raw)
  To: Dave Chinner
  Cc: hannes, mhocko, dchinner, akpm, linux-kernel, linux-mm, cgroups,
	devel, glommer, Al Viro

On 12/03/2013 01:00 PM, Dave Chinner wrote:
> On Mon, Dec 02, 2013 at 03:19:40PM +0400, Vladimir Davydov wrote:
>> Using destroy_super() in alloc_super() fail path is bad, because:
>>
>> * It will trigger WARN_ON(!list_empty(&s->s_mounts)) since s_mounts is
>>   initialized after several 'goto fail's.
> So let's fix that.
>
>> * It will call kfree_rcu() to free the super block although kfree() is
>>   obviously enough there.
>> * The list_lru structure was initially implemented without the ability
>>   to destroy an uninitialized object in mind.
>>
>> I'm going to replace the conventional list_lru with per-memcg lru to
>> implement per-memcg slab reclaim. This new structure will fail
>> destruction of objects that haven't been properly initialized so let's
>> inline appropriate snippets from destroy_super() to alloc_super() fail
>> path instead of using the whole function there.
> You're basically undoing the change made in commit 7eb5e88 ("uninline
> destroy_super(), consolidate alloc_super()") which was done less
> than a month ago. :/
>
> The code as it stands works just fine - the list-lru structures in
> the superblock are actually initialised (to zeros) - and so calling
> list_lru_destroy() on it works just fine in that state as the
> pointers that are freed are NULL. Yes, unexpected, but perfectly
> valid code.
>
> I haven't looked at the internals of the list_lru changes you've
> made yet, but it surprises me that we can't handle this case
> internally to list_lru_destroy().

Actually, I'm not going to modify the list_lru structure, because I
think it's good as it is. I'd rather replace it with a new structure,
memcg_list_lru, only in those places where this functionality
(per-memcg scanning) is really needed. This new structure would look
like this:

struct memcg_list_lru {
    struct list_lru global_lru;
    struct list_lru **memcg_lrus;
    struct list_head list;
    void *old_lrus;
};

Since old_lrus and memcg_lrus can be NULL under normal operation, in
memcg_list_lru_destroy() I'd have to check either the list or the
global_lru field, i.e. it would look like:

if (!list.next)
    /* has not been initialized */
    return;

or

if (!global_lru.node)
    /* has not been initialized */
    return;

I find both of these checks ugly :-(
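
Just to illustrate, with the second variant memcg_list_lru_destroy()
would start like this (a sketch only, the teardown body is elided):

static void memcg_list_lru_destroy(struct memcg_list_lru *lru)
{
	if (!lru->global_lru.node)
		return;		/* init() was never called */

	/* ... free memcg_lrus/old_lrus and destroy the global lru ... */
}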

Personally, I think that calling destroy() w/o init() is OK only for
simple structures where destroy/init are inline functions or macros;
otherwise one can forget to "fix" destroy() after extending a
structure.

Thanks.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v12 06/18] vmscan: rename shrink_slab() args to make it more generic
  2013-12-02 11:19 ` [PATCH v12 06/18] vmscan: rename shrink_slab() args to make it more generic Vladimir Davydov
@ 2013-12-03  9:33   ` Dave Chinner
  2013-12-03  9:44     ` Vladimir Davydov
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2013-12-03  9:33 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: hannes, mhocko, dchinner, akpm, linux-kernel, linux-mm, cgroups,
	devel, glommer, Mel Gorman, Rik van Riel

On Mon, Dec 02, 2013 at 03:19:41PM +0400, Vladimir Davydov wrote:
> Currently in addition to a shrink_control struct shrink_slab() takes two
> arguments, nr_pages_scanned and lru_pages, which are used for balancing
> slab reclaim versus page reclaim - roughly speaking, shrink_slab() will
> try to scan nr_pages_scanned/lru_pages fraction of all slab objects.

Yes, that is its primary purpose, and the variables explain that
clearly. i.e. we pass a quantity of work to be done, and a value
to relate that to the overall size of the cache that the work was
done on. i.e. they tell us that shrink_slab() is trying to stay in
balance with the amount of work that page cache reclaim has just
done.

> However, shrink_slab() is not always called after page cache reclaim.
> For example, drop_slab() uses shrink_slab() to drop as many slab objects
> as possible and thus has to pass phony values 1000/1000 to it, which do
> not make sense for nr_pages_scanned/lru_pages. Moreover, as soon as

Yup, but that's not the primary purpose of the code, and it doesn't
require balancing against page cache reclaim. Hence the numbers that
are passed in are just there to make the shrinkers iterate
efficiently without doing too much work in a single scan, i.e.
reclaim in chunks across all caches rather than trying to completely
remove a single cache at a time....

And the reason this is done? Because caches have reclaim
relationships that mean some shrinkers can't make progress until
other shrinkers do their work. Hence, to effectively free all memory,
we have to iterate repeatedly across all caches. That's what
drop_slab() does.
> kmemcg reclaim is introduced, we will have to make up phony values for
> nr_pages_scanned and lru_pages again when doing kmem-only reclaim for a
> memory cgroup, which is possible if the cgroup has its kmem limit less
> than the total memory limit.

I'm missing something here - why would memcg reclaim require
passing phony values? How are you going to keep slab caches in
balance with memory pressure generated by the page cache?

And, FWIW:

>  unsigned long shrink_slab(struct shrink_control *shrink,
> -			  unsigned long nr_pages_scanned,
> -			  unsigned long lru_pages);
> +			  unsigned long fraction, unsigned long denominator);

I'm not sure what "fraction" means in this case. A fraction is made
up of a numerator and denominator, but:

> @@ -243,9 +243,9 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
>  	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
>  
>  	total_scan = nr;
> -	delta = (4 * nr_pages_scanned) / shrinker->seeks;
> +	delta = (4 * fraction) / shrinker->seeks;

 (4 * nr_pages_scanned) is a dividend, while:

>  	delta *= max_pass;
> -	do_div(delta, lru_pages + 1);
> +	do_div(delta, denominator + 1);

(lru_pages + 1) is a divisor.
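
Spelled out, the quantity that the quoted hunk computes is roughly this
(a simplified sketch using the original variable names, ignoring the
64-bit do_div() and the later clamping):

static unsigned long scan_delta(unsigned long nr_pages_scanned,
				unsigned long lru_pages,
				unsigned long max_pass, int seeks)
{
	/* work done by page reclaim, weighted by the shrinker's seek cost */
	unsigned long delta = (4 * nr_pages_scanned) / seeks;

	/* scale the cache size by that fraction of the LRU pages scanned */
	delta *= max_pass;
	return delta / (lru_pages + 1);
}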

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v12 06/18] vmscan: rename shrink_slab() args to make it more generic
  2013-12-03  9:33   ` Dave Chinner
@ 2013-12-03  9:44     ` Vladimir Davydov
  2013-12-03 10:04       ` Dave Chinner
  0 siblings, 1 reply; 38+ messages in thread
From: Vladimir Davydov @ 2013-12-03  9:44 UTC (permalink / raw)
  To: Dave Chinner
  Cc: hannes, mhocko, dchinner, akpm, linux-kernel, linux-mm, cgroups,
	devel, glommer, Mel Gorman, Rik van Riel

On 12/03/2013 01:33 PM, Dave Chinner wrote:
>> kmemcg reclaim is introduced, we will have to make up phony values for
>> nr_pages_scanned and lru_pages again when doing kmem-only reclaim for a
>> memory cgroup, which is possible if the cgroup has its kmem limit less
>> than the total memory limit.
> I'm missing something here - why would memcg reclaim require
> passing phony values? How are you going to keep slab caches in
> balance with memory pressure generated by the page cache?

To perform kmem-only reclaim - please see patch 13. When the kmem limit
is less than the user limit, we can hit the kmem limit without hitting
the user limit, and it is no use trying to reclaim page cache then.

On normal reclaim - I mean try_to_free_mem_cgroup_pages() - I pass
nr_scanned/lru_pages to shrink_slab() as usual.

Thanks.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v12 06/18] vmscan: rename shrink_slab() args to make it more generic
  2013-12-03  9:44     ` Vladimir Davydov
@ 2013-12-03 10:04       ` Dave Chinner
  0 siblings, 0 replies; 38+ messages in thread
From: Dave Chinner @ 2013-12-03 10:04 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: hannes, mhocko, dchinner, akpm, linux-kernel, linux-mm, cgroups,
	devel, glommer, Mel Gorman, Rik van Riel

On Tue, Dec 03, 2013 at 01:44:38PM +0400, Vladimir Davydov wrote:
> On 12/03/2013 01:33 PM, Dave Chinner wrote:
> >> kmemcg reclaim is introduced, we will have to make up phony values for
> >> nr_pages_scanned and lru_pages again when doing kmem-only reclaim for a
> >> memory cgroup, which is possible if the cgroup has its kmem limit less
> >> than the total memory limit.
> > I'm missing something here - why would memcg reclaim require
> > passing phony values? How are you going to keep slab caches in
> > balance with memory pressure generated by the page cache?
> 
> To perform kmem-only reclaim - please see patch 13. When the kmem limit
> is less than the user limit, we can reach only the kmem limit. It is no
> use trying to reclaim page cache then.

Oh dear. What a nasty hack.

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v12 09/18] vmscan: shrink slab on memcg pressure
  2013-12-02 11:19 ` [PATCH v12 09/18] vmscan: shrink slab on memcg pressure Vladimir Davydov
@ 2013-12-03 10:48   ` Dave Chinner
  2013-12-03 12:15     ` Vladimir Davydov
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2013-12-03 10:48 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: hannes, mhocko, dchinner, akpm, linux-kernel, linux-mm, cgroups,
	devel, glommer, Mel Gorman, Rik van Riel, Al Viro, Balbir Singh,
	KAMEZAWA Hiroyuki

On Mon, Dec 02, 2013 at 03:19:44PM +0400, Vladimir Davydov wrote:
> This patch makes direct reclaim path shrink slabs not only on global
> memory pressure, but also when we reach memory cgroup limit. To achieve
> that, it introduces a new per-shrinker flag, SHRINKER_MEMCG_AWARE, which
> should be set if the shrinker can handle per-memcg reclaim. For such
> shrinkers, shrink_slab() will iterate over all eligible memory cgroups
> (i.e. the cgroup that triggered the reclaim and all its descendants) and
> pass the current memory cgroup to the shrinker in shrink_control.memcg
> just like it passes the current NUMA node to NUMA-aware shrinkers.  It
> is completely up to memcg-aware shrinkers how to organize objects in
> order to provide required functionality. Currently none of the existing
> shrinkers is memcg-aware, but next patches will introduce per-memcg
> list_lru, which will facilitate the process of turning shrinkers that
> use list_lru to be memcg-aware.
> 
> The number of slab objects scanned on memcg pressure is calculated in
> the same way as on global pressure - it is proportional to the number of
> pages scanned over the number of pages eligible for reclaim (i.e. the
> number of on-LRU pages in the target memcg and all its descendants) -
> except we do not employ the nr_deferred per-shrinker counter to avoid
> memory cgroup isolation issues. Ideally, this counter should be made
> per-memcg.
> 
....

> @@ -236,11 +236,17 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
>  		return 0;
>  
>  	/*
> -	 * copy the current shrinker scan count into a local variable
> -	 * and zero it so that other concurrent shrinker invocations
> -	 * don't also do this scanning work.
> +	 * Do not touch global counter of deferred objects on memcg pressure to
> +	 * avoid isolation issues. Ideally the counter should be per-memcg.
>  	 */
> -	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
> +	if (!shrinkctl->target_mem_cgroup) {
> +		/*
> +		 * copy the current shrinker scan count into a local variable
> +		 * and zero it so that other concurrent shrinker invocations
> +		 * don't also do this scanning work.
> +		 */
> +		nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
> +	}

That's ugly. Effectively it means that memcg reclaim is going to be
completely ineffective when large numbers of allocations and hence
reclaim attempts are done under GFP_NOFS context.

The only thing that keeps filesystem caches in balance when there is
lots of filesystem work going on (i.e. lots of GFP_NOFS allocations)
is the deferral of reclaim work to a context that can do something
about it.

>  	total_scan = nr;
>  	delta = (4 * fraction) / shrinker->seeks;
> @@ -296,21 +302,46 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
>  		cond_resched();
>  	}
>  
> -	/*
> -	 * move the unused scan count back into the shrinker in a
> -	 * manner that handles concurrent updates. If we exhausted the
> -	 * scan, there is no need to do an update.
> -	 */
> -	if (total_scan > 0)
> -		new_nr = atomic_long_add_return(total_scan,
> +	if (!shrinkctl->target_mem_cgroup) {
> +		/*
> +		 * move the unused scan count back into the shrinker in a
> +		 * manner that handles concurrent updates. If we exhausted the
> +		 * scan, there is no need to do an update.
> +		 */
> +		if (total_scan > 0)
> +			new_nr = atomic_long_add_return(total_scan,
>  						&shrinker->nr_deferred[nid]);
> -	else
> -		new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
> +		else
> +			new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
> +	}

So, if the memcg can't make progress, why wouldn't you defer the
work to the global scan? Or can't a global scan trim memcg LRUs?
And if it can't, then isn't that a major design flaw? Why not just
allow kswapd to walk memcg LRUs in the background?

/me just looked at patch 13

Yeah, this goes some way to explaining why something like patch 13
is necessary - slab shrinkers are not keeping up with page cache
reclaim because of GFP_NOFS allocations, and so the page cache
empties, leaving only the slab caches to be trimmed....


> +static unsigned long
> +shrink_slab_memcg(struct shrink_control *shrinkctl, struct shrinker *shrinker,
> +		  unsigned long fraction, unsigned long denominator)

What's this function got to do with memcgs? Why did you rename it
from the self-explanatory shrink_slab_one() name that Glauber gave
it?

> +{
> +	unsigned long freed = 0;
> +
> +	if (shrinkctl->memcg && !memcg_kmem_is_active(shrinkctl->memcg))
> +		return 0;

Why here? why not check that in the caller where memcg's are being
iterated?

> +
> +	for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
> +		if (!node_online(shrinkctl->nid))
> +			continue;
> +
> +		if (!(shrinker->flags & SHRINKER_NUMA_AWARE) &&
> +		    (shrinkctl->nid != 0))
> +			break;

Hmmm - this looks broken. Nothing guarantees that node 0 in
shrinkctl->nodes_to_scan is ever set, so non-NUMA-aware shrinkers
will do nothing when node 0 is not set in the mask. For non-NUMA-aware
shrinkers, the shrinker should always be called once with a
node id of 0.

That's what earlier versions of the NUMA-aware shrinker patch set
did, and it seems to have been lost along the way. Here's the last
version from Glauber's tree that I saw:

static unsigned long
shrink_slab_one(struct shrink_control *shrinkctl, struct shrinker *shrinker,
               unsigned long nr_pages_scanned, unsigned long lru_pages)
{
       unsigned long freed = 0;

       if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) {
               shrinkctl->nid = 0;

               return shrink_slab_node(shrinkctl, shrinker,
                        nr_pages_scanned, lru_pages,
                        &shrinker->nr_deferred);
       }

       for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
               if (!node_online(shrinkctl->nid))
                       continue;

               freed += shrink_slab_node(shrinkctl, shrinker,
                        nr_pages_scanned, lru_pages,
			 &shrinker->nr_deferred_node[shrinkctl->nid]);
       }

       return freed;
}

So, that's likely to be another reason that all the non-numa slab
caches are not being shrunk appropriately and need to be hit with a
big hammer...

> @@ -352,18 +383,23 @@ unsigned long shrink_slab(struct shrink_control *shrinkctl,
>  	}
>  
>  	list_for_each_entry(shrinker, &shrinker_list, list) {
> -		for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
> -			if (!node_online(shrinkctl->nid))
> -				continue;
> -
> -			if (!(shrinker->flags & SHRINKER_NUMA_AWARE) &&
> -			    (shrinkctl->nid != 0))
> +		shrinkctl->memcg = shrinkctl->target_mem_cgroup;
> +		do {
> +			if (!(shrinker->flags & SHRINKER_MEMCG_AWARE) &&
> +			    (shrinkctl->memcg != NULL)) {
> +				mem_cgroup_iter_break(
> +						shrinkctl->target_mem_cgroup,
> +						shrinkctl->memcg);
>  				break;
> +			}
>  
> -			freed += shrink_slab_node(shrinkctl, shrinker,
> -						  fraction, denominator);
> +			freed += shrink_slab_memcg(shrinkctl, shrinker,
> +						   fraction, denominator);
> +			shrinkctl->memcg = mem_cgroup_iter(
> +						shrinkctl->target_mem_cgroup,
> +						shrinkctl->memcg, NULL);
> +		} while (shrinkctl->memcg);

Glauber's tree also had a bunch of comments explaining what was
going on here. I've got no idea what the hell this code is doing,
why the hell we are iterating memcgs here, or how and why the
normal, non-memcg scan and shrinkers still work.

This is now just a bunch of memcg gobbledegook with no explanations
to tell us what it is supposed to be doing. Comments are important -
you might not think they are necessary, but seeing comments like
this:

+               /*
+                * In a hierarchical chain, it might be that not all memcgs are
+                * kmem active. kmemcg design mandates that when one memcg is
+                * active, its children will be active as well. But it is
+                * perfectly possible that its parent is not.
+                *
+                * We also need to make sure we scan at least once, for the
+                * global case. So if we don't have a target memcg (saved in
+                * root), we proceed normally and expect to break in the next
+                * round.
+                */

in Glauber's tree helped an awful lot to explain the mess that the
memcg stuff was making of the code...

I'm liking this patch set less and less as I work my way through
it...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v12 10/18] memcg,list_lru: add per-memcg LRU list infrastructure
  2013-12-02 11:19 ` [PATCH v12 10/18] memcg,list_lru: add per-memcg LRU list infrastructure Vladimir Davydov
@ 2013-12-03 11:18   ` Dave Chinner
  2013-12-03 12:29     ` Vladimir Davydov
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2013-12-03 11:18 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: hannes, mhocko, dchinner, akpm, linux-kernel, linux-mm, cgroups,
	devel, glommer, Al Viro, Balbir Singh, KAMEZAWA Hiroyuki

On Mon, Dec 02, 2013 at 03:19:45PM +0400, Vladimir Davydov wrote:
> FS-shrinkers, which shrink dcaches and icaches, keep dentries and inodes
> in list_lru structures in order to evict least recently used objects.
> With per-memcg kmem shrinking infrastructure introduced, we have to make
> those LRU lists per-memcg in order to allow shrinking FS caches that
> belong to different memory cgroups independently.
> 
> This patch addresses the issue by introducing struct memcg_list_lru.
> This struct aggregates list_lru objects for each kmem-active memcg, and
> keeps it uptodate whenever a memcg is created or destroyed. Its
> interface is very simple: it only allows to get the pointer to the
> appropriate list_lru object from a memcg or a kmem ptr, which should be
> further operated with conventional list_lru methods.

Basically, the idea was that the memcg LRUs hide entirely behind the
generic list_lru interface so that any cache that used the list_lru
infrastructure got memcg capabilities for free. The memcgs to shrink
were to be passed through the shrink control to the list_lru code,
which then did all the "which lru are we using" logic internally.

What you've done is drive all the "which LRU are we using" logic
into every single caller location. i.e. you've just broken the
underlying design principle that Glauber and I had worked towards
with this code - that memcg-aware LRUs should be completely
transparent to list_lru users. Just like NUMA awareness came for
free with the list_lru code, so should memcg awareness....

> +/*
> + * The following structure can be used to reclaim kmem objects accounted to
> + * different memory cgroups independently. It aggregates a set of list_lru
> + * objects, one for each kmem-enabled memcg, and provides the method to get
> + * the lru corresponding to a memcg.
> + */
> +struct memcg_list_lru {
> +	struct list_lru global_lru;
> +
> +#ifdef CONFIG_MEMCG_KMEM
> +	struct list_lru **memcg_lrus;	/* rcu-protected array of per-memcg
> +					   lrus, indexed by memcg_cache_id() */
> +
> +	struct list_head list;		/* list of all memcg-aware lrus */
> +
> +	/*
> +	 * The memcg_lrus array is rcu protected, so we can only free it after
> +	 * a call to synchronize_rcu(). To avoid multiple calls to
> +	 * synchronize_rcu() when many lrus get updated at the same time, which
> +	 * is a typical scenario, we will store the pointer to the previous
> +	 * version of the array in the old_lrus variable for each lru, and then
> +	 * free them all at once after a single call to synchronize_rcu().
> +	 */
> +	void *old_lrus;
> +#endif
> +};

Really, this should be embedded in the struct list_lru, not wrapping
around the outside. I don't see any changelog to tell me why you
changed the code from what was last in Glauber's tree, so can you
explain why exposing all this memcg stuff to everyone is a good
idea?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v12 12/18] fs: make icache, dcache shrinkers memcg-aware
  2013-12-02 11:19 ` [PATCH v12 12/18] fs: make icache, dcache shrinkers memcg-aware Vladimir Davydov
@ 2013-12-03 11:45   ` Dave Chinner
  2013-12-03 12:34     ` Vladimir Davydov
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2013-12-03 11:45 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: hannes, mhocko, dchinner, akpm, linux-kernel, linux-mm, cgroups,
	devel, glommer, Al Viro, Balbir Singh, KAMEZAWA Hiroyuki

On Mon, Dec 02, 2013 at 03:19:47PM +0400, Vladimir Davydov wrote:
> Using the per-memcg LRU infrastructure introduced by previous patches,
> this patch makes dcache and icache shrinkers memcg-aware. To achieve
> that, it converts s_dentry_lru and s_inode_lru from list_lru to
> memcg_list_lru and restricts the reclaim to per-memcg parts of the lists
> in case of memcg pressure.
> 
> Other FS objects are currently ignored and only reclaimed on global
> pressure, because their shrinkers are heavily FS-specific and can't be
> converted to be memcg-aware so easily. However, we can pass on target
> memcg to the FS layer and let it decide if per-memcg objects should be
> reclaimed.

And now you have a big problem, because that means filesystems like
XFS won't reclaim inodes during memcg reclaim.

That is, for XFS, prune_icache_lru() does not free any memory. All
it does is remove all the VFS references to the struct xfs_inode,
which is then reclaimed via the sb->s_op->free_cached_objects()
method.

IOWs, what you've done is broken.

> Note that with this patch applied we lose global LRU order, but it does

We don't have global LRU order today for the filesystem caches.
We have per superblock, per-node LRU reclaim order.

> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -343,18 +343,24 @@ static void dentry_unlink_inode(struct dentry * dentry)
>  #define D_FLAG_VERIFY(dentry,x) WARN_ON_ONCE(((dentry)->d_flags & (DCACHE_LRU_LIST | DCACHE_SHRINK_LIST)) != (x))
>  static void d_lru_add(struct dentry *dentry)
>  {
> +	struct list_lru *lru =
> +		mem_cgroup_kmem_list_lru(&dentry->d_sb->s_dentry_lru, dentry);
> +
>  	D_FLAG_VERIFY(dentry, 0);
>  	dentry->d_flags |= DCACHE_LRU_LIST;
>  	this_cpu_inc(nr_dentry_unused);
> -	WARN_ON_ONCE(!list_lru_add(&dentry->d_sb->s_dentry_lru, &dentry->d_lru));
> +	WARN_ON_ONCE(!list_lru_add(lru, &dentry->d_lru));
>  }

This is what I mean about pushing memcg cruft into places where it
is not necessary. This can be done entirely behind list_lru_add(),
without the caller having to care.
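
i.e. something of this shape inside list_lru.c (purely illustrative -
memcg_lru_of() is a made-up name standing in for the internal
page-to-memcg lookup, and the rest roughly mirrors the existing
per-node add path, with the active_nodes bookkeeping omitted):

bool list_lru_add(struct list_lru *lru, struct list_head *item)
{
	int nid = page_to_nid(virt_to_page(item));
	struct list_lru_node *nlru;

	/* the only memcg-specific step: pick the per-memcg sublist */
	lru = memcg_lru_of(lru, item);
	nlru = &lru->node[nid];

	spin_lock(&nlru->lock);
	if (list_empty(item)) {
		list_add_tail(item, &nlru->list);
		nlru->nr_items++;
		spin_unlock(&nlru->lock);
		return true;
	}
	spin_unlock(&nlru->lock);
	return false;
}

Callers such as d_lru_add() would keep passing &sb->s_dentry_lru
untouched.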

> @@ -970,9 +976,9 @@ dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg)
>  }
>  
>  /**
> - * prune_dcache_sb - shrink the dcache
> - * @sb: superblock
> - * @nr_to_scan : number of entries to try to free
> + * prune_dcache_lru - shrink the dcache
> + * @lru: dentry lru list
> + * @nr_to_scan: number of entries to try to free
>   * @nid: which node to scan for freeable entities
>   *
>   * Attempt to shrink the superblock dcache LRU by @nr_to_scan entries. This is
> @@ -982,14 +988,13 @@ dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg)
>   * This function may fail to free any resources if all the dentries are in
>   * use.
>   */
> -long prune_dcache_sb(struct super_block *sb, unsigned long nr_to_scan,
> -		     int nid)
> +long prune_dcache_lru(struct list_lru *lru, unsigned long nr_to_scan, int nid)
>  {
>  	LIST_HEAD(dispose);
>  	long freed;
>  
> -	freed = list_lru_walk_node(&sb->s_dentry_lru, nid, dentry_lru_isolate,
> -				       &dispose, &nr_to_scan);
> +	freed = list_lru_walk_node(lru, nid, dentry_lru_isolate,
> +				   &dispose, &nr_to_scan);
>  	shrink_dentry_list(&dispose);
>  	return freed;
>  }

And here, you pass an LRU when what we really need to pass is the
struct shrink_control that contains nr_to_scan, nid, and the memcg
that pruning is targetting.

Because of the tight integration of the LRUs and shrinkers, it makes
sense to pass the shrink control all the way into the list. i.e:

	freed = list_lru_scan(&sb->s_dentry_lru, sc, dentry_lru_isolate,
			      &dispose);

And again, that hides everything to do with memcg based LRUs and
reclaim from the callers. It's clean, simple and hard to get wrong.

> @@ -1029,7 +1034,7 @@ void shrink_dcache_sb(struct super_block *sb)
>  	do {
>  		LIST_HEAD(dispose);
>  
> -		freed = list_lru_walk(&sb->s_dentry_lru,
> +		freed = memcg_list_lru_walk_all(&sb->s_dentry_lru,
>  			dentry_lru_isolate_shrink, &dispose, UINT_MAX);
>  

list_lru_walk() is, by definition, supposed to walk every single
object on the LRU. With memcg awareness, it should be walking all
the memcg lists, too.

> diff --git a/fs/super.c b/fs/super.c
> index cece164..b198da4 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -57,6 +57,8 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
>  				      struct shrink_control *sc)
>  {
>  	struct super_block *sb;
> +	struct list_lru *inode_lru;
> +	struct list_lru *dentry_lru;
>  	long	fs_objects = 0;
>  	long	total_objects;
>  	long	freed = 0;
> @@ -75,11 +77,14 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
>  	if (!grab_super_passive(sb))
>  		return SHRINK_STOP;
>  
> -	if (sb->s_op->nr_cached_objects)
> +	if (sb->s_op->nr_cached_objects && !sc->memcg)
>  		fs_objects = sb->s_op->nr_cached_objects(sb, sc->nid);
>  
> -	inodes = list_lru_count_node(&sb->s_inode_lru, sc->nid);
> -	dentries = list_lru_count_node(&sb->s_dentry_lru, sc->nid);
> +	inode_lru = mem_cgroup_list_lru(&sb->s_inode_lru, sc->memcg);
> +	dentry_lru = mem_cgroup_list_lru(&sb->s_dentry_lru, sc->memcg);
> +
> +	inodes = list_lru_count_node(inode_lru, sc->nid);
> +	dentries = list_lru_count_node(dentry_lru, sc->nid);
>  	total_objects = dentries + inodes + fs_objects + 1;

Again: list_lru_count_sc(&sb->s_inode_lru, sc).

And push the scan control down into ->nr_cached_objects, too.

>  	/* proportion the scan between the caches */
> @@ -90,8 +95,8 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
>  	 * prune the dcache first as the icache is pinned by it, then
>  	 * prune the icache, followed by the filesystem specific caches
>  	 */
> -	freed = prune_dcache_sb(sb, dentries, sc->nid);
> -	freed += prune_icache_sb(sb, inodes, sc->nid);
> +	freed = prune_dcache_lru(dentry_lru, dentries, sc->nid);
> +	freed += prune_icache_lru(inode_lru, inodes, sc->nid);

	sc->nr_to_scan = dentries;
	freed += prune_dcache_sb(sb, sc);
	sc->nr_to_scan = inodes;
	freed += prune_icache_sb(sb, sc);
	if (fs_objects) {
		sc->nr_to_scan = mult_frac(sc->nr_to_scan, fs_objects,
					   total_objects);
		freed += sb->s_op->free_cached_objects(sb, sc);
	}

So much simpler, so much nicer. And nothing memcg related in
sight....

> @@ -225,7 +232,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
>  	s->s_shrink.scan_objects = super_cache_scan;
>  	s->s_shrink.count_objects = super_cache_count;
>  	s->s_shrink.batch = 1024;
> -	s->s_shrink.flags = SHRINKER_NUMA_AWARE;
> +	s->s_shrink.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE;

That's basically the only real logic change that should be
necessary to configure memcg LRUs and shrinkers for any user of the
list_lru infrastructure....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v12 09/18] vmscan: shrink slab on memcg pressure
  2013-12-03 10:48   ` Dave Chinner
@ 2013-12-03 12:15     ` Vladimir Davydov
  2013-12-04  4:51       ` Dave Chinner
  0 siblings, 1 reply; 38+ messages in thread
From: Vladimir Davydov @ 2013-12-03 12:15 UTC (permalink / raw)
  To: Dave Chinner
  Cc: hannes, mhocko, dchinner, akpm, linux-kernel, linux-mm, cgroups,
	devel, glommer, Mel Gorman, Rik van Riel, Al Viro, Balbir Singh,
	KAMEZAWA Hiroyuki

On 12/03/2013 02:48 PM, Dave Chinner wrote:
>> @@ -236,11 +236,17 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
>>  		return 0;
>>  
>>  	/*
>> -	 * copy the current shrinker scan count into a local variable
>> -	 * and zero it so that other concurrent shrinker invocations
>> -	 * don't also do this scanning work.
>> +	 * Do not touch global counter of deferred objects on memcg pressure to
>> +	 * avoid isolation issues. Ideally the counter should be per-memcg.
>>  	 */
>> -	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
>> +	if (!shrinkctl->target_mem_cgroup) {
>> +		/*
>> +		 * copy the current shrinker scan count into a local variable
>> +		 * and zero it so that other concurrent shrinker invocations
>> +		 * don't also do this scanning work.
>> +		 */
>> +		nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
>> +	}
> That's ugly. Effectively it means that memcg reclaim is going to be
> completely ineffective when large numbers of allocations and hence
> reclaim attempts are done under GFP_NOFS context.
>
> The only thing that keeps filesystem caches in balance when there is
> lots of filesystem work going on (i.e. lots of GFP_NOFS allocations)
> is the deferal of reclaim work to a context that can do something
> about it.

Imagine the situation: a memcg issues a GFP_NOFS allocation and goes to
shrink_slab(), where the scan work is deferred to the global counter;
then another memcg issues a GFP_KERNEL allocation, also goes to
shrink_slab(), sees a huge number of deferred objects and starts
shrinking them, which is not good IMHO. I understand that nr_deferred
is necessary, but I think it should be per-memcg. What do you think
about moving it to list_lru?
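
Just to illustrate the idea, the counter could live next to each
per-node lru - the nr_deferred field below is the proposed addition,
the rest is the existing list_lru_node layout:

struct list_lru_node {
	spinlock_t		lock;
	struct list_head	list;
	long			nr_items;
	/* proposed: per-memcg, per-node deferred scan count */
	atomic_long_t		nr_deferred;
} ____cacheline_aligned_in_smp;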

>>  	total_scan = nr;
>>  	delta = (4 * fraction) / shrinker->seeks;
>> @@ -296,21 +302,46 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
>>  		cond_resched();
>>  	}
>>  
>> -	/*
>> -	 * move the unused scan count back into the shrinker in a
>> -	 * manner that handles concurrent updates. If we exhausted the
>> -	 * scan, there is no need to do an update.
>> -	 */
>> -	if (total_scan > 0)
>> -		new_nr = atomic_long_add_return(total_scan,
>> +	if (!shrinkctl->target_mem_cgroup) {
>> +		/*
>> +		 * move the unused scan count back into the shrinker in a
>> +		 * manner that handles concurrent updates. If we exhausted the
>> +		 * scan, there is no need to do an update.
>> +		 */
>> +		if (total_scan > 0)
>> +			new_nr = atomic_long_add_return(total_scan,
>>  						&shrinker->nr_deferred[nid]);
>> -	else
>> -		new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
>> +		else
>> +			new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
>> +	}
> So, if the memcg can't make progress, why wouldn't you defer the
> work to the global scan? Or can't a global scan trim memcg LRUs?
> And if it can't, then isn't that a major design flaw? Why not just
> allow kswapd to walk memcg LRUs in the background?
>
> /me just looked at patch 13
>
> Yeah, this goes some way to explaining why something like patch 13
> is necessary - slab shrinkers are not keeping up with page cache
> reclaim because of GFP_NOFS allocations, and so the page cache
> empties only leaving slab caches to be trimmed....
>
>
>> +static unsigned long
>> +shrink_slab_memcg(struct shrink_control *shrinkctl, struct shrinker *shrinker,
>> +		  unsigned long fraction, unsigned long denominator)
> what's this function got to do with memcgs? Why did you rename it
> from the self explanitory shrink_slab_one() name that Glauber gave
> it?

When I sent the previous version, Johannes Weiner disliked the name,
which was why I renamed it; now you don't like the new name and ask for
the old one :-) But why do you think that shrink_slab_one() is
self-explanatory while shrink_slab_memcg() is not? I mean,
shrink_slab_memcg() means "shrink slab accounted to a memcg" just like
shrink_slab_node() means "shrink slab on a node", while seeing
shrink_slab_one() I would ask "one what?".

>> +{
>> +	unsigned long freed = 0;
>> +
>> +	if (shrinkctl->memcg && !memcg_kmem_is_active(shrinkctl->memcg))
>> +		return 0;
> Why here? why not check that in the caller where memcg's are being
> iterated?

No problem, I'll move it.

>> +
>> +	for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
>> +		if (!node_online(shrinkctl->nid))
>> +			continue;
>> +
>> +		if (!(shrinker->flags & SHRINKER_NUMA_AWARE) &&
>> +		    (shrinkctl->nid != 0))
>> +			break;
> Hmmm - this looks broken. Nothing guarantees that node 0 in
> shrinkctl->nodes_to_scan is ever set, so non-numa aware shrinkers
> will do nothing when the first node in the mask is not set. For non-numa
> aware shrinkers, the shrinker should always be called once with a
> node id of 0.

That's how it operates now - this patch simply moved this piece from
shrink_slab(). I'll fix this.

> That's what earlier versions of the numa aware shrinker patch set
> did, and it seems to have been lost along the way.  Yeah, there's
> the last version from Glauber's tree that I saw:
>
> static unsigned long
> shrink_slab_one(struct shrink_control *shrinkctl, struct shrinker *shrinker,
>                unsigned long nr_pages_scanned, unsigned long lru_pages)
> {
>        unsigned long freed = 0;
>
>        if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) {
>                shrinkctl->nid = 0;
>
>                return shrink_slab_node(shrinkctl, shrinker,
>                         nr_pages_scanned, lru_pages,
>                         &shrinker->nr_deferred);
>        }
>
>        for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan)
>
>                if (!node_online(shrinkctl->nid))
>                        continue;
>
>                freed += shrink_slab_node(shrinkctl, shrinker,
>                         nr_pages_scanned, lru_pages,
> 			 &shrinker->nr_deferred_node[shrinkctl->nid]);
>        }
>
>        return freed;
> }
>
> So, that's likely to be another reason that all the non-numa slab
> caches are not being shrunk appropriately and need to be hit with a
> bit hammer...
>
>> @@ -352,18 +383,23 @@ unsigned long shrink_slab(struct shrink_control *shrinkctl,
>>  	}
>>  
>>  	list_for_each_entry(shrinker, &shrinker_list, list) {
>> -		for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
>> -			if (!node_online(shrinkctl->nid))
>> -				continue;
>> -
>> -			if (!(shrinker->flags & SHRINKER_NUMA_AWARE) &&
>> -			    (shrinkctl->nid != 0))
>> +		shrinkctl->memcg = shrinkctl->target_mem_cgroup;
>> +		do {
>> +			if (!(shrinker->flags & SHRINKER_MEMCG_AWARE) &&
>> +			    (shrinkctl->memcg != NULL)) {
>> +				mem_cgroup_iter_break(
>> +						shrinkctl->target_mem_cgroup,
>> +						shrinkctl->memcg);
>>  				break;
>> +			}
>>  
>> -			freed += shrink_slab_node(shrinkctl, shrinker,
>> -						  fraction, denominator);
>> +			freed += shrink_slab_memcg(shrinkctl, shrinker,
>> +						   fraction, denominator);
>> +			shrinkctl->memcg = mem_cgroup_iter(
>> +						shrinkctl->target_mem_cgroup,
>> +						shrinkctl->memcg, NULL);
>> +		} while (shrinkctl->memcg);
> Glauber's tree also had a bunch of comments explaining what was
> going on here. I've got no idea what the hell this code is doing,
> and why the hell we are iterating memcgs here and how and why the
> normal, non-memcg scan and shrinkers still worked.

I found this code straightforward, just like shrink_zone(), which
also lacks comments, but I admit I was wrong and I'll try to improve.

> This is now just a bunch of memcg gobbledegook with no explanations
> to tell us what it is supposed to be doing. Comments are important -
> you might not think they are necessary, but seeing comments like
> this:
>
> +               /*
> +                * In a hierarchical chain, it might be that not all memcgs are
> +                * kmem active. kmemcg design mandates that when one memcg is
> +                * active, its children will be active as well. But it is
> +                * perfectly possible that its parent is not.
> +                *
> +                * We also need to make sure we scan at least once, for the
> +                * global case. So if we don't have a target memcg (saved in
> +                * root), we proceed normally and expect to break in the next
> +                * round.
> +                */
>
> in Glauber's tree helped an awful lot to explain the mess that the
> memcg stuff was making of the code...
>
> I'm liking this patch set less and less as I work my way through
> it...

Will try to improve.

Thanks.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v12 10/18] memcg,list_lru: add per-memcg LRU list infrastructure
  2013-12-03 11:18   ` Dave Chinner
@ 2013-12-03 12:29     ` Vladimir Davydov
  2013-12-05 21:19       ` Dave Chinner
  0 siblings, 1 reply; 38+ messages in thread
From: Vladimir Davydov @ 2013-12-03 12:29 UTC (permalink / raw)
  To: Dave Chinner
  Cc: hannes, mhocko, dchinner, akpm, linux-kernel, linux-mm, cgroups,
	devel, glommer, Al Viro, Balbir Singh, KAMEZAWA Hiroyuki

On 12/03/2013 03:18 PM, Dave Chinner wrote:
> On Mon, Dec 02, 2013 at 03:19:45PM +0400, Vladimir Davydov wrote:
>> FS-shrinkers, which shrink dcaches and icaches, keep dentries and inodes
>> in list_lru structures in order to evict least recently used objects.
>> With per-memcg kmem shrinking infrastructure introduced, we have to make
>> those LRU lists per-memcg in order to allow shrinking FS caches that
>> belong to different memory cgroups independently.
>>
>> This patch addresses the issue by introducing struct memcg_list_lru.
>> This struct aggregates list_lru objects for each kmem-active memcg and
>> keeps the set up to date whenever a memcg is created or destroyed. Its
>> interface is very simple: it only allows getting the pointer to the
>> appropriate list_lru object from a memcg or a kmem ptr, which can then
>> be operated on with the conventional list_lru methods.
> Basically, the idea was that the memcg LRUs hide entirely behind the
> generic list_lru interface so that any cache that used the list_lru
> infrastructure got memcg capabilities for free. The memcgs to shrink
> were to be passed through the shrink_control to the shrinkers and on
> to the list LRU code, and it then did all the "which lru are we using"
> logic internally.
>
> What you've done is driven all the "which LRU are we using" logic
> into every single caller location. i.e. you've just broken the
> underlying design principle that Glauber and I had worked towards
> with this code - that memcg aware LRUs should be completely
> transparent to list_lru users. Just like NUMA awareness came for
> free with the list_lru code, so should memcg awareness....
>
>> +/*
>> + * The following structure can be used to reclaim kmem objects accounted to
>> + * different memory cgroups independently. It aggregates a set of list_lru
>> + * objects, one for each kmem-enabled memcg, and provides the method to get
>> + * the lru corresponding to a memcg.
>> + */
>> +struct memcg_list_lru {
>> +	struct list_lru global_lru;
>> +
>> +#ifdef CONFIG_MEMCG_KMEM
>> +	struct list_lru **memcg_lrus;	/* rcu-protected array of per-memcg
>> +					   lrus, indexed by memcg_cache_id() */
>> +
>> +	struct list_head list;		/* list of all memcg-aware lrus */
>> +
>> +	/*
>> +	 * The memcg_lrus array is rcu protected, so we can only free it after
>> +	 * a call to synchronize_rcu(). To avoid multiple calls to
>> +	 * synchronize_rcu() when many lrus get updated at the same time, which
>> +	 * is a typical scenario, we will store the pointer to the previous
>> +	 * version of the array in the old_lrus variable for each lru, and then
>> +	 * free them all at once after a single call to synchronize_rcu().
>> +	 */
>> +	void *old_lrus;
>> +#endif
>> +};
> Really, this should be embedded in the struct list_lru, not wrapping
> around the outside. I don't see any changelog to tell me why you
> changed the code from what was last in Glauber's tree, so can you
> explain why exposing all this memcg stuff to everyone is a good
> idea?

I preferred to move from list_lru to memcg_list_lru, because the
connection between list_lru and memcgs turned memcontrol.c and
list_lru.c into a monolithic structure. When I read the comments on the
last version of this patchset submitted by Glauber (v10), I found that
Andrew Morton disliked it, which was why I tried to "fix" it the way you
observe in this patch. Besides, I thought that list_lru might be used
w/o memcgs.

I didn't participate in the previous discussion, so I don't know all
your plans for it :-( If you think it's unacceptable, I'll try to find
another way around it.
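
For instance, embedding the per-memcg part in struct list_lru itself and
hiding the lookup behind list_lru_add(), as you describe, might look
roughly like this (just a sketch; memcg_from_kmem_ptr(), memcg_lru_node()
and __list_lru_add() are made-up names for illustration):

struct list_lru {
	struct list_lru_node *node;	/* per-node global lists, as today */
#ifdef CONFIG_MEMCG_KMEM
	/*
	 * rcu-protected array of per-memcg per-node lists, indexed by
	 * memcg_cache_id(); NULL until some memcg becomes kmem-active.
	 */
	struct list_lru_node **memcg_node;
	struct list_head lrus;		/* list of all memcg-aware lrus */
#endif
};

bool list_lru_add(struct list_lru *lru, struct list_head *item)
{
	int nid = page_to_nid(virt_to_page(item));
	struct list_lru_node *nlru = &lru->node[nid];

#ifdef CONFIG_MEMCG_KMEM
	struct mem_cgroup *memcg = memcg_from_kmem_ptr(item);

	/* accounted object: redirect to its memcg's private list */
	if (memcg)
		nlru = memcg_lru_node(lru, memcg, nid);
#endif
	return __list_lru_add(nlru, item);	/* existing node-level add */
}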

Thanks.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v12 12/18] fs: make icache, dcache shrinkers memcg-aware
  2013-12-03 11:45   ` Dave Chinner
@ 2013-12-03 12:34     ` Vladimir Davydov
  0 siblings, 0 replies; 38+ messages in thread
From: Vladimir Davydov @ 2013-12-03 12:34 UTC (permalink / raw)
  To: Dave Chinner
  Cc: hannes, mhocko, dchinner, akpm, linux-kernel, linux-mm, cgroups,
	devel, glommer, Al Viro, Balbir Singh, KAMEZAWA Hiroyuki

On 12/03/2013 03:45 PM, Dave Chinner wrote:
> On Mon, Dec 02, 2013 at 03:19:47PM +0400, Vladimir Davydov wrote:
>> Using the per-memcg LRU infrastructure introduced by previous patches,
>> this patch makes dcache and icache shrinkers memcg-aware. To achieve
>> that, it converts s_dentry_lru and s_inode_lru from list_lru to
>> memcg_list_lru and restricts the reclaim to per-memcg parts of the lists
>> in case of memcg pressure.
>>
>> Other FS objects are currently ignored and only reclaimed on global
>> pressure, because their shrinkers are heavily FS-specific and can't be
>> converted to be memcg-aware so easily. However, we can pass the target
>> memcg on to the FS layer and let it decide whether per-memcg objects
>> should be reclaimed.
> And now you have a big problem, because that means filesystems like
> XFS won't reclaim inodes during memcg reclaim.
>
> That is, for XFS, prune_icache_lru() does not free any memory. All
> it does is remove all the VFS references to the struct xfs_inode,
> which is then reclaimed via the sb->s_op->free_cached_objects()
> method.
>
> IOWs, what you've done is broken.

Missed that, thanks for pointing out.

>
>> Note that with this patch applied we lose global LRU order, but it does
> We don't have global LRU order today for the filesystem caches.
> We have per superblock, per-node LRU reclaim order.
>
>> --- a/fs/dcache.c
>> +++ b/fs/dcache.c
>> @@ -343,18 +343,24 @@ static void dentry_unlink_inode(struct dentry * dentry)
>>  #define D_FLAG_VERIFY(dentry,x) WARN_ON_ONCE(((dentry)->d_flags & (DCACHE_LRU_LIST | DCACHE_SHRINK_LIST)) != (x))
>>  static void d_lru_add(struct dentry *dentry)
>>  {
>> +	struct list_lru *lru =
>> +		mem_cgroup_kmem_list_lru(&dentry->d_sb->s_dentry_lru, dentry);
>> +
>>  	D_FLAG_VERIFY(dentry, 0);
>>  	dentry->d_flags |= DCACHE_LRU_LIST;
>>  	this_cpu_inc(nr_dentry_unused);
>> -	WARN_ON_ONCE(!list_lru_add(&dentry->d_sb->s_dentry_lru, &dentry->d_lru));
>> +	WARN_ON_ONCE(!list_lru_add(lru, &dentry->d_lru));
>>  }
> This is what I mean about pushing memcg cruft into places where it
> is not necessary. This can be done entirely behind list_lru_add(),
> without the caller having to care.
>
>> @@ -970,9 +976,9 @@ dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg)
>>  }
>>  
>>  /**
>> - * prune_dcache_sb - shrink the dcache
>> - * @sb: superblock
>> - * @nr_to_scan : number of entries to try to free
>> + * prune_dcache_lru - shrink the dcache
>> + * @lru: dentry lru list
>> + * @nr_to_scan: number of entries to try to free
>>   * @nid: which node to scan for freeable entities
>>   *
>>   * Attempt to shrink the superblock dcache LRU by @nr_to_scan entries. This is
>> @@ -982,14 +988,13 @@ dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg)
>>   * This function may fail to free any resources if all the dentries are in
>>   * use.
>>   */
>> -long prune_dcache_sb(struct super_block *sb, unsigned long nr_to_scan,
>> -		     int nid)
>> +long prune_dcache_lru(struct list_lru *lru, unsigned long nr_to_scan, int nid)
>>  {
>>  	LIST_HEAD(dispose);
>>  	long freed;
>>  
>> -	freed = list_lru_walk_node(&sb->s_dentry_lru, nid, dentry_lru_isolate,
>> -				       &dispose, &nr_to_scan);
>> +	freed = list_lru_walk_node(lru, nid, dentry_lru_isolate,
>> +				   &dispose, &nr_to_scan);
>>  	shrink_dentry_list(&dispose);
>>  	return freed;
>>  }
> And here, you pass an LRU when what we really need to pass is the
> struct shrink_control that contains nr_to_scan, nid, and the memcg
> that pruning is targeting.
>
> Because of the tight integration of the LRUs and shrinkers, it makes
> sense to pass the shrink control all the way into the list. i.e:
>
> 	freed = list_lru_scan(&sb->s_dentry_lru, sc, dentry_lru_isolate,
> 			      &dispose);
>
> And again, that hides everything to do with memcg based LRUs and
> reclaim from the callers. It's clean, simple and hard to get wrong.
>
>> @@ -1029,7 +1034,7 @@ void shrink_dcache_sb(struct super_block *sb)
>>  	do {
>>  		LIST_HEAD(dispose);
>>  
>> -		freed = list_lru_walk(&sb->s_dentry_lru,
>> +		freed = memcg_list_lru_walk_all(&sb->s_dentry_lru,
>>  			dentry_lru_isolate_shrink, &dispose, UINT_MAX);
>>  
> list_lru_walk() is, by definition, supposed to walk every single
> object on the LRU. With memcg awareness, it should be walking all
> the memcg lists, too.
>
>> diff --git a/fs/super.c b/fs/super.c
>> index cece164..b198da4 100644
>> --- a/fs/super.c
>> +++ b/fs/super.c
>> @@ -57,6 +57,8 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
>>  				      struct shrink_control *sc)
>>  {
>>  	struct super_block *sb;
>> +	struct list_lru *inode_lru;
>> +	struct list_lru *dentry_lru;
>>  	long	fs_objects = 0;
>>  	long	total_objects;
>>  	long	freed = 0;
>> @@ -75,11 +77,14 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
>>  	if (!grab_super_passive(sb))
>>  		return SHRINK_STOP;
>>  
>> -	if (sb->s_op->nr_cached_objects)
>> +	if (sb->s_op->nr_cached_objects && !sc->memcg)
>>  		fs_objects = sb->s_op->nr_cached_objects(sb, sc->nid);
>>  
>> -	inodes = list_lru_count_node(&sb->s_inode_lru, sc->nid);
>> -	dentries = list_lru_count_node(&sb->s_dentry_lru, sc->nid);
>> +	inode_lru = mem_cgroup_list_lru(&sb->s_inode_lru, sc->memcg);
>> +	dentry_lru = mem_cgroup_list_lru(&sb->s_dentry_lru, sc->memcg);
>> +
>> +	inodes = list_lru_count_node(inode_lru, sc->nid);
>> +	dentries = list_lru_count_node(dentry_lru, sc->nid);
>>  	total_objects = dentries + inodes + fs_objects + 1;
> Again: list_lru_count_sc(&sb->s_inode_lru, sc).
>
> And push the scan control down into ->nr_cached_objects, too.

Thanks for the tip! Now I see your point: it would really look better if
we used the shrink_control here.
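
E.g., prune_dcache_sb() could then stay as simple as this (a sketch
assuming a list_lru_shrink_walk()-style helper that takes the
shrink_control and resolves the node/memcg list internally):

long prune_dcache_sb(struct super_block *sb, struct shrink_control *sc)
{
	LIST_HEAD(dispose);
	long freed;

	/* the lru code picks the right list from sc->nid and sc->memcg */
	freed = list_lru_shrink_walk(&sb->s_dentry_lru, sc,
				     dentry_lru_isolate, &dispose);
	shrink_dentry_list(&dispose);
	return freed;
}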

Thanks.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v12 05/18] fs: do not use destroy_super() in alloc_super() fail path
  2013-12-03  9:23     ` Vladimir Davydov
@ 2013-12-03 13:37       ` Al Viro
  2013-12-03 13:48         ` Vladimir Davydov
  0 siblings, 1 reply; 38+ messages in thread
From: Al Viro @ 2013-12-03 13:37 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Dave Chinner, hannes, mhocko, dchinner, akpm, linux-kernel,
	linux-mm, cgroups, devel, glommer

On Tue, Dec 03, 2013 at 01:23:01PM +0400, Vladimir Davydov wrote:

> Actually, I'm not going to modify the list_lru structure, because I
> think it's good as it is. I'd like to substitute it with a new
> structure, memcg_list_lru, only in those places where this functionality
> (per-memcg scanning) is really needed. This new structure would look
> like this:
> 
> struct memcg_list_lru {
>     struct list_lru global_lru;
>     struct list_lru **memcg_lrus;
>     struct list_head list;
>     void *old_lrus;
> }
> 
> Since old_lrus and memcg_lrus can be NULL under normal operation, in
> memcg_list_lru_destroy() I'd have to check either the list or the
> global_lru field, i.e. it would look like:
> 
> if (!list.next)
>     /* has not been initialized */
>     return;
> 
> or

... or just use hlist_head.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v12 05/18] fs: do not use destroy_super() in alloc_super() fail path
  2013-12-03 13:37       ` Al Viro
@ 2013-12-03 13:48         ` Vladimir Davydov
  0 siblings, 0 replies; 38+ messages in thread
From: Vladimir Davydov @ 2013-12-03 13:48 UTC (permalink / raw)
  To: Al Viro
  Cc: Dave Chinner, hannes, mhocko, dchinner, akpm, linux-kernel,
	linux-mm, cgroups, devel, glommer

On 12/03/2013 05:37 PM, Al Viro wrote:
> On Tue, Dec 03, 2013 at 01:23:01PM +0400, Vladimir Davydov wrote:
>
>> Actually, I'm not going to modify the list_lru structure, because I
>> think it's good as it is. I'd like to substitute it with a new
>> structure, memcg_list_lru, only in those places where this functionality
>> (per-memcg scanning) is really needed. This new structure would look
>> like this:
>>
>> struct memcg_list_lru {
>>     struct list_lru global_lru;
>>     struct list_lru **memcg_lrus;
>>     struct list_head list;
>>     void *old_lrus;
>> }
>>
>> Since old_lrus and memcg_lrus can be NULL under normal operation, in
>> memcg_list_lru_destroy() I'd have to check either the list or the
>> global_lru field, i.e. it would look like:
>>
>> if (!list.next)
>>     /* has not been initialized */
>>     return;
>>
>> or
> ... or just use hlist_head.

The list_head serves as a list node here (those structures are organized
in a linked list), and I have to remove it from the list upon
destruction, so hlist_head is not relevant here.

Thanks.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v12 09/18] vmscan: shrink slab on memcg pressure
  2013-12-03 12:15     ` Vladimir Davydov
@ 2013-12-04  4:51       ` Dave Chinner
  2013-12-04  6:31         ` Vladimir Davydov
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2013-12-04  4:51 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: hannes, mhocko, dchinner, akpm, linux-kernel, linux-mm, cgroups,
	devel, glommer, Mel Gorman, Rik van Riel, Al Viro, Balbir Singh,
	KAMEZAWA Hiroyuki

On Tue, Dec 03, 2013 at 04:15:57PM +0400, Vladimir Davydov wrote:
> On 12/03/2013 02:48 PM, Dave Chinner wrote:
> >> @@ -236,11 +236,17 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
> >>  		return 0;
> >>  
> >>  	/*
> >> -	 * copy the current shrinker scan count into a local variable
> >> -	 * and zero it so that other concurrent shrinker invocations
> >> -	 * don't also do this scanning work.
> >> +	 * Do not touch global counter of deferred objects on memcg pressure to
> >> +	 * avoid isolation issues. Ideally the counter should be per-memcg.
> >>  	 */
> >> -	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
> >> +	if (!shrinkctl->target_mem_cgroup) {
> >> +		/*
> >> +		 * copy the current shrinker scan count into a local variable
> >> +		 * and zero it so that other concurrent shrinker invocations
> >> +		 * don't also do this scanning work.
> >> +		 */
> >> +		nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
> >> +	}
> > That's ugly. Effectively it means that memcg reclaim is going to be
> > completely ineffective when large numbers of allocations and hence
> > reclaim attempts are done under GFP_NOFS context.
> >
> > The only thing that keeps filesystem caches in balance when there is
> > lots of filesystem work going on (i.e. lots of GFP_NOFS allocations)
> is the deferral of reclaim work to a context that can do something
> > about it.
> 
> Imagine the situation: a memcg issues a GFP_NOFS allocation and goes to
> shrink_slab() where it defers them to the global counter; then another
> memcg issues a GFP_KERNEL allocation, also goes to shrink_slab() where
> it sees a huge number of deferred objects and starts shrinking them,
> which is not good IMHO.

That's exactly what the deferred mechanism is for - we know we have
to do the work, but we can't do it right now so let someone else do
it who can.

In most cases, deferral is handled by kswapd, because when a
filesystem workload is causing memory pressure then most allocations
are done in GFP_NOFS conditions. Hence the only memory reclaim that
can make progress here is kswapd.

Right now, you aren't deferring any of this memory pressure to some
other agent, so it just does not get done. That's a massive problem
- it's a design flaw - and instead I see lots of crazy hacks being
added to do stuff that should simply be deferred to kswapd like is
done for global memory pressure.

Hell, kswapd should be allowed to walk memcg LRU lists and trim
them, just like it does for the global lists. We only need a single
"deferred work" counter per node for that - just let kswapd
proportion the deferred work over the per-node LRU and the
memcgs....
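
For reference, the defer/pick-up cycle looks roughly like this
(simplified from shrink_slab_node(), not a verbatim copy;
run_one_batch() is just a stand-in for the actual scan loop):

static unsigned long do_shrinker_scan(struct shrinker *shrinker, int nid,
				      unsigned long delta)
{
	unsigned long batch = shrinker->batch ? : SHRINK_BATCH;
	unsigned long nr, total_scan, freed = 0;

	/* pick up work deferred by earlier callers (e.g. GFP_NOFS ones) */
	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
	total_scan = nr + delta;

	while (total_scan >= batch) {
		freed += run_one_batch(shrinker, nid);
		total_scan -= batch;
	}

	/*
	 * Put back whatever we could not do now; kswapd or the next
	 * caller that can make progress will pick it up.
	 */
	if (total_scan > 0)
		atomic_long_add(total_scan, &shrinker->nr_deferred[nid]);

	return freed;
}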

> I understand that nr_deferred is necessary, but
> I think it should be per-memcg. What do you think about moving it to
> list_lru?

It's part of the shrinker state that is used to calculate how much
work the shrinker needs to do. We can't hold it in the LRU, because
there is no guarantee that a shrinker actually uses a list_lru, and
shrinkers can be memcg aware without using list_lru infrastructure.

So, no, moving it to the list-lru is not a solution....

> > So, if the memcg can't make progress, why wouldn't you defer the
> > work to the global scan? Or can't a global scan trim memcg LRUs?
> > And if it can't, then isn't that a major design flaw? Why not just
> > allow kswapd to walk memcg LRUs in the background?
> >
> > /me just looked at patch 13
> >
> > Yeah, this goes some way to explaining why something like patch 13
> > is necessary - slab shrinkers are not keeping up with page cache
> > reclaim because of GFP_NOFS allocations, and so the page cache
> > empties only leaving slab caches to be trimmed....
> >
> >
> >> +static unsigned long
> >> +shrink_slab_memcg(struct shrink_control *shrinkctl, struct shrinker *shrinker,
> >> +		  unsigned long fraction, unsigned long denominator)
> > what's this function got to do with memcgs? Why did you rename it
> from the self-explanatory shrink_slab_one() name that Glauber gave
> > it?
> 
> When I sent the previous version, Johannes Weiner disliked the name,
> which was why I renamed it; now you don't like the new name and ask for
> the old one :-) But why do you think that shrink_slab_one() is
> self-explanatory while shrink_slab_memcg() is not? I mean
> shrink_slab_memcg() means "shrink slab accounted to a memcg" just like

But it's not shrinking a slab accounted to a memcg - the memcg can
be null. All it is doing is executing a shrinker scan...

> shrink_slab_node() means "shrink slab on the node" while seeing
> shrink_slab_one() I would ask "one what?".

It's running a shrinker scan on *one* shrinker. It doesn't matter if
the shrinker is memcg aware, or numa aware, it's just running one
shrinker....

So, while shrink_slab_one() might not be the best name, it's
certainly more correct than shrink_slab_memcg(). Perhaps it would be
better named run_shrinker()....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v12 09/18] vmscan: shrink slab on memcg pressure
  2013-12-04  4:51       ` Dave Chinner
@ 2013-12-04  6:31         ` Vladimir Davydov
  2013-12-05  5:01           ` Dave Chinner
  0 siblings, 1 reply; 38+ messages in thread
From: Vladimir Davydov @ 2013-12-04  6:31 UTC (permalink / raw)
  To: Dave Chinner
  Cc: hannes, mhocko, dchinner, akpm, linux-kernel, linux-mm, cgroups,
	devel, glommer, Mel Gorman, Rik van Riel, Al Viro, Balbir Singh,
	KAMEZAWA Hiroyuki

On 12/04/2013 08:51 AM, Dave Chinner wrote:
> On Tue, Dec 03, 2013 at 04:15:57PM +0400, Vladimir Davydov wrote:
>> On 12/03/2013 02:48 PM, Dave Chinner wrote:
>>>> @@ -236,11 +236,17 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
>>>>  		return 0;
>>>>  
>>>>  	/*
>>>> -	 * copy the current shrinker scan count into a local variable
>>>> -	 * and zero it so that other concurrent shrinker invocations
>>>> -	 * don't also do this scanning work.
>>>> +	 * Do not touch global counter of deferred objects on memcg pressure to
>>>> +	 * avoid isolation issues. Ideally the counter should be per-memcg.
>>>>  	 */
>>>> -	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
>>>> +	if (!shrinkctl->target_mem_cgroup) {
>>>> +		/*
>>>> +		 * copy the current shrinker scan count into a local variable
>>>> +		 * and zero it so that other concurrent shrinker invocations
>>>> +		 * don't also do this scanning work.
>>>> +		 */
>>>> +		nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
>>>> +	}
>>> That's ugly. Effectively it means that memcg reclaim is going to be
>>> completely ineffective when large numbers of allocations and hence
>>> reclaim attempts are done under GFP_NOFS context.
>>>
>>> The only thing that keeps filesystem caches in balance when there is
>>> lots of filesystem work going on (i.e. lots of GFP_NOFS allocations)
>>> is the deferral of reclaim work to a context that can do something
>>> about it.
>> Imagine the situation: a memcg issues a GFP_NOFS allocation and goes to
>> shrink_slab() where it defers them to the global counter; then another
>> memcg issues a GFP_KERNEL allocation, also goes to shrink_slab() where
>> it sees a huge number of deferred objects and starts shrinking them,
>> which is not good IMHO.
> That's exactly what the deferred mechanism is for - we know we have
> to do the work, but we can't do it right now so let someone else do
> it who can.
>
> In most cases, deferral is handled by kswapd, because when a
> filesystem workload is causing memory pressure then most allocations
> are done in GFP_NOFS conditions. Hence the only memory reclaim that
> can make progress here is kswapd.
>
> Right now, you aren't deferring any of this memory pressure to some
> other agent, so it just does not get done. That's a massive problem
> - it's a design flaw - and instead I see lots of crazy hacks being
> added to do stuff that should simply be deferred to kswapd like is
> done for global memory pressure.
>
> Hell, kswapd should be allowed to walk memcg LRU lists and trim
> them, just like it does for the global lists. We only need a single
> "deferred work" counter per node for that - just let kswapd
> proportion the deferred work over the per-node LRU and the
> memcgs....

Seems I misunderstood :-(

Let me try. You mean we have only one nr_deferred counter per node, and
kswapd scans

nr_deferred*memcg_kmem_size/total_kmem_size

objects in each memcg, right?
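
In code that would be something like this sketch (memcg_kmem_size() and
node_total_kmem_size() are made-up helpers, just to illustrate my
understanding):

static unsigned long memcg_deferred_share(struct shrinker *shrinker, int nid,
					  struct mem_cgroup *memcg)
{
	unsigned long nr = atomic_long_read(&shrinker->nr_deferred[nid]);

	/*
	 * Split the per-node deferred work proportionally to the memcg's
	 * share of the node's kmem usage.
	 */
	return nr * memcg_kmem_size(memcg, nid) /
	       (node_total_kmem_size(nid) + 1);
}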

Then if there were a lot of objects deferred on memcg (not global)
pressure due to a memcg issuing a lot of GFP_NOFS allocations, kswapd
will reclaim objects from all, even unlimited, memcgs. This looks like
an isolation issue :-/

Currently we have a per-node nr_deferred counter for each shrinker. If
we add per-memcg reclaim, we have to make it per-memcg per-node, don't we?

Thanks.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v12 09/18] vmscan: shrink slab on memcg pressure
  2013-12-04  6:31         ` Vladimir Davydov
@ 2013-12-05  5:01           ` Dave Chinner
  2013-12-05  6:57             ` Vladimir Davydov
  0 siblings, 1 reply; 38+ messages in thread
From: Dave Chinner @ 2013-12-05  5:01 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: hannes, mhocko, dchinner, akpm, linux-kernel, linux-mm, cgroups,
	devel, glommer, Mel Gorman, Rik van Riel, Al Viro, Balbir Singh,
	KAMEZAWA Hiroyuki

On Wed, Dec 04, 2013 at 10:31:32AM +0400, Vladimir Davydov wrote:
> On 12/04/2013 08:51 AM, Dave Chinner wrote:
> > On Tue, Dec 03, 2013 at 04:15:57PM +0400, Vladimir Davydov wrote:
> >> On 12/03/2013 02:48 PM, Dave Chinner wrote:
> >>>> @@ -236,11 +236,17 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
> >>>>  		return 0;
> >>>>  
> >>>>  	/*
> >>>> -	 * copy the current shrinker scan count into a local variable
> >>>> -	 * and zero it so that other concurrent shrinker invocations
> >>>> -	 * don't also do this scanning work.
> >>>> +	 * Do not touch global counter of deferred objects on memcg pressure to
> >>>> +	 * avoid isolation issues. Ideally the counter should be per-memcg.
> >>>>  	 */
> >>>> -	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
> >>>> +	if (!shrinkctl->target_mem_cgroup) {
> >>>> +		/*
> >>>> +		 * copy the current shrinker scan count into a local variable
> >>>> +		 * and zero it so that other concurrent shrinker invocations
> >>>> +		 * don't also do this scanning work.
> >>>> +		 */
> >>>> +		nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
> >>>> +	}
> >>> That's ugly. Effectively it means that memcg reclaim is going to be
> >>> completely ineffective when large numbers of allocations and hence
> >>> reclaim attempts are done under GFP_NOFS context.
> >>>
> >>> The only thing that keeps filesystem caches in balance when there is
> >>> lots of filesystem work going on (i.e. lots of GFP_NOFS allocations)
> >>> is the deferral of reclaim work to a context that can do something
> >>> about it.
> >> Imagine the situation: a memcg issues a GFP_NOFS allocation and goes to
> >> shrink_slab() where it defers them to the global counter; then another
> >> memcg issues a GFP_KERNEL allocation, also goes to shrink_slab() where
> >> it sees a huge number of deferred objects and starts shrinking them,
> >> which is not good IMHO.
> > That's exactly what the deferred mechanism is for - we know we have
> > to do the work, but we can't do it right now so let someone else do
> > it who can.
> >
> > In most cases, deferral is handled by kswapd, because when a
> > filesystem workload is causing memory pressure then most allocations
> > are done in GFP_NOFS conditions. Hence the only memory reclaim that
> > can make progress here is kswapd.
> >
> > Right now, you aren't deferring any of this memory pressure to some
> > other agent, so it just does not get done. That's a massive problem
> > - it's a design flaw - and instead I see lots of crazy hacks being
> > added to do stuff that should simply be deferred to kswapd like is
> > done for global memory pressure.
> >
> > Hell, kswapd should be allowed to walk memcg LRU lists and trim
> > them, just like it does for the global lists. We only need a single
> > "deferred work" counter per node for that - just let kswapd
> > proportion the deferred work over the per-node LRU and the
> > memcgs....
> 
> Seems I misunderstood :-(
> 
> Let me try. You mean we have only one nr_deferred counter per node, and
> kswapd scans
> 
> nr_deferred*memcg_kmem_size/total_kmem_size
> 
> objects in each memcg, right?
> 
> Then if there were a lot of objects deferred on memcg (not global)
> pressure due to a memcg issuing a lot of GFP_NOFS allocations, kswapd
> will reclaim objects from all, even unlimited, memcgs. This looks like
> an isolation issue :-/

Which, when you are running out of memory, is much less of an
issue than not being able to make progress reclaiming memory.

Besides, the "isolation" argument runs both ways. e.g. when there
isn't memory available, it's entirely possible it's because there is
actually no free memory, not because we've hit a memcg limit. e.g.
all the memory has been consumed by an unlimited memcg, and we need to
reclaim from it so this memcg can make progress.

In those situations we need to reclaim from everyone, not
just the memcg that can't find free memory to allocate....

> Currently we have a per-node nr_deferred counter for each shrinker. If
> we add per-memcg reclaim, we have to make it per-memcg per-node, don't we?

Think about what you just said for a moment. We have how many memcg
shrinkers?  And we can support how many nodes? And we can support
how many memcgs? And when we multiply that all together, how much
memory do we need to track that?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v12 09/18] vmscan: shrink slab on memcg pressure
  2013-12-05  5:01           ` Dave Chinner
@ 2013-12-05  6:57             ` Vladimir Davydov
  0 siblings, 0 replies; 38+ messages in thread
From: Vladimir Davydov @ 2013-12-05  6:57 UTC (permalink / raw)
  To: Dave Chinner
  Cc: hannes, mhocko, dchinner, akpm, linux-kernel, linux-mm, cgroups,
	devel, glommer, Mel Gorman, Rik van Riel, Al Viro, Balbir Singh,
	KAMEZAWA Hiroyuki

On 12/05/2013 09:01 AM, Dave Chinner wrote:
> On Wed, Dec 04, 2013 at 10:31:32AM +0400, Vladimir Davydov wrote:
>> On 12/04/2013 08:51 AM, Dave Chinner wrote:
>>> On Tue, Dec 03, 2013 at 04:15:57PM +0400, Vladimir Davydov wrote:
>>>> On 12/03/2013 02:48 PM, Dave Chinner wrote:
>>>>>> @@ -236,11 +236,17 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
>>>>>>  		return 0;
>>>>>>  
>>>>>>  	/*
>>>>>> -	 * copy the current shrinker scan count into a local variable
>>>>>> -	 * and zero it so that other concurrent shrinker invocations
>>>>>> -	 * don't also do this scanning work.
>>>>>> +	 * Do not touch global counter of deferred objects on memcg pressure to
>>>>>> +	 * avoid isolation issues. Ideally the counter should be per-memcg.
>>>>>>  	 */
>>>>>> -	nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
>>>>>> +	if (!shrinkctl->target_mem_cgroup) {
>>>>>> +		/*
>>>>>> +		 * copy the current shrinker scan count into a local variable
>>>>>> +		 * and zero it so that other concurrent shrinker invocations
>>>>>> +		 * don't also do this scanning work.
>>>>>> +		 */
>>>>>> +		nr = atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
>>>>>> +	}
>>>>> That's ugly. Effectively it means that memcg reclaim is going to be
>>>>> completely ineffective when large numbers of allocations and hence
>>>>> reclaim attempts are done under GFP_NOFS context.
>>>>>
>>>>> The only thing that keeps filesystem caches in balance when there is
>>>>> lots of filesystem work going on (i.e. lots of GFP_NOFS allocations)
>>>>> is the deferral of reclaim work to a context that can do something
>>>>> about it.
>>>> Imagine the situation: a memcg issues a GFP_NOFS allocation and goes to
>>>> shrink_slab() where it defers them to the global counter; then another
>>>> memcg issues a GFP_KERNEL allocation, also goes to shrink_slab() where
>>>> it sees a huge number of deferred objects and starts shrinking them,
>>>> which is not good IMHO.
>>> That's exactly what the deferred mechanism is for - we know we have
>>> to do the work, but we can't do it right now so let someone else do
>>> it who can.
>>>
>>> In most cases, deferral is handled by kswapd, because when a
>>> filesystem workload is causing memory pressure then most allocations
>>> are done in GFP_NOFS conditions. Hence the only memory reclaim that
>>> can make progress here is kswapd.
>>>
>>> Right now, you aren't deferring any of this memory pressure to some
>>> other agent, so it just does not get done. That's a massive problem
>>> - it's a design flaw - and instead I see lots of crazy hacks being
>>> added to do stuff that should simply be deferred to kswapd like is
>>> done for global memory pressure.
>>>
>>> Hell, kswapd should be allowed to walk memcg LRU lists and trim
>>> them, just like it does for the global lists. We only need a single
>>> "deferred work" counter per node for that - just let kswapd
>>> proportion the deferred work over the per-node LRU and the
>>> memcgs....
>> Seems I misunderstood :-(
>>
>> Let me try. You mean we have only one nr_deferred counter per node, and
>> kswapd scans
>>
>> nr_deferred*memcg_kmem_size/total_kmem_size
>>
>> objects in each memcg, right?
>>
>> Then if there were a lot of objects deferred on memcg (not global)
>> pressure due to a memcg issuing a lot of GFP_NOFS allocations, kswapd
>> will reclaim objects from all, even unlimited, memcgs. This looks like
>> an isolation issue :-/
> Which, when you are running out of memory, is much less of an
> issue than not being able to make progress reclaiming memory.
>
> Besides, the "isolation" argument runs both ways. e.g. when there
> isn't memory available, it's entirely possible it's because there is
> actually no free memory, not because we've hit a memcg limit. e.g.
> all the memory has been consumed by an unlimited memcg, and we need to
> reclaim from it so this memcg can make progress.
>
> In those situations we need to reclaim from everyone, not
> just the memcg that can't find free memory to allocate....

Agreed, on global overcommit we have to reclaim from everyone. I guess
it would also be nice to balance the reclaim proportionally to the
memcg limits somehow then.

>> Currently we have a per-node nr_deferred counter for each shrinker. If
>> we add per-memcg reclaim, we have to make it per-memcg per-node, don't we?
> Think about what you just said for a moment. We have how many memcg
> shrinkers?  And we can support how many nodes? And we can support
> how many memcgs? And when we multiply that all together, how much
> memory do we need to track that?

But we could grow nr_deferred dynamically as the number of kmem-active
memcgs grows, just like we're going to grow list_lru. Then the overhead
would not be that big; it would be practically zero if there are no
kmem-active memcgs.

I mean, a per-memcg nr_deferred wouldn't hinder the global reclaim, and
it's not that much overhead compared to several per-memcg lrus for
each superblock, but it would make the reclaimer fairer on memcg
pressure, wouldn't it?
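
What I have in mind is roughly the following layout (a sketch only, the
memcg_nr_deferred field and its indexing are made up):

struct shrinker {
	unsigned long (*count_objects)(struct shrinker *,
				       struct shrink_control *sc);
	unsigned long (*scan_objects)(struct shrinker *,
				      struct shrink_control *sc);
	int seeks;
	long batch;
	unsigned long flags;

	/* These are for internal use */
	struct list_head list;

	/* objs pending delete, per node - global reclaim, as today */
	atomic_long_t *nr_deferred;

#ifdef CONFIG_MEMCG_KMEM
	/*
	 * Per-memcg per-node deferred counters, indexed by
	 * memcg_cache_id(). Allocated lazily and grown only when a memcg
	 * becomes kmem-active, so the overhead is zero otherwise.
	 */
	atomic_long_t **memcg_nr_deferred;
#endif
};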

Thanks.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v12 10/18] memcg,list_lru: add per-memcg LRU list infrastructure
  2013-12-03 12:29     ` Vladimir Davydov
@ 2013-12-05 21:19       ` Dave Chinner
  0 siblings, 0 replies; 38+ messages in thread
From: Dave Chinner @ 2013-12-05 21:19 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: hannes, mhocko, dchinner, akpm, linux-kernel, linux-mm, cgroups,
	devel, glommer, Al Viro, Balbir Singh, KAMEZAWA Hiroyuki

On Tue, Dec 03, 2013 at 04:29:14PM +0400, Vladimir Davydov wrote:
> On 12/03/2013 03:18 PM, Dave Chinner wrote:
> > On Mon, Dec 02, 2013 at 03:19:45PM +0400, Vladimir Davydov wrote:
> >> FS-shrinkers, which shrink dcaches and icaches, keep dentries and inodes
> >> in list_lru structures in order to evict least recently used objects.
> >> With per-memcg kmem shrinking infrastructure introduced, we have to make
> >> those LRU lists per-memcg in order to allow shrinking FS caches that
> >> belong to different memory cgroups independently.
> >>
> >> This patch addresses the issue by introducing struct memcg_list_lru.
> >> This struct aggregates list_lru objects for each kmem-active memcg and
> >> keeps the set up to date whenever a memcg is created or destroyed. Its
> >> interface is very simple: it only allows getting the pointer to the
> >> appropriate list_lru object from a memcg or a kmem ptr, which can then
> >> be operated on with the conventional list_lru methods.
> > Basically, the idea was that the memcg LRUs hide entirely behind the
> > generic list_lru interface so that any cache that used the list_lru
> > infrastructure got memcg capabilities for free. The memcgs to shrink
> > were to be passed through the shrink_control to the shrinkers and on
> > to the list LRU code, and it then did all the "which lru are we using"
> > logic internally.
> >
> > What you've done is driven all the "which LRU are we using" logic
> > into every single caller location. i.e. you've just broken the
> > underlying design principle that Glauber and I had worked towards
> > with this code - that memcg aware LRUs should be completely
> > transparent to list_lru users. Just like NUMA awareness came for
> > free with the list_lru code, so should memcg awareness....
> >
> >> +/*
> >> + * The following structure can be used to reclaim kmem objects accounted to
> >> + * different memory cgroups independently. It aggregates a set of list_lru
> >> + * objects, one for each kmem-enabled memcg, and provides the method to get
> >> + * the lru corresponding to a memcg.
> >> + */
> >> +struct memcg_list_lru {
> >> +	struct list_lru global_lru;
> >> +
> >> +#ifdef CONFIG_MEMCG_KMEM
> >> +	struct list_lru **memcg_lrus;	/* rcu-protected array of per-memcg
> >> +					   lrus, indexed by memcg_cache_id() */
> >> +
> >> +	struct list_head list;		/* list of all memcg-aware lrus */
> >> +
> >> +	/*
> >> +	 * The memcg_lrus array is rcu protected, so we can only free it after
> >> +	 * a call to synchronize_rcu(). To avoid multiple calls to
> >> +	 * synchronize_rcu() when many lrus get updated at the same time, which
> >> +	 * is a typical scenario, we will store the pointer to the previous
> >> +	 * version of the array in the old_lrus variable for each lru, and then
> >> +	 * free them all at once after a single call to synchronize_rcu().
> >> +	 */
> >> +	void *old_lrus;
> >> +#endif
> >> +};
> > Really, this should be embedded in the struct list_lru, not wrapping
> > around the outside. I don't see any changelog to tell me why you
> > changed the code from what was last in Glauber's tree, so can you
> > explain why exposing all this memcg stuff to everyone is a good
> > idea?
> 
> I preferred to move from list_lru to memcg_list_lru, because the
> connection between list_lru and memcgs turned memcontrol.c and
> list_lru.c into a monolithic structure. When I read the comments on the
> last version of this patchset submitted by Glauber (v10), I found that
> Andrew Morton disliked it, which was why I tried to "fix" it the way you
> observe in this patch. Besides, I thought that list_lru might be used
> w/o memcgs.

Yes, the list_lru can be used without memcgs. That's the point
of having a generic list_lru API - we are able to add more
functionality to the list_lru implementation without needing to
change all the users of the API.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2013-12-05 21:19 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-12-02 11:19 [PATCH v12 00/18] kmemcg shrinkers Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 01/18] memcg: make cache index determination more robust Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 02/18] memcg: consolidate callers of memcg_cache_id Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 03/18] memcg: move initialization to memcg creation Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 04/18] memcg: move several kmemcg functions upper Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 05/18] fs: do not use destroy_super() in alloc_super() fail path Vladimir Davydov
2013-12-03  9:00   ` Dave Chinner
2013-12-03  9:23     ` Vladimir Davydov
2013-12-03 13:37       ` Al Viro
2013-12-03 13:48         ` Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 06/18] vmscan: rename shrink_slab() args to make it more generic Vladimir Davydov
2013-12-03  9:33   ` Dave Chinner
2013-12-03  9:44     ` Vladimir Davydov
2013-12-03 10:04       ` Dave Chinner
2013-12-02 11:19 ` [PATCH v12 07/18] vmscan: move call to shrink_slab() to shrink_zones() Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 08/18] vmscan: do_try_to_free_pages(): remove shrink_control argument Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 09/18] vmscan: shrink slab on memcg pressure Vladimir Davydov
2013-12-03 10:48   ` Dave Chinner
2013-12-03 12:15     ` Vladimir Davydov
2013-12-04  4:51       ` Dave Chinner
2013-12-04  6:31         ` Vladimir Davydov
2013-12-05  5:01           ` Dave Chinner
2013-12-05  6:57             ` Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 10/18] memcg,list_lru: add per-memcg LRU list infrastructure Vladimir Davydov
2013-12-03 11:18   ` Dave Chinner
2013-12-03 12:29     ` Vladimir Davydov
2013-12-05 21:19       ` Dave Chinner
2013-12-02 11:19 ` [PATCH v12 11/18] memcg,list_lru: add function walking over all lists of a per-memcg LRU Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 12/18] fs: make icache, dcache shrinkers memcg-aware Vladimir Davydov
2013-12-03 11:45   ` Dave Chinner
2013-12-03 12:34     ` Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 13/18] memcg: per-memcg kmem shrinking Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 14/18] vmscan: take at least one pass with shrinkers Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 15/18] memcg: allow kmem limit to be resized down Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 16/18] vmpressure: in-kernel notifications Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 17/18] memcg: reap dead memcgs upon global memory pressure Vladimir Davydov
2013-12-02 11:19 ` [PATCH v12 18/18] memcg: flush memcg items upon memcg destruction Vladimir Davydov
2013-12-02 11:22 ` [PATCH v12 00/18] kmemcg shrinkers Vladimir Davydov
