linux-kernel.vger.kernel.org archive mirror
* [PATCH -mm 00/14] Per memcg slab shrinkers
@ 2014-09-21 15:14 Vladimir Davydov
  2014-09-21 15:14 ` [PATCH -mm 01/14] list_lru: introduce list_lru_shrink_{count,walk} Vladimir Davydov
                   ` (15 more replies)
  0 siblings, 16 replies; 18+ messages in thread
From: Vladimir Davydov @ 2014-09-21 15:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Dave Chinner,
	Glauber Costa, Suleiman Souhlal, Kamezawa Hiroyuki, Tejun Heo,
	linux-kernel, linux-mm, cgroups

Hi,

Kmem accounting in memcg is unusable now, because it lacks slab shrinker
support. That means that when we hit the limit, we get ENOMEM without
any chance to recover. What we should do then is call shrink_slab, which
would reclaim old inode/dentry caches from this cgroup. This is what
this patch set is intended to do.

Basically, it does two things. First, it introduces the notion of a
per-memcg slab shrinker. A shrinker that wants to reclaim objects per
cgroup should mark itself as SHRINKER_MEMCG_AWARE. It will then be
passed the memory cgroup to scan from in shrink_control->memcg. For such
shrinkers, shrink_slab iterates over the whole cgroup subtree under the
target cgroup and calls the shrinker for each kmem-active memory cgroup.
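
For illustration, a memcg-aware shrinker would then look roughly like
this (just a sketch: the flag and the shrink_control fields are the ones
introduced by this series, while my_nr_objects()/my_reclaim_objects()
are made-up placeholders):

        static unsigned long my_shrink_count(struct shrinker *shrink,
                                             struct shrink_control *sc)
        {
                /* sc->memcg is the memcg to scan, NULL on global pressure */
                return my_nr_objects(sc->memcg, sc->nid);
        }

        static unsigned long my_shrink_scan(struct shrinker *shrink,
                                            struct shrink_control *sc)
        {
                /* free up to sc->nr_to_scan objects charged to sc->memcg */
                return my_reclaim_objects(sc->memcg, sc->nid, sc->nr_to_scan);
        }

        static struct shrinker my_shrinker = {
                .count_objects  = my_shrink_count,
                .scan_objects   = my_shrink_scan,
                .seeks          = DEFAULT_SEEKS,
                .flags          = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE,
        };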

Secondly, this patch set makes the list_lru structure per-memcg. It's
done transparently to list_lru users - all they have to do is tell
list_lru_init that they want a memcg-aware list_lru. The list_lru will
then automatically distribute objects among per-memcg lists based on
which cgroup each object is accounted to. This way, to make FS shrinkers
(icache, dcache) memcg-aware, we only need to make them use a
memcg-aware list_lru, and this is what this patch set does.
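
On the FS side this boils down to something like the following when
setting up a superblock's LRUs (a sketch; list_lru_init_memcg() here is
an assumed name for the memcg-aware init variant mentioned above):

        if (list_lru_init_memcg(&sb->s_dentry_lru))
                goto fail;
        if (list_lru_init_memcg(&sb->s_inode_lru))
                goto fail;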

The main difference between this patch set and my previous attempts to
push memcg-aware shrinkers is in how it handles css offline. Now we
don't let list_lrus corresponding to dead memory cgroups hang around
till all objects are freed. Instead, we move the lru items to the parent
cgroup's lru list. This is really important, because it allows us to
release the memcg_cache_id used for indexing in per-memcg arrays. If we
didn't do this, the arrays would grow uncontrollably, which is really
bad. Note that, in comparison to user memory reparenting, which Johannes
is going to get rid of, this is not racy and much easier to implement,
although it does impose some limitations on how list_lru locking can be
implemented. Another difference is that it doesn't reparent charges,
only list_lru entries - the css will be dangling until the last kmem
object is freed.

As before, this patch set only enables per-memcg kmem reclaim when the
pressure comes from memory.limit, not from memory.kmem.limit. Handling
memory.kmem.limit is going to be tricky due to GFP_NOFS allocations; it
will probably require a sort of soft limit to work properly. I'm leaving
this for future work.

The patch set basically consists of three main parts and is organized
as follows:

 - Patches 1-3 implement per-memcg shrinker core with patches 1 and 2
   preparing list_lru users for upcoming changes and patch 3 tuning
   shrink_slab.

 - Patches 4-10 make the memcg core release cache ids on css offline,
   doing a bit of cleanup along the way. This is easy, because
   kmem_caches don't need the cache id after css offline, since there
   can be no allocations from a dead memcg. Note that most of these
   patches (namely 4-6 and 8) were once merged, but then I decided to
   drop them, because I didn't know how to deal with list_lrus at that
   time (see https://lkml.org/lkml/2014/7/23/218).

 - Finally, patches 11-14 make list_lru per-memcg and mark FS shrinkers
   as memcg-aware. This is the most difficult part of this patch set,
   with patch 13 (unlucky :-) doing the most important work.

Reviews are more than welcome.

Thanks,

Vladimir Davydov (14):
  list_lru: introduce list_lru_shrink_{count,walk}
  fs: consolidate {nr,free}_cached_objects args in shrink_control
  vmscan: shrink slab on memcg pressure
  memcg: use mem_cgroup_id for per memcg cache naming
  memcg: add pointer to owner cache to memcg_cache_params
  memcg: keep all children of each root cache on a list
  memcg: update memcg_caches array entries on the slab side
  memcg: release memcg_cache_id on css offline
  memcg: rename some cache id related variables
  memcg: add rwsem to sync against memcg_caches arrays relocation
  list_lru: get rid of ->active_nodes
  list_lru: organize all list_lrus to list
  list_lru: introduce per-memcg lists
  fs: make shrinker memcg aware

 fs/dcache.c                |   14 +-
 fs/gfs2/main.c             |    2 +-
 fs/gfs2/quota.c            |    6 +-
 fs/inode.c                 |    7 +-
 fs/internal.h              |    7 +-
 fs/super.c                 |   36 ++--
 fs/xfs/xfs_buf.c           |    9 +-
 fs/xfs/xfs_qm.c            |    9 +-
 fs/xfs/xfs_super.c         |    7 +-
 include/linux/fs.h         |    6 +-
 include/linux/list_lru.h   |   85 +++++----
 include/linux/memcontrol.h |   60 ++++--
 include/linux/shrinker.h   |   10 +-
 include/linux/slab.h       |    5 +-
 mm/list_lru.c              |  434 +++++++++++++++++++++++++++++++++++++++++---
 mm/memcontrol.c            |  254 +++++++++++++++++++-------
 mm/slab.c                  |   40 ++--
 mm/slab.h                  |   27 ++-
 mm/slab_common.c           |   65 +++----
 mm/slub.c                  |   41 +++--
 mm/vmscan.c                |   87 +++++++--
 mm/workingset.c            |    9 +-
 22 files changed, 925 insertions(+), 295 deletions(-)

-- 
1.7.10.4



* [PATCH -mm 01/14] list_lru: introduce list_lru_shrink_{count,walk}
  2014-09-21 15:14 [PATCH -mm 00/14] Per memcg slab shrinkers Vladimir Davydov
@ 2014-09-21 15:14 ` Vladimir Davydov
  2014-09-21 15:14 ` [PATCH -mm 02/14] fs: consolidate {nr,free}_cached_objects args in shrink_control Vladimir Davydov
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 18+ messages in thread
From: Vladimir Davydov @ 2014-09-21 15:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Dave Chinner,
	Glauber Costa, Suleiman Souhlal, Kamezawa Hiroyuki, Tejun Heo,
	linux-kernel, linux-mm, cgroups

NUMA-aware slab shrinkers use the list_lru structure to distribute
objects coming from different NUMA nodes to different lists. Whenever
such a shrinker needs to count or scan objects from a particular node,
it issues calls like this:

        count = list_lru_count_node(lru, sc->nid);
        freed = list_lru_walk_node(lru, sc->nid, isolate_func,
                                   isolate_arg, &sc->nr_to_scan);

where sc is an instance of the shrink_control structure passed to it
from vmscan.

To simplify this, let's add special list_lru functions to be used by
shrinkers, list_lru_shrink_count() and list_lru_shrink_walk(), which
consolidate the nid and nr_to_scan arguments in the shrink_control
structure.

This will also allow us to avoid patching shrinkers that use list_lru
when we make shrink_slab() per-memcg - all we will have to do is extend
the shrink_control structure to include the target memcg and make
list_lru_shrink_{count,walk} handle this appropriately.
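
With the new helpers, the snippet above reduces to:

        count = list_lru_shrink_count(lru, sc);
        freed = list_lru_shrink_walk(lru, sc, isolate_func, isolate_arg);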

Suggested-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 fs/dcache.c              |   14 ++++++--------
 fs/gfs2/quota.c          |    6 +++---
 fs/inode.c               |    7 +++----
 fs/internal.h            |    7 +++----
 fs/super.c               |   24 +++++++++++-------------
 fs/xfs/xfs_buf.c         |    7 +++----
 fs/xfs/xfs_qm.c          |    7 +++----
 include/linux/list_lru.h |   16 ++++++++++++++++
 mm/workingset.c          |    6 +++---
 9 files changed, 51 insertions(+), 43 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 35e61f6ce57a..e9c1a81a0ffa 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -900,24 +900,22 @@ dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg)
 /**
  * prune_dcache_sb - shrink the dcache
  * @sb: superblock
- * @nr_to_scan : number of entries to try to free
- * @nid: which node to scan for freeable entities
+ * @sc: shrink control, passed to list_lru_shrink_walk()
  *
- * Attempt to shrink the superblock dcache LRU by @nr_to_scan entries. This is
- * done when we need more memory an called from the superblock shrinker
+ * Attempt to shrink the superblock dcache LRU by @sc->nr_to_scan entries. This
+ * is done when we need more memory and called from the superblock shrinker
  * function.
  *
  * This function may fail to free any resources if all the dentries are in
  * use.
  */
-long prune_dcache_sb(struct super_block *sb, unsigned long nr_to_scan,
-		     int nid)
+long prune_dcache_sb(struct super_block *sb, struct shrink_control *sc)
 {
 	LIST_HEAD(dispose);
 	long freed;
 
-	freed = list_lru_walk_node(&sb->s_dentry_lru, nid, dentry_lru_isolate,
-				       &dispose, &nr_to_scan);
+	freed = list_lru_shrink_walk(&sb->s_dentry_lru, sc,
+				     dentry_lru_isolate, &dispose);
 	shrink_dentry_list(&dispose);
 	return freed;
 }
diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index 64b29f7f6b4c..6292d79fc340 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -171,8 +171,8 @@ static unsigned long gfs2_qd_shrink_scan(struct shrinker *shrink,
 	if (!(sc->gfp_mask & __GFP_FS))
 		return SHRINK_STOP;
 
-	freed = list_lru_walk_node(&gfs2_qd_lru, sc->nid, gfs2_qd_isolate,
-				   &dispose, &sc->nr_to_scan);
+	freed = list_lru_shrink_walk(&gfs2_qd_lru, sc,
+				     gfs2_qd_isolate, &dispose);
 
 	gfs2_qd_dispose(&dispose);
 
@@ -182,7 +182,7 @@ static unsigned long gfs2_qd_shrink_scan(struct shrinker *shrink,
 static unsigned long gfs2_qd_shrink_count(struct shrinker *shrink,
 					  struct shrink_control *sc)
 {
-	return vfs_pressure_ratio(list_lru_count_node(&gfs2_qd_lru, sc->nid));
+	return vfs_pressure_ratio(list_lru_shrink_count(&gfs2_qd_lru, sc));
 }
 
 struct shrinker gfs2_qd_shrinker = {
diff --git a/fs/inode.c b/fs/inode.c
index 26753ba7b6d6..f08420a3bf50 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -749,14 +749,13 @@ inode_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg)
  * to trim from the LRU. Inodes to be freed are moved to a temporary list and
  * then are freed outside inode_lock by dispose_list().
  */
-long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan,
-		     int nid)
+long prune_icache_sb(struct super_block *sb, struct shrink_control *sc)
 {
 	LIST_HEAD(freeable);
 	long freed;
 
-	freed = list_lru_walk_node(&sb->s_inode_lru, nid, inode_lru_isolate,
-				       &freeable, &nr_to_scan);
+	freed = list_lru_shrink_walk(&sb->s_inode_lru, sc,
+				     inode_lru_isolate, &freeable);
 	dispose_list(&freeable);
 	return freed;
 }
diff --git a/fs/internal.h b/fs/internal.h
index 9477f8f6aefc..7a6aa641c060 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -14,6 +14,7 @@ struct file_system_type;
 struct linux_binprm;
 struct path;
 struct mount;
+struct shrink_control;
 
 /*
  * block_dev.c
@@ -112,8 +113,7 @@ extern int open_check_o_direct(struct file *f);
  * inode.c
  */
 extern spinlock_t inode_sb_list_lock;
-extern long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan,
-			    int nid);
+extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc);
 extern void inode_add_lru(struct inode *inode);
 
 /*
@@ -130,8 +130,7 @@ extern int invalidate_inodes(struct super_block *, bool);
  */
 extern struct dentry *__d_alloc(struct super_block *, const struct qstr *);
 extern int d_set_mounted(struct dentry *dentry);
-extern long prune_dcache_sb(struct super_block *sb, unsigned long nr_to_scan,
-			    int nid);
+extern long prune_dcache_sb(struct super_block *sb, struct shrink_control *sc);
 
 /*
  * read_write.c
diff --git a/fs/super.c b/fs/super.c
index eae088f6aaae..4554ac257647 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -77,8 +77,8 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 	if (sb->s_op->nr_cached_objects)
 		fs_objects = sb->s_op->nr_cached_objects(sb, sc->nid);
 
-	inodes = list_lru_count_node(&sb->s_inode_lru, sc->nid);
-	dentries = list_lru_count_node(&sb->s_dentry_lru, sc->nid);
+	inodes = list_lru_shrink_count(&sb->s_inode_lru, sc);
+	dentries = list_lru_shrink_count(&sb->s_dentry_lru, sc);
 	total_objects = dentries + inodes + fs_objects + 1;
 	if (!total_objects)
 		total_objects = 1;
@@ -86,20 +86,20 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 	/* proportion the scan between the caches */
 	dentries = mult_frac(sc->nr_to_scan, dentries, total_objects);
 	inodes = mult_frac(sc->nr_to_scan, inodes, total_objects);
+	fs_objects = mult_frac(sc->nr_to_scan, fs_objects, total_objects);
 
 	/*
 	 * prune the dcache first as the icache is pinned by it, then
 	 * prune the icache, followed by the filesystem specific caches
 	 */
-	freed = prune_dcache_sb(sb, dentries, sc->nid);
-	freed += prune_icache_sb(sb, inodes, sc->nid);
+	sc->nr_to_scan = dentries;
+	freed = prune_dcache_sb(sb, sc);
+	sc->nr_to_scan = inodes;
+	freed += prune_icache_sb(sb, sc);
 
-	if (fs_objects) {
-		fs_objects = mult_frac(sc->nr_to_scan, fs_objects,
-								total_objects);
+	if (fs_objects)
 		freed += sb->s_op->free_cached_objects(sb, fs_objects,
 						       sc->nid);
-	}
 
 	drop_super(sb);
 	return freed;
@@ -118,17 +118,15 @@ static unsigned long super_cache_count(struct shrinker *shrink,
 	 * scalability bottleneck. The counts could get updated
 	 * between super_cache_count and super_cache_scan anyway.
 	 * Call to super_cache_count with shrinker_rwsem held
-	 * ensures the safety of call to list_lru_count_node() and
+	 * ensures the safety of call to list_lru_shrink_count() and
 	 * s_op->nr_cached_objects().
 	 */
 	if (sb->s_op && sb->s_op->nr_cached_objects)
 		total_objects = sb->s_op->nr_cached_objects(sb,
 						 sc->nid);
 
-	total_objects += list_lru_count_node(&sb->s_dentry_lru,
-						 sc->nid);
-	total_objects += list_lru_count_node(&sb->s_inode_lru,
-						 sc->nid);
+	total_objects += list_lru_shrink_count(&sb->s_dentry_lru, sc);
+	total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc);
 
 	total_objects = vfs_pressure_ratio(total_objects);
 	return total_objects;
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 497fcde381d7..e02a49a30f89 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1586,10 +1586,9 @@ xfs_buftarg_shrink_scan(
 					struct xfs_buftarg, bt_shrinker);
 	LIST_HEAD(dispose);
 	unsigned long		freed;
-	unsigned long		nr_to_scan = sc->nr_to_scan;
 
-	freed = list_lru_walk_node(&btp->bt_lru, sc->nid, xfs_buftarg_isolate,
-				       &dispose, &nr_to_scan);
+	freed = list_lru_shrink_walk(&btp->bt_lru, sc,
+				     xfs_buftarg_isolate, &dispose);
 
 	while (!list_empty(&dispose)) {
 		struct xfs_buf *bp;
@@ -1608,7 +1607,7 @@ xfs_buftarg_shrink_count(
 {
 	struct xfs_buftarg	*btp = container_of(shrink,
 					struct xfs_buftarg, bt_shrinker);
-	return list_lru_count_node(&btp->bt_lru, sc->nid);
+	return list_lru_shrink_count(&btp->bt_lru, sc);
 }
 
 void
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index 10232102b4a6..51167f44c408 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -524,7 +524,6 @@ xfs_qm_shrink_scan(
 	struct xfs_qm_isolate	isol;
 	unsigned long		freed;
 	int			error;
-	unsigned long		nr_to_scan = sc->nr_to_scan;
 
 	if ((sc->gfp_mask & (__GFP_FS|__GFP_WAIT)) != (__GFP_FS|__GFP_WAIT))
 		return 0;
@@ -532,8 +531,8 @@ xfs_qm_shrink_scan(
 	INIT_LIST_HEAD(&isol.buffers);
 	INIT_LIST_HEAD(&isol.dispose);
 
-	freed = list_lru_walk_node(&qi->qi_lru, sc->nid, xfs_qm_dquot_isolate, &isol,
-					&nr_to_scan);
+	freed = list_lru_shrink_walk(&qi->qi_lru, sc,
+				     xfs_qm_dquot_isolate, &isol);
 
 	error = xfs_buf_delwri_submit(&isol.buffers);
 	if (error)
@@ -558,7 +557,7 @@ xfs_qm_shrink_count(
 	struct xfs_quotainfo	*qi = container_of(shrink,
 					struct xfs_quotainfo, qi_shrinker);
 
-	return list_lru_count_node(&qi->qi_lru, sc->nid);
+	return list_lru_shrink_count(&qi->qi_lru, sc);
 }
 
 /*
diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index f3434533fbf8..f500a2e39b13 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -9,6 +9,7 @@
 
 #include <linux/list.h>
 #include <linux/nodemask.h>
+#include <linux/shrinker.h>
 
 /* list_lru_walk_cb has to always return one of those */
 enum lru_status {
@@ -81,6 +82,13 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item);
  * Callers that want such a guarantee need to provide an outer lock.
  */
 unsigned long list_lru_count_node(struct list_lru *lru, int nid);
+
+static inline unsigned long list_lru_shrink_count(struct list_lru *lru,
+						  struct shrink_control *sc)
+{
+	return list_lru_count_node(lru, sc->nid);
+}
+
 static inline unsigned long list_lru_count(struct list_lru *lru)
 {
 	long count = 0;
@@ -120,6 +128,14 @@ unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
 				 unsigned long *nr_to_walk);
 
 static inline unsigned long
+list_lru_shrink_walk(struct list_lru *lru, struct shrink_control *sc,
+		     list_lru_walk_cb isolate, void *cb_arg)
+{
+	return list_lru_walk_node(lru, sc->nid, isolate, cb_arg,
+				  &sc->nr_to_scan);
+}
+
+static inline unsigned long
 list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
 	      void *cb_arg, unsigned long nr_to_walk)
 {
diff --git a/mm/workingset.c b/mm/workingset.c
index f7216fa7da27..d4fa7fb10a52 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -275,7 +275,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 
 	/* list_lru lock nests inside IRQ-safe mapping->tree_lock */
 	local_irq_disable();
-	shadow_nodes = list_lru_count_node(&workingset_shadow_nodes, sc->nid);
+	shadow_nodes = list_lru_shrink_count(&workingset_shadow_nodes, sc);
 	local_irq_enable();
 
 	pages = node_present_pages(sc->nid);
@@ -376,8 +376,8 @@ static unsigned long scan_shadow_nodes(struct shrinker *shrinker,
 
 	/* list_lru lock nests inside IRQ-safe mapping->tree_lock */
 	local_irq_disable();
-	ret =  list_lru_walk_node(&workingset_shadow_nodes, sc->nid,
-				  shadow_lru_isolate, NULL, &sc->nr_to_scan);
+	ret =  list_lru_shrink_walk(&workingset_shadow_nodes, sc,
+				    shadow_lru_isolate, NULL);
 	local_irq_enable();
 	return ret;
 }
-- 
1.7.10.4



* [PATCH -mm 02/14] fs: consolidate {nr,free}_cached_objects args in shrink_control
  2014-09-21 15:14 [PATCH -mm 00/14] Per memcg slab shrinkers Vladimir Davydov
  2014-09-21 15:14 ` [PATCH -mm 01/14] list_lru: introduce list_lru_shrink_{count,walk} Vladimir Davydov
@ 2014-09-21 15:14 ` Vladimir Davydov
  2014-09-21 15:14 ` [PATCH -mm 03/14] vmscan: shrink slab on memcg pressure Vladimir Davydov
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 18+ messages in thread
From: Vladimir Davydov @ 2014-09-21 15:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Dave Chinner,
	Glauber Costa, Suleiman Souhlal, Kamezawa Hiroyuki, Tejun Heo,
	linux-kernel, linux-mm, cgroups

We are going to make FS shrinkers memcg-aware. To achieve that, we will
have to pass the memcg to scan to the nr_cached_objects and
free_cached_objects VFS methods, which currently take only the NUMA node
to scan. Since the shrink_control structure already holds the node, and
the memcg to scan will be added to it when we introduce memcg-aware
vmscan, let us consolidate the methods' arguments in this structure to
keep things clean.
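
For reference, in prototype form the change to super_operations (taken
from the include/linux/fs.h hunk below) is from

        long (*nr_cached_objects)(struct super_block *, int);
        long (*free_cached_objects)(struct super_block *, long, int);

to

        long (*nr_cached_objects)(struct super_block *,
                                  struct shrink_control *);
        long (*free_cached_objects)(struct super_block *,
                                    struct shrink_control *);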

Suggested-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 fs/super.c         |   12 ++++++------
 fs/xfs/xfs_super.c |    7 +++----
 include/linux/fs.h |    6 ++++--
 3 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 4554ac257647..a2b735a42e74 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -75,7 +75,7 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 		return SHRINK_STOP;
 
 	if (sb->s_op->nr_cached_objects)
-		fs_objects = sb->s_op->nr_cached_objects(sb, sc->nid);
+		fs_objects = sb->s_op->nr_cached_objects(sb, sc);
 
 	inodes = list_lru_shrink_count(&sb->s_inode_lru, sc);
 	dentries = list_lru_shrink_count(&sb->s_dentry_lru, sc);
@@ -97,9 +97,10 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 	sc->nr_to_scan = inodes;
 	freed += prune_icache_sb(sb, sc);
 
-	if (fs_objects)
-		freed += sb->s_op->free_cached_objects(sb, fs_objects,
-						       sc->nid);
+	if (fs_objects) {
+		sc->nr_to_scan = fs_objects;
+		freed += sb->s_op->free_cached_objects(sb, sc);
+	}
 
 	drop_super(sb);
 	return freed;
@@ -122,8 +123,7 @@ static unsigned long super_cache_count(struct shrinker *shrink,
 	 * s_op->nr_cached_objects().
 	 */
 	if (sb->s_op && sb->s_op->nr_cached_objects)
-		total_objects = sb->s_op->nr_cached_objects(sb,
-						 sc->nid);
+		total_objects = sb->s_op->nr_cached_objects(sb, sc);
 
 	total_objects += list_lru_shrink_count(&sb->s_dentry_lru, sc);
 	total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index b194652033cd..d7e5c93c1a28 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1521,7 +1521,7 @@ xfs_fs_mount(
 static long
 xfs_fs_nr_cached_objects(
 	struct super_block	*sb,
-	int			nid)
+	struct shrink_control	*sc)
 {
 	return xfs_reclaim_inodes_count(XFS_M(sb));
 }
@@ -1529,10 +1529,9 @@ xfs_fs_nr_cached_objects(
 static long
 xfs_fs_free_cached_objects(
 	struct super_block	*sb,
-	long			nr_to_scan,
-	int			nid)
+	struct shrink_control	*sc)
 {
-	return xfs_reclaim_inodes_nr(XFS_M(sb), nr_to_scan);
+	return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
 }
 
 static const struct super_operations xfs_super_operations = {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ddd10c70a01d..63fbf6f4bd36 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1573,8 +1573,10 @@ struct super_operations {
 	ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
 #endif
 	int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
-	long (*nr_cached_objects)(struct super_block *, int);
-	long (*free_cached_objects)(struct super_block *, long, int);
+	long (*nr_cached_objects)(struct super_block *,
+				  struct shrink_control *);
+	long (*free_cached_objects)(struct super_block *,
+				    struct shrink_control *);
 };
 
 /*
-- 
1.7.10.4



* [PATCH -mm 03/14] vmscan: shrink slab on memcg pressure
  2014-09-21 15:14 [PATCH -mm 00/14] Per memcg slab shrinkers Vladimir Davydov
  2014-09-21 15:14 ` [PATCH -mm 01/14] list_lru: introduce list_lru_shrink_{count,walk} Vladimir Davydov
  2014-09-21 15:14 ` [PATCH -mm 02/14] fs: consolidate {nr,free}_cached_objects args in shrink_control Vladimir Davydov
@ 2014-09-21 15:14 ` Vladimir Davydov
  2014-09-21 15:14 ` [PATCH -mm 04/14] memcg: use mem_cgroup_id for per memcg cache naming Vladimir Davydov
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 18+ messages in thread
From: Vladimir Davydov @ 2014-09-21 15:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Dave Chinner,
	Glauber Costa, Suleiman Souhlal, Kamezawa Hiroyuki, Tejun Heo,
	linux-kernel, linux-mm, cgroups

This patch makes the direct reclaim path shrink slab caches not only on
global memory pressure, but also when we reach the user memory limit of
a memcg. To achieve that, it makes shrink_slab() walk over the memcg
hierarchy and run shrinkers marked as memcg-aware on the target memcg
and all its descendants. The memcg to scan is passed in a shrink_control
structure; memcg-unaware shrinkers are still called only on global
memory pressure, with memcg=NULL. It is up to the shrinker how to
organize the objects it is responsible for in order to achieve
per-memcg reclaim.
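
In simplified form, the memcg-aware iteration added to shrink_slab()
(distilled from the mm/vmscan.c hunk below) looks like this:

        shrinkctl->memcg = shrinkctl->target_mem_cgroup;
        do {
                if (!shrinkctl->memcg ||
                    memcg_kmem_is_active(shrinkctl->memcg))
                        freed += run_shrinker(shrinkctl, shrinker,
                                              nr_pages_scanned, lru_pages);
        } while ((shrinkctl->memcg =
                  mem_cgroup_iter(shrinkctl->target_mem_cgroup,
                                  shrinkctl->memcg, NULL)) != NULL);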

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 include/linux/memcontrol.h |   22 +++++++++++
 include/linux/shrinker.h   |   10 ++++-
 mm/memcontrol.c            |   46 ++++++++++++++++++++++-
 mm/vmscan.c                |   87 ++++++++++++++++++++++++++++++++++----------
 4 files changed, 143 insertions(+), 22 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 19df5d857411..c4e64d0e318d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -68,6 +68,9 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
 struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
 
+unsigned long mem_cgroup_zone_reclaimable_pages(struct zone *zone,
+						struct mem_cgroup *memcg);
+
 bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
 				  struct mem_cgroup *memcg);
 bool task_in_mem_cgroup(struct task_struct *task,
@@ -251,6 +254,12 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
 	return &zone->lruvec;
 }
 
+static inline unsigned long mem_cgroup_zone_reclaimable_pages(struct zone *zone,
+							struct mem_cgroup *)
+{
+	return 0;
+}
+
 static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 {
 	return NULL;
@@ -421,6 +430,9 @@ static inline bool memcg_kmem_enabled(void)
 	return static_key_false(&memcg_kmem_enabled_key);
 }
 
+bool memcg_kmem_is_active(struct mem_cgroup *memcg);
+bool memcg_kmem_is_active_subtree(struct mem_cgroup *memcg);
+
 /*
  * In general, we'll do everything in our power to not incur in any overhead
  * for non-memcg users for the kmem functions. Not even a function call, if we
@@ -549,6 +561,16 @@ static inline bool memcg_kmem_enabled(void)
 	return false;
 }
 
+static inline bool memcg_kmem_is_active(struct mem_cgroup *memcg)
+{
+	return false;
+}
+
+static inline bool memcg_kmem_is_active_subtree(struct mem_cgroup *memcg)
+{
+	return false;
+}
+
 static inline bool
 memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
 {
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 68c097077ef0..ab79b174bfbe 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -20,8 +20,15 @@ struct shrink_control {
 
 	/* shrink from these nodes */
 	nodemask_t nodes_to_scan;
+
+	/* shrink from this memory cgroup hierarchy (if not NULL) */
+	struct mem_cgroup *target_mem_cgroup;
+
 	/* current node being shrunk (for NUMA aware shrinkers) */
 	int nid;
+
+	/* current memcg being shrunk (for memcg aware shrinkers) */
+	struct mem_cgroup *memcg;
 };
 
 #define SHRINK_STOP (~0UL)
@@ -63,7 +70,8 @@ struct shrinker {
 #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
 
 /* Flags */
-#define SHRINKER_NUMA_AWARE (1 << 0)
+#define SHRINKER_NUMA_AWARE	(1 << 0)
+#define SHRINKER_MEMCG_AWARE	(1 << 1)
 
 extern int register_shrinker(struct shrinker *);
 extern void unregister_shrinker(struct shrinker *);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9431024e490c..7361bd8b720a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -391,7 +391,7 @@ static inline void memcg_kmem_set_active(struct mem_cgroup *memcg)
 	set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
 }
 
-static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
+bool memcg_kmem_is_active(struct mem_cgroup *memcg)
 {
 	return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
 }
@@ -1411,6 +1411,31 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
 	VM_BUG_ON((long)(*lru_size) < 0);
 }
 
+unsigned long mem_cgroup_zone_reclaimable_pages(struct zone *zone,
+						struct mem_cgroup *memcg)
+{
+	unsigned long nr = 0;
+	unsigned int lru_mask;
+	struct mem_cgroup *iter;
+
+	lru_mask = LRU_ALL_FILE;
+	if (get_nr_swap_pages() > 0)
+		lru_mask |= LRU_ALL_ANON;
+
+	iter = memcg;
+	do {
+		struct mem_cgroup_per_zone *mz;
+		enum lru_list lru;
+
+		mz = mem_cgroup_zone_zoneinfo(iter, zone);
+		for_each_lru(lru)
+			if (BIT(lru) & lru_mask)
+				nr += mz->lru_size[lru];
+	} while ((iter = mem_cgroup_iter(memcg, iter, NULL)) != NULL);
+
+	return nr;
+}
+
 /*
  * Checks whether given mem is same or in the root_mem_cgroup's
  * hierarchy subtree
@@ -2786,6 +2811,25 @@ static DEFINE_MUTEX(memcg_slab_mutex);
 
 static DEFINE_MUTEX(activate_kmem_mutex);
 
+/*
+ * Returns true if the given cgroup or any of its descendants has kmem
+ * accounting enabled.
+ */
+bool memcg_kmem_is_active_subtree(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *iter;
+
+	iter = memcg;
+	do {
+		if (memcg_kmem_is_active(iter)) {
+			mem_cgroup_iter_break(memcg, iter);
+			return true;
+		}
+	} while ((iter = mem_cgroup_iter(memcg, iter, NULL)) != NULL);
+
+	return false;
+}
+
 static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
 {
 	return !mem_cgroup_disabled() && !mem_cgroup_is_root(memcg) &&
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b672e2c6becc..041d0e41a5a4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -340,6 +340,26 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
 	return freed;
 }
 
+static unsigned long
+run_shrinker(struct shrink_control *shrinkctl, struct shrinker *shrinker,
+	     unsigned long nr_pages_scanned, unsigned long lru_pages)
+{
+	unsigned long freed = 0;
+
+	if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) {
+		shrinkctl->nid = 0;
+		return shrink_slab_node(shrinkctl, shrinker,
+					nr_pages_scanned, lru_pages);
+	}
+
+	for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
+		if (node_online(shrinkctl->nid))
+			freed += shrink_slab_node(shrinkctl, shrinker,
+						  nr_pages_scanned, lru_pages);
+	}
+	return freed;
+}
+
 /*
  * Call the shrink functions to age shrinkable caches
  *
@@ -381,20 +401,34 @@ unsigned long shrink_slab(struct shrink_control *shrinkctl,
 	}
 
 	list_for_each_entry(shrinker, &shrinker_list, list) {
-		if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) {
-			shrinkctl->nid = 0;
-			freed += shrink_slab_node(shrinkctl, shrinker,
-					nr_pages_scanned, lru_pages);
+		/*
+		 * Call memcg-unaware shrinkers only on global pressure.
+		 */
+		if (!(shrinker->flags & SHRINKER_MEMCG_AWARE)) {
+			if (!shrinkctl->target_mem_cgroup) {
+				shrinkctl->memcg = NULL;
+				freed += run_shrinker(shrinkctl, shrinker,
+						nr_pages_scanned, lru_pages);
+			}
 			continue;
 		}
 
-		for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
-			if (node_online(shrinkctl->nid))
-				freed += shrink_slab_node(shrinkctl, shrinker,
+		/*
+		 * For memcg-aware shrinkers iterate over the target memcg
+		 * hierarchy and run the shrinker on each kmem-active memcg
+		 * found in the hierarchy.
+		 */
+		shrinkctl->memcg = shrinkctl->target_mem_cgroup;
+		do {
+			if (!shrinkctl->memcg ||
+			    memcg_kmem_is_active(shrinkctl->memcg))
+				freed += run_shrinker(shrinkctl, shrinker,
 						nr_pages_scanned, lru_pages);
-
-		}
+		} while ((shrinkctl->memcg =
+			  mem_cgroup_iter(shrinkctl->target_mem_cgroup,
+					  shrinkctl->memcg, NULL)) != NULL);
 	}
+
 	up_read(&shrinker_rwsem);
 out:
 	cond_resched();
@@ -2381,6 +2415,7 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	gfp_t orig_mask;
 	struct shrink_control shrink = {
 		.gfp_mask = sc->gfp_mask,
+		.target_mem_cgroup = sc->target_mem_cgroup,
 	};
 	enum zone_type requested_highidx = gfp_zone(sc->gfp_mask);
 	bool reclaimable = false;
@@ -2400,17 +2435,22 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 					gfp_zone(sc->gfp_mask), sc->nodemask) {
 		if (!populated_zone(zone))
 			continue;
+
+		if (global_reclaim(sc) &&
+		    !cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
+			continue;
+
+		lru_pages += global_reclaim(sc) ?
+				zone_reclaimable_pages(zone) :
+				mem_cgroup_zone_reclaimable_pages(zone,
+						sc->target_mem_cgroup);
+		node_set(zone_to_nid(zone), shrink.nodes_to_scan);
+
 		/*
 		 * Take care memory controller reclaiming has small influence
 		 * to global LRU.
 		 */
 		if (global_reclaim(sc)) {
-			if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
-				continue;
-
-			lru_pages += zone_reclaimable_pages(zone);
-			node_set(zone_to_nid(zone), shrink.nodes_to_scan);
-
 			if (sc->priority != DEF_PRIORITY &&
 			    !zone_reclaimable(zone))
 				continue;	/* Let kswapd poll it */
@@ -2458,12 +2498,11 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 	}
 
 	/*
-	 * Don't shrink slabs when reclaiming memory from over limit cgroups
-	 * but do shrink slab at least once when aborting reclaim for
-	 * compaction to avoid unevenly scanning file/anon LRU pages over slab
-	 * pages.
+	 * Shrink slabs at least once when aborting reclaim for compaction
+	 * to avoid unevenly scanning file/anon LRU pages over slab pages.
 	 */
-	if (global_reclaim(sc)) {
+	if (global_reclaim(sc) ||
+	    memcg_kmem_is_active_subtree(sc->target_mem_cgroup)) {
 		shrink_slab(&shrink, sc->nr_scanned, lru_pages);
 		if (reclaim_state) {
 			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
@@ -2765,6 +2804,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 	struct zonelist *zonelist;
 	unsigned long nr_reclaimed;
 	int nid;
+	struct reclaim_state reclaim_state;
 	struct scan_control sc = {
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
 		.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
@@ -2785,6 +2825,10 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 
 	zonelist = NODE_DATA(nid)->node_zonelists;
 
+	lockdep_set_current_reclaim_state(sc.gfp_mask);
+	reclaim_state.reclaimed_slab = 0;
+	current->reclaim_state = &reclaim_state;
+
 	trace_mm_vmscan_memcg_reclaim_begin(0,
 					    sc.may_writepage,
 					    sc.gfp_mask);
@@ -2793,6 +2837,9 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 
 	trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
 
+	current->reclaim_state = NULL;
+	lockdep_clear_current_reclaim_state();
+
 	return nr_reclaimed;
 }
 #endif
-- 
1.7.10.4



* [PATCH -mm 04/14] memcg: use mem_cgroup_id for per memcg cache naming
  2014-09-21 15:14 [PATCH -mm 00/14] Per memcg slab shrinkers Vladimir Davydov
                   ` (2 preceding siblings ...)
  2014-09-21 15:14 ` [PATCH -mm 03/14] vmscan: shrink slab on memcg pressure Vladimir Davydov
@ 2014-09-21 15:14 ` Vladimir Davydov
  2014-09-21 15:14 ` [PATCH -mm 05/14] memcg: add pointer to owner cache to memcg_cache_params Vladimir Davydov
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 18+ messages in thread
From: Vladimir Davydov @ 2014-09-21 15:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Dave Chinner,
	Glauber Costa, Suleiman Souhlal, Kamezawa Hiroyuki, Tejun Heo,
	linux-kernel, linux-mm, cgroups

Currently, we use memcg_cache_id as part of a per-memcg cache name.
Since memcg_cache_id is released only on css free, this guarantees cache
name uniqueness.

However, it's bad practice to keep memcg_cache_id till css free, because
it occupies a slot in the kmem_cache->memcg_params->memcg_caches arrays.
So I'm going to make memcg release memcg_cache_id on css offline. As a
result, memcg_cache_id won't guarantee cache name uniqueness any more,
so switch to mem_cgroup_id, which is kept until css free, for naming
per-memcg caches.
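
For illustration, with the format string moved to memcontrol.c below,
the resulting name embeds the stable mem_cgroup_id; the id (3) and
cgroup name ("mygroup") here are made up:

        cache_name = kasprintf(GFP_KERNEL, "%s(%d:%s)", root_cache->name,
                               mem_cgroup_id(memcg), memcg_name_buf);
        /* e.g. "dentry(3:mygroup)" for the dentry cache of that cgroup */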

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 include/linux/slab.h |    2 +-
 mm/memcontrol.c      |   13 +++++++++++--
 mm/slab_common.c     |   15 +++------------
 3 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index c265bec6a57d..f4d489aee6cb 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -118,7 +118,7 @@ struct kmem_cache *kmem_cache_create(const char *, size_t, size_t,
 #ifdef CONFIG_MEMCG_KMEM
 struct kmem_cache *memcg_create_kmem_cache(struct mem_cgroup *,
 					   struct kmem_cache *,
-					   const char *);
+					   char *);
 #endif
 void kmem_cache_destroy(struct kmem_cache *);
 int kmem_cache_shrink(struct kmem_cache *);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7361bd8b720a..16fcdbef1b7d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2995,6 +2995,7 @@ static void memcg_register_cache(struct mem_cgroup *memcg,
 	static char memcg_name_buf[NAME_MAX + 1]; /* protected by
 						     memcg_slab_mutex */
 	struct kmem_cache *cachep;
+	char *cache_name;
 	int id;
 
 	lockdep_assert_held(&memcg_slab_mutex);
@@ -3010,14 +3011,22 @@ static void memcg_register_cache(struct mem_cgroup *memcg,
 		return;
 
 	cgroup_name(memcg->css.cgroup, memcg_name_buf, NAME_MAX + 1);
-	cachep = memcg_create_kmem_cache(memcg, root_cache, memcg_name_buf);
+
+	cache_name = kasprintf(GFP_KERNEL, "%s(%d:%s)", root_cache->name,
+			       mem_cgroup_id(memcg), memcg_name_buf);
+	if (!cache_name)
+		return;
+
+	cachep = memcg_create_kmem_cache(memcg, root_cache, cache_name);
 	/*
 	 * If we could not create a memcg cache, do not complain, because
 	 * that's not critical at all as we can always proceed with the root
 	 * cache.
 	 */
-	if (!cachep)
+	if (!cachep) {
+		kfree(cache_name);
 		return;
+	}
 
 	css_get(&memcg->css);
 	list_add(&cachep->memcg_params->list, &memcg->memcg_slab_caches);
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 800314e2a075..8b486f05c414 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -345,7 +345,7 @@ EXPORT_SYMBOL(kmem_cache_create);
  * memcg_create_kmem_cache - Create a cache for a memory cgroup.
  * @memcg: The memory cgroup the new cache is for.
  * @root_cache: The parent of the new cache.
- * @memcg_name: The name of the memory cgroup (used for naming the new cache).
+ * @cache_name: The string to be used as the new cache name.
  *
  * This function attempts to create a kmem cache that will serve allocation
  * requests going from @memcg to @root_cache. The new cache inherits properties
@@ -353,31 +353,22 @@ EXPORT_SYMBOL(kmem_cache_create);
  */
 struct kmem_cache *memcg_create_kmem_cache(struct mem_cgroup *memcg,
 					   struct kmem_cache *root_cache,
-					   const char *memcg_name)
+					   char *cache_name)
 {
 	struct kmem_cache *s = NULL;
-	char *cache_name;
 
 	get_online_cpus();
 	get_online_mems();
 
 	mutex_lock(&slab_mutex);
 
-	cache_name = kasprintf(GFP_KERNEL, "%s(%d:%s)", root_cache->name,
-			       memcg_cache_id(memcg), memcg_name);
-	if (!cache_name)
-		goto out_unlock;
-
 	s = do_kmem_cache_create(cache_name, root_cache->object_size,
 				 root_cache->size, root_cache->align,
 				 root_cache->flags, root_cache->ctor,
 				 memcg, root_cache);
-	if (IS_ERR(s)) {
-		kfree(cache_name);
+	if (IS_ERR(s))
 		s = NULL;
-	}
 
-out_unlock:
 	mutex_unlock(&slab_mutex);
 
 	put_online_mems();
-- 
1.7.10.4



* [PATCH -mm 05/14] memcg: add pointer to owner cache to memcg_cache_params
  2014-09-21 15:14 [PATCH -mm 00/14] Per memcg slab shrinkers Vladimir Davydov
                   ` (3 preceding siblings ...)
  2014-09-21 15:14 ` [PATCH -mm 04/14] memcg: use mem_cgroup_id for per memcg cache naming Vladimir Davydov
@ 2014-09-21 15:14 ` Vladimir Davydov
  2014-09-21 15:14 ` [PATCH -mm 06/14] memcg: keep all children of each root cache on a list Vladimir Davydov
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 18+ messages in thread
From: Vladimir Davydov @ 2014-09-21 15:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Dave Chinner,
	Glauber Costa, Suleiman Souhlal, Kamezawa Hiroyuki, Tejun Heo,
	linux-kernel, linux-mm, cgroups

Currently, we don't keep a pointer to the owner kmem cache in the
memcg_cache_params struct, because we can always get the cache by
reading the slot corresponding to the owner memcg in the root cache's
memcg_caches array (see memcg_params_to_cache).

However, this means that offline css's, which can hang around as zombies
for quite a long time, will occupy slots in memcg_caches arrays, making
them grow larger and larger, which doesn't sound good. Therefore I'm
going to make memcg release the slots on css offline, which will render
memcg_params_to_cache invalid. So let's remove it and add a back pointer
to the owner cache to memcg_cache_params instead.
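
The net effect on cache lookups is roughly the following (a sketch
contrasting the helper removed below with the new back pointer):

        /* before: indirect lookup via the root cache's memcg_caches array */
        cachep = cache_from_memcg_idx(params->root_cache,
                                      memcg_cache_id(params->memcg));

        /* after: direct back pointer set up at cache creation time */
        cachep = params->cachep;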

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 include/linux/slab.h |    2 ++
 mm/memcontrol.c      |   19 +++----------------
 mm/slab_common.c     |    1 +
 3 files changed, 6 insertions(+), 16 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index f4d489aee6cb..c61344074c11 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -490,6 +490,7 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
  *
  * Child caches will hold extra metadata needed for its operation. Fields are:
  *
+ * @cachep: cache which this struct is for
  * @memcg: pointer to the memcg this cache belongs to
  * @list: list_head for the list of all caches in this memcg
  * @root_cache: pointer to the global, root cache, this cache was derived from
@@ -503,6 +504,7 @@ struct memcg_cache_params {
 			struct kmem_cache *memcg_caches[0];
 		};
 		struct {
+			struct kmem_cache *cachep;
 			struct mem_cgroup *memcg;
 			struct list_head list;
 			struct kmem_cache *root_cache;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 16fcdbef1b7d..9cb311c199be 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2836,19 +2836,6 @@ static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
 		memcg_kmem_is_active(memcg);
 }
 
-/*
- * This is a bit cumbersome, but it is rarely used and avoids a backpointer
- * in the memcg_cache_params struct.
- */
-static struct kmem_cache *memcg_params_to_cache(struct memcg_cache_params *p)
-{
-	struct kmem_cache *cachep;
-
-	VM_BUG_ON(p->is_root_cache);
-	cachep = p->root_cache;
-	return cache_from_memcg_idx(cachep, memcg_cache_id(p->memcg));
-}
-
 #ifdef CONFIG_SLABINFO
 static int mem_cgroup_slabinfo_read(struct seq_file *m, void *v)
 {
@@ -2862,7 +2849,7 @@ static int mem_cgroup_slabinfo_read(struct seq_file *m, void *v)
 
 	mutex_lock(&memcg_slab_mutex);
 	list_for_each_entry(params, &memcg->memcg_slab_caches, list)
-		cache_show(memcg_params_to_cache(params), m);
+		cache_show(params->cachep, m);
 	mutex_unlock(&memcg_slab_mutex);
 
 	return 0;
@@ -3120,7 +3107,6 @@ int __memcg_cleanup_cache_params(struct kmem_cache *s)
 
 static void memcg_unregister_all_caches(struct mem_cgroup *memcg)
 {
-	struct kmem_cache *cachep;
 	struct memcg_cache_params *params, *tmp;
 
 	if (!memcg_kmem_is_active(memcg))
@@ -3128,7 +3114,8 @@ static void memcg_unregister_all_caches(struct mem_cgroup *memcg)
 
 	mutex_lock(&memcg_slab_mutex);
 	list_for_each_entry_safe(params, tmp, &memcg->memcg_slab_caches, list) {
-		cachep = memcg_params_to_cache(params);
+		struct kmem_cache *cachep = params->cachep;
+
 		kmem_cache_shrink(cachep);
 		if (atomic_read(&cachep->memcg_params->nr_pages) == 0)
 			memcg_unregister_cache(cachep);
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 8b486f05c414..b5c9d90535af 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -107,6 +107,7 @@ static int memcg_alloc_cache_params(struct mem_cgroup *memcg,
 		return -ENOMEM;
 
 	if (memcg) {
+		s->memcg_params->cachep = s;
 		s->memcg_params->memcg = memcg;
 		s->memcg_params->root_cache = root_cache;
 	} else
-- 
1.7.10.4



* [PATCH -mm 06/14] memcg: keep all children of each root cache on a list
  2014-09-21 15:14 [PATCH -mm 00/14] Per memcg slab shrinkers Vladimir Davydov
                   ` (4 preceding siblings ...)
  2014-09-21 15:14 ` [PATCH -mm 05/14] memcg: add pointer to owner cache to memcg_cache_params Vladimir Davydov
@ 2014-09-21 15:14 ` Vladimir Davydov
  2014-09-21 15:14 ` [PATCH -mm 07/14] memcg: update memcg_caches array entries on the slab side Vladimir Davydov
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 18+ messages in thread
From: Vladimir Davydov @ 2014-09-21 15:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Dave Chinner,
	Glauber Costa, Suleiman Souhlal, Kamezawa Hiroyuki, Tejun Heo,
	linux-kernel, linux-mm, cgroups

Sometimes we need to iterate over all child caches of a particular root
cache, e.g. when we are destroying it. Currently, each root cache keeps
pointers to its children in its memcg_cache_params->memcg_caches array,
so we can enumerate all active kmemcg ids and dereference the
corresponding array slots to get the child caches.

However, I'm going to make memcg clear the slots on css offline to avoid
uncontrollable growth of the memcg_caches arrays. Hence, to iterate over
all memcg caches of a particular root cache, we have to link all of them
on a per-root-cache list.
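
Iterating over the children of a root cache then becomes a plain list
walk, roughly like this (a sketch; do_something_with() stands for
whatever the caller needs, e.g. get_slabinfo() in the hunks below):

        struct memcg_cache_params *params;

        list_for_each_entry(params,
                            &root_cache->memcg_params->memcg_caches_list,
                            memcg_caches_list)
                do_something_with(params->cachep);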

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 include/linux/memcontrol.h |   13 +------------
 include/linux/slab.h       |    1 +
 mm/memcontrol.c            |   20 ++++++--------------
 mm/slab.c                  |   40 +++++++++++++++++++++++-----------------
 mm/slab.h                  |    6 ------
 mm/slab_common.c           |   37 +++++++++++++++++++++++--------------
 mm/slub.c                  |   41 +++++++++++++++++++++++++----------------
 7 files changed, 79 insertions(+), 79 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c4e64d0e318d..e57a097cf393 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -417,14 +417,6 @@ extern struct static_key memcg_kmem_enabled_key;
 
 extern int memcg_limited_groups_array_size;
 
-/*
- * Helper macro to loop through all memcg-specific caches. Callers must still
- * check if the cache is valid (it is either valid or NULL).
- * the slab_mutex must be held when looping through those caches
- */
-#define for_each_memcg_cache_index(_idx)	\
-	for ((_idx) = 0; (_idx) < memcg_limited_groups_array_size; (_idx)++)
-
 static inline bool memcg_kmem_enabled(void)
 {
 	return static_key_false(&memcg_kmem_enabled_key);
@@ -460,7 +452,7 @@ __memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);
 int __memcg_charge_slab(struct kmem_cache *cachep, gfp_t gfp, int order);
 void __memcg_uncharge_slab(struct kmem_cache *cachep, int order);
 
-int __memcg_cleanup_cache_params(struct kmem_cache *s);
+void __memcg_cleanup_cache_params(struct kmem_cache *s);
 
 /**
  * memcg_kmem_newpage_charge: verify if a new kmem allocation is allowed.
@@ -553,9 +545,6 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
 	return __memcg_kmem_get_cache(cachep, gfp);
 }
 #else
-#define for_each_memcg_cache_index(_idx)	\
-	for (; NULL; )
-
 static inline bool memcg_kmem_enabled(void)
 {
 	return false;
diff --git a/include/linux/slab.h b/include/linux/slab.h
index c61344074c11..22388b4c6b88 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -498,6 +498,7 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
  */
 struct memcg_cache_params {
 	bool is_root_cache;
+	struct list_head memcg_caches_list;
 	union {
 		struct {
 			struct rcu_head rcu_head;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9cb311c199be..412fa220b9aa 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3085,24 +3085,16 @@ static inline void memcg_resume_kmem_account(void)
 	current->memcg_kmem_skip_account--;
 }
 
-int __memcg_cleanup_cache_params(struct kmem_cache *s)
+void __memcg_cleanup_cache_params(struct kmem_cache *s)
 {
-	struct kmem_cache *c;
-	int i, failed = 0;
+	struct memcg_cache_params *params, *tmp;
 
 	mutex_lock(&memcg_slab_mutex);
-	for_each_memcg_cache_index(i) {
-		c = cache_from_memcg_idx(s, i);
-		if (!c)
-			continue;
-
-		memcg_unregister_cache(c);
-
-		if (cache_from_memcg_idx(s, i))
-			failed++;
-	}
+	list_for_each_entry_safe(params, tmp,
+				 &s->memcg_params->memcg_caches_list,
+				 memcg_caches_list)
+		memcg_unregister_cache(params->cachep);
 	mutex_unlock(&memcg_slab_mutex);
-	return failed;
 }
 
 static void memcg_unregister_all_caches(struct mem_cgroup *memcg)
diff --git a/mm/slab.c b/mm/slab.c
index 56116acedacf..be10cad44969 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3769,29 +3769,35 @@ static int __do_tune_cpucache(struct kmem_cache *cachep, int limit,
 	return alloc_kmem_cache_node(cachep, gfp);
 }
 
+static void memcg_do_tune_cpucache(struct kmem_cache *cachep, int limit,
+				   int batchcount, int shared, gfp_t gfp)
+{
+#ifdef CONFIG_MEMCG_KMEM
+	struct memcg_cache_params *params;
+
+	if (!cachep->memcg_params ||
+	    !cachep->memcg_params->is_root_cache)
+		return;
+
+	lockdep_assert_held(&slab_mutex);
+	list_for_each_entry(params, &cachep->memcg_params->memcg_caches_list,
+			    memcg_caches_list) {
+		/* return value determined by the parent cache only */
+		__do_tune_cpucache(params->cachep, limit,
+				   batchcount, shared, gfp);
+	}
+#endif
+}
+
 static int do_tune_cpucache(struct kmem_cache *cachep, int limit,
 				int batchcount, int shared, gfp_t gfp)
 {
 	int ret;
-	struct kmem_cache *c = NULL;
-	int i = 0;
 
 	ret = __do_tune_cpucache(cachep, limit, batchcount, shared, gfp);
-
-	if (slab_state < FULL)
-		return ret;
-
-	if ((ret < 0) || !is_root_cache(cachep))
-		return ret;
-
-	VM_BUG_ON(!mutex_is_locked(&slab_mutex));
-	for_each_memcg_cache_index(i) {
-		c = cache_from_memcg_idx(cachep, i);
-		if (c)
-			/* return value determined by the parent cache only */
-			__do_tune_cpucache(c, limit, batchcount, shared, gfp);
-	}
-
+	if (!ret)
+		memcg_do_tune_cpucache(cachep, limit,
+				       batchcount, shared, gfp);
 	return ret;
 }
 
diff --git a/mm/slab.h b/mm/slab.h
index 026e7c393f0b..52b570932ba0 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -249,12 +249,6 @@ static inline const char *cache_name(struct kmem_cache *s)
 	return s->name;
 }
 
-static inline struct kmem_cache *
-cache_from_memcg_idx(struct kmem_cache *s, int idx)
-{
-	return NULL;
-}
-
 static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
 {
 	return s;
diff --git a/mm/slab_common.c b/mm/slab_common.c
index b5c9d90535af..d4add958843c 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -106,6 +106,7 @@ static int memcg_alloc_cache_params(struct mem_cgroup *memcg,
 	if (!s->memcg_params)
 		return -ENOMEM;
 
+	INIT_LIST_HEAD(&s->memcg_params->memcg_caches_list);
 	if (memcg) {
 		s->memcg_params->cachep = s;
 		s->memcg_params->memcg = memcg;
@@ -140,6 +141,10 @@ static int memcg_update_cache_params(struct kmem_cache *s, int num_memcgs)
 	       memcg_limited_groups_array_size * sizeof(void *));
 
 	new_params->is_root_cache = true;
+	INIT_LIST_HEAD(&new_params->memcg_caches_list);
+	if (cur_params)
+		list_replace(&cur_params->memcg_caches_list,
+			     &new_params->memcg_caches_list);
 
 	rcu_assign_pointer(s->memcg_params, new_params);
 	if (cur_params)
@@ -367,7 +372,10 @@ struct kmem_cache *memcg_create_kmem_cache(struct mem_cgroup *memcg,
 				 root_cache->size, root_cache->align,
 				 root_cache->flags, root_cache->ctor,
 				 memcg, root_cache);
-	if (IS_ERR(s))
+	if (!IS_ERR(s))
+		list_add(&s->memcg_params->memcg_caches_list,
+			 &root_cache->memcg_params->memcg_caches_list);
+	else
 		s = NULL;
 
 	mutex_unlock(&slab_mutex);
@@ -380,17 +388,15 @@ struct kmem_cache *memcg_create_kmem_cache(struct mem_cgroup *memcg,
 
 static int memcg_cleanup_cache_params(struct kmem_cache *s)
 {
-	int rc;
-
 	if (!s->memcg_params ||
 	    !s->memcg_params->is_root_cache)
 		return 0;
 
 	mutex_unlock(&slab_mutex);
-	rc = __memcg_cleanup_cache_params(s);
+	__memcg_cleanup_cache_params(s);
 	mutex_lock(&slab_mutex);
 
-	return rc;
+	return !list_empty(&s->memcg_params->memcg_caches_list);
 }
 #else
 static int memcg_cleanup_cache_params(struct kmem_cache *s)
@@ -427,6 +433,10 @@ void kmem_cache_destroy(struct kmem_cache *s)
 	}
 
 	list_del(&s->list);
+#ifdef CONFIG_MEMCG_KMEM
+	if (!is_root_cache(s))
+		list_del(&s->memcg_params->memcg_caches_list);
+#endif
 
 	mutex_unlock(&slab_mutex);
 	if (s->flags & SLAB_DESTROY_BY_RCU)
@@ -765,20 +775,18 @@ void slab_stop(struct seq_file *m, void *p)
 static void
 memcg_accumulate_slabinfo(struct kmem_cache *s, struct slabinfo *info)
 {
-	struct kmem_cache *c;
+#ifdef CONFIG_MEMCG_KMEM
+	struct memcg_cache_params *params;
 	struct slabinfo sinfo;
-	int i;
 
-	if (!is_root_cache(s))
+	if (!s->memcg_params ||
+	    !s->memcg_params->is_root_cache)
 		return;
 
-	for_each_memcg_cache_index(i) {
-		c = cache_from_memcg_idx(s, i);
-		if (!c)
-			continue;
-
+	list_for_each_entry(params, &s->memcg_params->memcg_caches_list,
+			    memcg_caches_list) {
 		memset(&sinfo, 0, sizeof(sinfo));
-		get_slabinfo(c, &sinfo);
+		get_slabinfo(params->cachep, &sinfo);
 
 		info->active_slabs += sinfo.active_slabs;
 		info->num_slabs += sinfo.num_slabs;
@@ -786,6 +794,7 @@ memcg_accumulate_slabinfo(struct kmem_cache *s, struct slabinfo *info)
 		info->active_objs += sinfo.active_objs;
 		info->num_objs += sinfo.num_objs;
 	}
+#endif
 }
 
 int cache_show(struct kmem_cache *s, struct seq_file *m)
diff --git a/mm/slub.c b/mm/slub.c
index fa86e5845093..1a1b85c585b3 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3688,6 +3688,24 @@ static struct kmem_cache *find_mergeable(size_t size, size_t align,
 	return NULL;
 }
 
+static void memcg_slab_merge(struct kmem_cache *s, size_t size)
+{
+#ifdef CONFIG_MEMCG_KMEM
+	struct kmem_cache *c;
+	struct memcg_cache_params *params;
+
+	if (!s->memcg_params)
+		return;
+
+	list_for_each_entry(params, &s->memcg_params->memcg_caches_list,
+			    memcg_caches_list) {
+		c = params->cachep;
+		c->object_size = s->object_size;
+		c->inuse = max_t(int, c->inuse, ALIGN(size, sizeof(void *)));
+	}
+#endif
+}
+
 struct kmem_cache *
 __kmem_cache_alias(const char *name, size_t size, size_t align,
 		   unsigned long flags, void (*ctor)(void *))
@@ -3696,9 +3714,6 @@ __kmem_cache_alias(const char *name, size_t size, size_t align,
 
 	s = find_mergeable(size, align, flags, name, ctor);
 	if (s) {
-		int i;
-		struct kmem_cache *c;
-
 		s->refcount++;
 
 		/*
@@ -3708,14 +3723,7 @@ __kmem_cache_alias(const char *name, size_t size, size_t align,
 		s->object_size = max(s->object_size, (int)size);
 		s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *)));
 
-		for_each_memcg_cache_index(i) {
-			c = cache_from_memcg_idx(s, i);
-			if (!c)
-				continue;
-			c->object_size = s->object_size;
-			c->inuse = max_t(int, c->inuse,
-					 ALIGN(size, sizeof(void *)));
-		}
+		memcg_slab_merge(s, size);
 
 		if (sysfs_slab_alias(s, name)) {
 			s->refcount--;
@@ -4977,7 +4985,7 @@ static ssize_t slab_attr_store(struct kobject *kobj,
 	err = attribute->store(s, buf, len);
 #ifdef CONFIG_MEMCG_KMEM
 	if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
-		int i;
+		struct memcg_cache_params *params;
 
 		mutex_lock(&slab_mutex);
 		if (s->max_attr_size < len)
@@ -5000,10 +5008,11 @@ static ssize_t slab_attr_store(struct kobject *kobj,
 		 * directly either failed or succeeded, in which case we loop
 		 * through the descendants with best-effort propagation.
 		 */
-		for_each_memcg_cache_index(i) {
-			struct kmem_cache *c = cache_from_memcg_idx(s, i);
-			if (c)
-				attribute->store(c, buf, len);
+		if (s->memcg_params) {
+			list_for_each_entry(params,
+					    &s->memcg_params->memcg_caches_list,
+					    memcg_caches_list)
+				attribute->store(params->cachep, buf, len);
 		}
 		mutex_unlock(&slab_mutex);
 	}
-- 
1.7.10.4



* [PATCH -mm 07/14] memcg: update memcg_caches array entries on the slab side
  2014-09-21 15:14 [PATCH -mm 00/14] Per memcg slab shrinkers Vladimir Davydov
                   ` (5 preceding siblings ...)
  2014-09-21 15:14 ` [PATCH -mm 06/14] memcg: keep all children of each root cache on a list Vladimir Davydov
@ 2014-09-21 15:14 ` Vladimir Davydov
  2014-09-21 15:14 ` [PATCH -mm 08/14] memcg: release memcg_cache_id on css offline Vladimir Davydov
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 18+ messages in thread
From: Vladimir Davydov @ 2014-09-21 15:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Dave Chinner,
	Glauber Costa, Suleiman Souhlal, Kamezawa Hiroyuki, Tejun Heo,
	linux-kernel, linux-mm, cgroups

I think that all manipulations of the memcg_caches array should happen
where it is defined, i.e. on the slab side. The array allocation and
relocation paths, as well as element lookups, already follow this
pattern (see e.g. cache_from_memcg_idx, memcg_update_all_caches), but
element updates don't. We still want to set up new array entries from
memcontrol.c (see memcg_{un,}register_cache), though that may change in
the future. For now, let's introduce a simple helper for updating the
array entries, cache_install_at_memcg_idx, to match cache_from_memcg_idx.
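
To illustrate the intended pairing, here is a minimal sketch; the
example_* wrappers are made up for illustration, only the two accessors
are real:

	/*
	 * The writer publishes a fully initialized cache; the write
	 * barrier inside cache_install_at_memcg_idx pairs with the read
	 * barrier inside cache_from_memcg_idx, so a lockless reader never
	 * sees a half-initialized kmem_cache.
	 */
	static void example_publish(struct kmem_cache *root_cache, int id,
				    struct kmem_cache *new_cache)
	{
		/* new_cache must be fully set up before this point */
		cache_install_at_memcg_idx(root_cache, id, new_cache);
	}

	static struct kmem_cache *example_lookup(struct kmem_cache *root_cache,
						 int id)
	{
		/* lockless; relies on the barrier pairing described above */
		return cache_from_memcg_idx(root_cache, id);
	}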

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 mm/memcontrol.c |   15 ++++-----------
 mm/slab.h       |   21 ++++++++++++++++++++-
 2 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 412fa220b9aa..9ae2627bd3b1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3018,15 +3018,8 @@ static void memcg_register_cache(struct mem_cgroup *memcg,
 	css_get(&memcg->css);
 	list_add(&cachep->memcg_params->list, &memcg->memcg_slab_caches);
 
-	/*
-	 * Since readers won't lock (see cache_from_memcg_idx()), we need a
-	 * barrier here to ensure nobody will see the kmem_cache partially
-	 * initialized.
-	 */
-	smp_wmb();
-
-	BUG_ON(root_cache->memcg_params->memcg_caches[id]);
-	root_cache->memcg_params->memcg_caches[id] = cachep;
+	BUG_ON(cache_from_memcg_idx(root_cache, id) != NULL);
+	cache_install_at_memcg_idx(root_cache, id, cachep);
 }
 
 static void memcg_unregister_cache(struct kmem_cache *cachep)
@@ -3043,8 +3036,8 @@ static void memcg_unregister_cache(struct kmem_cache *cachep)
 	memcg = cachep->memcg_params->memcg;
 	id = memcg_cache_id(memcg);
 
-	BUG_ON(root_cache->memcg_params->memcg_caches[id] != cachep);
-	root_cache->memcg_params->memcg_caches[id] = NULL;
+	BUG_ON(cache_from_memcg_idx(root_cache, id) != cachep);
+	cache_install_at_memcg_idx(root_cache, id, NULL);
 
 	list_del(&cachep->memcg_params->list);
 
diff --git a/mm/slab.h b/mm/slab.h
index 52b570932ba0..da798cfe5efa 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -201,12 +201,31 @@ cache_from_memcg_idx(struct kmem_cache *s, int idx)
 	/*
 	 * Make sure we will access the up-to-date value. The code updating
 	 * memcg_caches issues a write barrier to match this (see
-	 * memcg_register_cache()).
+	 * cache_install_at_memcg_idx()).
 	 */
 	smp_read_barrier_depends();
 	return cachep;
 }
 
+/*
+ * Update the entry at index @memcg_idx in the memcg_caches array of
+ * @root_cache. The caller must synchronize against concurrent updates to the
+ * same entry as well as guarantee that the memcg_caches array won't be
+ * relocated under our noses.
+ */
+static inline void cache_install_at_memcg_idx(struct kmem_cache *root_cache,
+				int memcg_idx, struct kmem_cache *memcg_cache)
+{
+	/*
+	 * Since readers won't lock (see cache_from_memcg_idx()), we need a
+	 * barrier here to ensure nobody will see the kmem_cache partially
+	 * initialized.
+	 */
+	smp_wmb();
+
+	root_cache->memcg_params->memcg_caches[memcg_idx] = memcg_cache;
+}
+
 static inline struct kmem_cache *memcg_root_cache(struct kmem_cache *s)
 {
 	if (is_root_cache(s))
-- 
1.7.10.4


* [PATCH -mm 08/14] memcg: release memcg_cache_id on css offline
  2014-09-21 15:14 [PATCH -mm 00/14] Per memcg slab shrinkers Vladimir Davydov
                   ` (6 preceding siblings ...)
  2014-09-21 15:14 ` [PATCH -mm 07/14] memcg: update memcg_caches array entries on the slab side Vladimir Davydov
@ 2014-09-21 15:14 ` Vladimir Davydov
  2014-09-21 15:14 ` [PATCH -mm 09/14] memcg: rename some cache id related variables Vladimir Davydov
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 18+ messages in thread
From: Vladimir Davydov @ 2014-09-21 15:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Dave Chinner,
	Glauber Costa, Suleiman Souhlal, Kamezawa Hiroyuki, Tejun Heo,
	linux-kernel, linux-mm, cgroups

The memcg_cache_id (mem_cgroup->kmemcg_id) is used as the index in root
cache's memcg_cache_params->memcg_caches array. Whenever a new kmem
active cgroup is created we must allocate an id for it. As a result, the
array size must always be greater than or equal to the number of memory
cgroups that have memcg_cache_id assigned to them.

Currently we release the id only on css free. This is bad, because a css
can linger as a zombie for quite a long time after css offline due to
pending charges, occupying an array slot and making the arrays grow
larger and larger. Although the number of arrays is limited - only root
kmem caches have them - we can still run into problems when creating new
kmem-active cgroups, because each of them may require array relocation,
and every relocation needs costly high-order page allocations if a lot
of ids are allocated. The situation will become even worse once
per-memcg list_lru is introduced, because each super block has a
list_lru, and the number of super blocks is practically unlimited.

This patch makes memcg release the memcg_cache_id on css offline. The id
of a dead memcg is set to its parent cgroup's id. Since ids are not used
after cgroup death today, we could simply set it to -1; however, once
per-memcg list_lru is introduced, we will have to deal with list_lru
entries accounted to the memcg somehow. I'm planning to move those
entries to the parent cgroup's list_lru, so we have to set kmemcg_id
accordingly.
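
For readability, the offline-time sequence can be summarized by the
following schematic sketch; example_release_cache_id is a made-up name,
the real logic is spread over memcg_unregister_all_caches below:

	static void example_release_cache_id(struct mem_cgroup *memcg)
	{
		int id = memcg_cache_id(memcg);
		struct mem_cgroup *parent = parent_mem_cgroup(memcg);
		int parent_id = (parent && !mem_cgroup_is_root(parent)) ?
				memcg_cache_id(parent) : -1;

		/* 1. clear this memcg's slots in all root caches' arrays */
		/* 2. redirect the dead css (and dead descendants) to the parent */
		memcg->kmemcg_id = parent_id;
		/* 3. return the id to the ida so the arrays stop growing */
		memcg_free_cache_id(id);
	}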

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 mm/memcontrol.c |   64 ++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 56 insertions(+), 8 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9ae2627bd3b1..d665d715090b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -646,14 +646,10 @@ int memcg_limited_groups_array_size;
 struct static_key memcg_kmem_enabled_key;
 EXPORT_SYMBOL(memcg_kmem_enabled_key);
 
-static void memcg_free_cache_id(int id);
-
 static void disarm_kmem_keys(struct mem_cgroup *memcg)
 {
-	if (memcg_kmem_is_active(memcg)) {
+	if (memcg_kmem_is_active(memcg))
 		static_key_slow_dec(&memcg_kmem_enabled_key);
-		memcg_free_cache_id(memcg->kmemcg_id);
-	}
 	/*
 	 * This check can't live in kmem destruction function,
 	 * since the charges will outlive the cgroup
@@ -2988,6 +2984,12 @@ static void memcg_register_cache(struct mem_cgroup *memcg,
 	lockdep_assert_held(&memcg_slab_mutex);
 
 	id = memcg_cache_id(memcg);
+	/*
+	 * This might happen if the cgroup was taken offline while the create
+	 * work was pending.
+	 */
+	if (id < 0)
+		return;
 
 	/*
 	 * Since per-memcg caches are created asynchronously on first
@@ -3036,8 +3038,15 @@ static void memcg_unregister_cache(struct kmem_cache *cachep)
 	memcg = cachep->memcg_params->memcg;
 	id = memcg_cache_id(memcg);
 
-	BUG_ON(cache_from_memcg_idx(root_cache, id) != cachep);
-	cache_install_at_memcg_idx(root_cache, id, NULL);
+	/*
+	 * This function can be called both after and before css offline. If
+	 * it's called before css offline, which happens on the root cache
+	 * destruction, we should clear the slot corresponding to the cache in
+	 * memcg_caches array. Otherwise the slot must have already been
+	 * cleared in memcg_unregister_all_caches.
+	 */
+	if (id >= 0 && cache_from_memcg_idx(root_cache, id) == cachep)
+		cache_install_at_memcg_idx(root_cache, id, NULL);
 
 	list_del(&cachep->memcg_params->list);
 
@@ -3093,19 +3102,49 @@ void __memcg_cleanup_cache_params(struct kmem_cache *s)
 static void memcg_unregister_all_caches(struct mem_cgroup *memcg)
 {
 	struct memcg_cache_params *params, *tmp;
+	int id = memcg_cache_id(memcg);
+	struct cgroup_subsys_state *iter;
+	struct mem_cgroup *parent;
+	int parent_id;
 
 	if (!memcg_kmem_is_active(memcg))
 		return;
 
+	/*
+	 * Clear the slots corresponding to this cgroup in all root caches'
+	 * memcg_params->memcg_caches arrays. If a cache is empty, remove it.
+	 */
 	mutex_lock(&memcg_slab_mutex);
 	list_for_each_entry_safe(params, tmp, &memcg->memcg_slab_caches, list) {
 		struct kmem_cache *cachep = params->cachep;
+		struct kmem_cache *root_cache = params->root_cache;
+
+		BUG_ON(cache_from_memcg_idx(root_cache, id) != cachep);
+		cache_install_at_memcg_idx(root_cache, id, NULL);
 
 		kmem_cache_shrink(cachep);
 		if (atomic_read(&cachep->memcg_params->nr_pages) == 0)
 			memcg_unregister_cache(cachep);
 	}
 	mutex_unlock(&memcg_slab_mutex);
+
+	/*
+	 * Change kmemcg_id of this cgroup and all its descendants (which are
+	 * already dead) to the parent cgroup's id.
+	 */
+	parent = parent_mem_cgroup(memcg);
+	if (parent && !mem_cgroup_is_root(parent)) {
+		BUG_ON(!memcg_kmem_is_active(parent));
+		parent_id = parent->kmemcg_id;
+	} else
+		parent_id = -1;
+
+	/* Safe, because we are holding the cgroup_mutex */
+	css_for_each_descendant_post(iter, &memcg->css)
+		mem_cgroup_from_css(iter)->kmemcg_id = parent_id;
+
+	/* The id is not used anymore, free it so that it could be reused. */
+	memcg_free_cache_id(id);
 }
 
 struct memcg_register_cache_work {
@@ -3204,6 +3243,7 @@ struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep,
 {
 	struct mem_cgroup *memcg;
 	struct kmem_cache *memcg_cachep;
+	int id;
 
 	VM_BUG_ON(!cachep->memcg_params);
 	VM_BUG_ON(!cachep->memcg_params->is_root_cache);
@@ -3217,7 +3257,15 @@ struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep,
 	if (!memcg_can_account_kmem(memcg))
 		goto out;
 
-	memcg_cachep = cache_from_memcg_idx(cachep, memcg_cache_id(memcg));
+	id = memcg_cache_id(memcg);
+	/*
+	 * This might happen if current was migrated to another cgroup and this
+	 * cgroup was taken offline after we issued mem_cgroup_from_task above.
+	 */
+	if (unlikely(id < 0))
+		goto out;
+
+	memcg_cachep = cache_from_memcg_idx(cachep, id);
 	if (likely(memcg_cachep)) {
 		cachep = memcg_cachep;
 		goto out;
-- 
1.7.10.4


* [PATCH -mm 09/14] memcg: rename some cache id related variables
  2014-09-21 15:14 [PATCH -mm 00/14] Per memcg slab shrinkers Vladimir Davydov
                   ` (7 preceding siblings ...)
  2014-09-21 15:14 ` [PATCH -mm 08/14] memcg: release memcg_cache_id on css offline Vladimir Davydov
@ 2014-09-21 15:14 ` Vladimir Davydov
  2014-09-21 15:14 ` [PATCH -mm 10/14] memcg: add rwsem to sync against memcg_caches arrays relocation Vladimir Davydov
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 18+ messages in thread
From: Vladimir Davydov @ 2014-09-21 15:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Dave Chinner,
	Glauber Costa, Suleiman Souhlal, Kamezawa Hiroyuki, Tejun Heo,
	linux-kernel, linux-mm, cgroups

memcg_limited_groups_array_size, which defines the size of the
memcg_caches arrays, is a rather cumbersome name, and it gives no hint
that it has anything to do with kmem caches. So let's rename it to
memcg_max_cache_ids: it's concise and points directly at memcg_cache_id.

Also, rename kmem_limited_groups to memcg_cache_ida, because it's not a
container for groups, but the memcg_cache_id allocator.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 include/linux/memcontrol.h |    9 ++++++++-
 mm/memcontrol.c            |   19 +++++++++----------
 mm/slab_common.c           |    4 ++--
 3 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e57a097cf393..7c1bf0a84950 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -415,7 +415,12 @@ static inline void sock_release_memcg(struct sock *sk)
 #ifdef CONFIG_MEMCG_KMEM
 extern struct static_key memcg_kmem_enabled_key;
 
-extern int memcg_limited_groups_array_size;
+/*
+ * The maximal number of kmem-active memory cgroups that can exist on the
+ * system. May grow, but never shrinks. The value returned by memcg_cache_id()
+ * is always less.
+ */
+extern int memcg_max_cache_ids;
 
 static inline bool memcg_kmem_enabled(void)
 {
@@ -545,6 +550,8 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
 	return __memcg_kmem_get_cache(cachep, gfp);
 }
 #else
+#define memcg_max_cache_ids 0
+
 static inline bool memcg_kmem_enabled(void)
 {
 	return false;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d665d715090b..0020824dee96 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -615,12 +615,11 @@ static void disarm_sock_keys(struct mem_cgroup *memcg)
  *  memcgs, and none but the 200th is kmem-limited, we'd have to have a
  *  200 entry array for that.
  *
- * The current size of the caches array is stored in
- * memcg_limited_groups_array_size.  It will double each time we have to
- * increase it.
+ * The current size of the caches array is stored in memcg_max_cache_ids. It
+ * will double each time we have to increase it.
  */
-static DEFINE_IDA(kmem_limited_groups);
-int memcg_limited_groups_array_size;
+static DEFINE_IDA(memcg_cache_ida);
+int memcg_max_cache_ids;
 
 /*
  * MIN_SIZE is different than 1, because we would like to avoid going through
@@ -2926,12 +2925,12 @@ static int memcg_alloc_cache_id(void)
 	int id, size;
 	int err;
 
-	id = ida_simple_get(&kmem_limited_groups,
+	id = ida_simple_get(&memcg_cache_ida,
 			    0, MEMCG_CACHES_MAX_SIZE, GFP_KERNEL);
 	if (id < 0)
 		return id;
 
-	if (id < memcg_limited_groups_array_size)
+	if (id < memcg_max_cache_ids)
 		return id;
 
 	/*
@@ -2950,7 +2949,7 @@ static int memcg_alloc_cache_id(void)
 	mutex_unlock(&memcg_slab_mutex);
 
 	if (err) {
-		ida_simple_remove(&kmem_limited_groups, id);
+		ida_simple_remove(&memcg_cache_ida, id);
 		return err;
 	}
 	return id;
@@ -2959,7 +2958,7 @@ static int memcg_alloc_cache_id(void)
 
 static void memcg_free_cache_id(int id)
 {
-	ida_simple_remove(&kmem_limited_groups, id);
+	ida_simple_remove(&memcg_cache_ida, id);
 }
 
 /*
@@ -2969,7 +2968,7 @@ static void memcg_free_cache_id(int id)
  */
 void memcg_update_array_size(int num)
 {
-	memcg_limited_groups_array_size = num;
+	memcg_max_cache_ids = num;
 }
 
 static void memcg_register_cache(struct mem_cgroup *memcg,
diff --git a/mm/slab_common.c b/mm/slab_common.c
index d4add958843c..cc6e18437f6c 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -98,7 +98,7 @@ static int memcg_alloc_cache_params(struct mem_cgroup *memcg,
 
 	if (!memcg) {
 		size = offsetof(struct memcg_cache_params, memcg_caches);
-		size += memcg_limited_groups_array_size * sizeof(void *);
+		size += memcg_max_cache_ids * sizeof(void *);
 	} else
 		size = sizeof(struct memcg_cache_params);
 
@@ -138,7 +138,7 @@ static int memcg_update_cache_params(struct kmem_cache *s, int num_memcgs)
 
 	cur_params = s->memcg_params;
 	memcpy(new_params->memcg_caches, cur_params->memcg_caches,
-	       memcg_limited_groups_array_size * sizeof(void *));
+	       memcg_max_cache_ids * sizeof(void *));
 
 	new_params->is_root_cache = true;
 	INIT_LIST_HEAD(&new_params->memcg_caches_list);
-- 
1.7.10.4


* [PATCH -mm 10/14] memcg: add rwsem to sync against memcg_caches arrays relocation
  2014-09-21 15:14 [PATCH -mm 00/14] Per memcg slab shrinkers Vladimir Davydov
                   ` (8 preceding siblings ...)
  2014-09-21 15:14 ` [PATCH -mm 09/14] memcg: rename some cache id related variables Vladimir Davydov
@ 2014-09-21 15:14 ` Vladimir Davydov
  2014-09-21 15:14 ` [PATCH -mm 11/14] list_lru: get rid of ->active_nodes Vladimir Davydov
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 18+ messages in thread
From: Vladimir Davydov @ 2014-09-21 15:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Dave Chinner,
	Glauber Costa, Suleiman Souhlal, Kamezawa Hiroyuki, Tejun Heo,
	linux-kernel, linux-mm, cgroups

We need a stable value of memcg_max_cache_ids in kmem_cache_create()
(memcg_alloc_cache_params() wants it for root caches), where we only
hold the slab_mutex and no memcg-related locks. As a result, we have to
update memcg_max_cache_ids under the slab_mutex, which we can only take
on the slab side. This looks awkward and will become even worse when
per-memcg list_lru is introduced, which also wants stable access to
memcg_max_cache_ids.

To get rid of this dependency between memcg_max_cache_ids and the
slab_mutex, this patch introduces a special rwsem. The rwsem is held for
writing during memcg_caches arrays relocation and memcg_max_cache_ids
updates. Therefore one can take it for reading to get stable access to
the memcg_caches arrays and/or memcg_max_cache_ids.

Currently the semaphore is taken for reading only from
kmem_cache_create, right before taking the slab_mutex, so right now
there's no point in using an rwsem instead of a mutex. However, once
list_lru is made per-memcg, it will allow list_lru initializations to
proceed concurrently.
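
The intended reader-side pattern looks roughly like this; the function
below is hypothetical, only the lock/unlock helpers and
memcg_max_cache_ids are real:

	/*
	 * Size an allocation by memcg_max_cache_ids under the rwsem so the
	 * value cannot grow between sizing the array and publishing it.
	 */
	static int example_alloc_per_memcg_array(void ***arrp)
	{
		void **arr;

		memcg_lock_cache_id_space();
		arr = kcalloc(memcg_max_cache_ids, sizeof(void *), GFP_KERNEL);
		/* ... publish arr where the array update path can find it ... */
		memcg_unlock_cache_id_space();

		if (!arr)
			return -ENOMEM;
		*arrp = arr;
		return 0;
	}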

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 include/linux/memcontrol.h |   15 +++++++++++++--
 mm/memcontrol.c            |   28 ++++++++++++++++++----------
 mm/slab_common.c           |   10 +++++-----
 3 files changed, 36 insertions(+), 17 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7c1bf0a84950..f2cd342d6544 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -419,8 +419,13 @@ extern struct static_key memcg_kmem_enabled_key;
  * The maximal number of kmem-active memory cgroups that can exist on the
  * system. May grow, but never shrinks. The value returned by memcg_cache_id()
  * is always less.
+ *
+ * To prevent memcg_max_cache_ids from growing, memcg_lock_cache_id_space() can
+ * be used. It's backed by rw semaphore.
  */
 extern int memcg_max_cache_ids;
+extern void memcg_lock_cache_id_space(void);
+extern void memcg_unlock_cache_id_space(void);
 
 static inline bool memcg_kmem_enabled(void)
 {
@@ -449,8 +454,6 @@ void __memcg_kmem_uncharge_pages(struct page *page, int order);
 
 int memcg_cache_id(struct mem_cgroup *memcg);
 
-void memcg_update_array_size(int num_groups);
-
 struct kmem_cache *
 __memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);
 
@@ -587,6 +590,14 @@ static inline int memcg_cache_id(struct mem_cgroup *memcg)
 	return -1;
 }
 
+static inline void memcg_lock_cache_id_space(void)
+{
+}
+
+static inline void memcg_unlock_cache_id_space(void)
+{
+}
+
 static inline struct kmem_cache *
 memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0020824dee96..0c6d412ae5a3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -621,6 +621,19 @@ static void disarm_sock_keys(struct mem_cgroup *memcg)
 static DEFINE_IDA(memcg_cache_ida);
 int memcg_max_cache_ids;
 
+/* Protects memcg_max_cache_ids */
+static DECLARE_RWSEM(memcg_cache_id_space_sem);
+
+void memcg_lock_cache_id_space(void)
+{
+	down_read(&memcg_cache_id_space_sem);
+}
+
+void memcg_unlock_cache_id_space(void)
+{
+	up_read(&memcg_cache_id_space_sem);
+}
+
 /*
  * MIN_SIZE is different than 1, because we would like to avoid going through
  * the alloc/free process all the time. In a small machine, 4 kmem-limited
@@ -2937,6 +2950,7 @@ static int memcg_alloc_cache_id(void)
 	 * There's no space for the new id in memcg_caches arrays,
 	 * so we have to grow them.
 	 */
+	down_write(&memcg_cache_id_space_sem);
 
 	size = 2 * (id + 1);
 	if (size < MEMCG_CACHES_MIN_SIZE)
@@ -2948,6 +2962,10 @@ static int memcg_alloc_cache_id(void)
 	err = memcg_update_all_caches(size);
 	mutex_unlock(&memcg_slab_mutex);
 
+	if (!err)
+		memcg_max_cache_ids = size;
+	up_write(&memcg_cache_id_space_sem);
+
 	if (err) {
 		ida_simple_remove(&memcg_cache_ida, id);
 		return err;
@@ -2961,16 +2979,6 @@ static void memcg_free_cache_id(int id)
 	ida_simple_remove(&memcg_cache_ida, id);
 }
 
-/*
- * We should update the current array size iff all caches updates succeed. This
- * can only be done from the slab side. The slab mutex needs to be held when
- * calling this.
- */
-void memcg_update_array_size(int num)
-{
-	memcg_max_cache_ids = num;
-}
-
 static void memcg_register_cache(struct mem_cgroup *memcg,
 				 struct kmem_cache *root_cache)
 {
diff --git a/mm/slab_common.c b/mm/slab_common.c
index cc6e18437f6c..4e2b9040a49f 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -157,8 +157,8 @@ int memcg_update_all_caches(int num_memcgs)
 {
 	struct kmem_cache *s;
 	int ret = 0;
-	mutex_lock(&slab_mutex);
 
+	mutex_lock(&slab_mutex);
 	list_for_each_entry(s, &slab_caches, list) {
 		if (!is_root_cache(s))
 			continue;
@@ -169,11 +169,8 @@ int memcg_update_all_caches(int num_memcgs)
 		 * up to this point in an updated state.
 		 */
 		if (ret)
-			goto out;
+			break;
 	}
-
-	memcg_update_array_size(num_memcgs);
-out:
 	mutex_unlock(&slab_mutex);
 	return ret;
 }
@@ -290,6 +287,8 @@ kmem_cache_create(const char *name, size_t size, size_t align,
 
 	get_online_cpus();
 	get_online_mems();
+	/* memcg_alloc_cache_params() needs a stable memcg_max_cache_ids */
+	memcg_lock_cache_id_space();
 
 	mutex_lock(&slab_mutex);
 
@@ -328,6 +327,7 @@ kmem_cache_create(const char *name, size_t size, size_t align,
 out_unlock:
 	mutex_unlock(&slab_mutex);
 
+	memcg_unlock_cache_id_space();
 	put_online_mems();
 	put_online_cpus();
 
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH -mm 11/14] list_lru: get rid of ->active_nodes
  2014-09-21 15:14 [PATCH -mm 00/14] Per memcg slab shrinkers Vladimir Davydov
                   ` (9 preceding siblings ...)
  2014-09-21 15:14 ` [PATCH -mm 10/14] memcg: add rwsem to sync against memcg_caches arrays relocation Vladimir Davydov
@ 2014-09-21 15:14 ` Vladimir Davydov
  2014-09-21 15:14 ` [PATCH -mm 12/14] list_lru: organize all list_lrus to list Vladimir Davydov
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 18+ messages in thread
From: Vladimir Davydov @ 2014-09-21 15:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Dave Chinner,
	Glauber Costa, Suleiman Souhlal, Kamezawa Hiroyuki, Tejun Heo,
	linux-kernel, linux-mm, cgroups

The active_nodes mask allows us to skip empty nodes when walking over
list_lru items from all nodes in list_lru_count/walk. However, these
functions are never called from really hot paths, so we hardly need this
kind of optimization there. OTOH, removing the mask will make it easier
to make list_lru per-memcg.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 include/linux/list_lru.h |    5 ++---
 mm/list_lru.c            |   10 +++-------
 2 files changed, 5 insertions(+), 10 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index f500a2e39b13..53c1d6b78270 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -31,7 +31,6 @@ struct list_lru_node {
 
 struct list_lru {
 	struct list_lru_node	*node;
-	nodemask_t		active_nodes;
 };
 
 void list_lru_destroy(struct list_lru *lru);
@@ -94,7 +93,7 @@ static inline unsigned long list_lru_count(struct list_lru *lru)
 	long count = 0;
 	int nid;
 
-	for_each_node_mask(nid, lru->active_nodes)
+	for_each_node_state(nid, N_NORMAL_MEMORY)
 		count += list_lru_count_node(lru, nid);
 
 	return count;
@@ -142,7 +141,7 @@ list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
 	long isolated = 0;
 	int nid;
 
-	for_each_node_mask(nid, lru->active_nodes) {
+	for_each_node_state(nid, N_NORMAL_MEMORY) {
 		isolated += list_lru_walk_node(lru, nid, isolate,
 					       cb_arg, &nr_to_walk);
 		if (nr_to_walk <= 0)
diff --git a/mm/list_lru.c b/mm/list_lru.c
index f1a0db194173..07e198c77888 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -19,8 +19,7 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item)
 	WARN_ON_ONCE(nlru->nr_items < 0);
 	if (list_empty(item)) {
 		list_add_tail(item, &nlru->list);
-		if (nlru->nr_items++ == 0)
-			node_set(nid, lru->active_nodes);
+		nlru->nr_items++;
 		spin_unlock(&nlru->lock);
 		return true;
 	}
@@ -37,8 +36,7 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item)
 	spin_lock(&nlru->lock);
 	if (!list_empty(item)) {
 		list_del_init(item);
-		if (--nlru->nr_items == 0)
-			node_clear(nid, lru->active_nodes);
+		nlru->nr_items--;
 		WARN_ON_ONCE(nlru->nr_items < 0);
 		spin_unlock(&nlru->lock);
 		return true;
@@ -90,8 +88,7 @@ restart:
 		case LRU_REMOVED_RETRY:
 			assert_spin_locked(&nlru->lock);
 		case LRU_REMOVED:
-			if (--nlru->nr_items == 0)
-				node_clear(nid, lru->active_nodes);
+			nlru->nr_items--;
 			WARN_ON_ONCE(nlru->nr_items < 0);
 			isolated++;
 			/*
@@ -133,7 +130,6 @@ int list_lru_init_key(struct list_lru *lru, struct lock_class_key *key)
 	if (!lru->node)
 		return -ENOMEM;
 
-	nodes_clear(lru->active_nodes);
 	for (i = 0; i < nr_node_ids; i++) {
 		spin_lock_init(&lru->node[i].lock);
 		if (key)
-- 
1.7.10.4


* [PATCH -mm 12/14] list_lru: organize all list_lrus to list
  2014-09-21 15:14 [PATCH -mm 00/14] Per memcg slab shrinkers Vladimir Davydov
                   ` (10 preceding siblings ...)
  2014-09-21 15:14 ` [PATCH -mm 11/14] list_lru: get rid of ->active_nodes Vladimir Davydov
@ 2014-09-21 15:14 ` Vladimir Davydov
  2014-09-21 15:14 ` [PATCH -mm 13/14] list_lru: introduce per-memcg lists Vladimir Davydov
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 18+ messages in thread
From: Vladimir Davydov @ 2014-09-21 15:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Dave Chinner,
	Glauber Costa, Suleiman Souhlal, Kamezawa Hiroyuki, Tejun Heo,
	linux-kernel, linux-mm, cgroups

I need it for making list_lru memcg-aware.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 include/linux/list_lru.h |    3 +++
 mm/list_lru.c            |   29 +++++++++++++++++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 53c1d6b78270..ee9486ac0621 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -31,6 +31,9 @@ struct list_lru_node {
 
 struct list_lru {
 	struct list_lru_node	*node;
+#ifdef CONFIG_MEMCG_KMEM
+	struct list_head	list;
+#endif
 };
 
 void list_lru_destroy(struct list_lru *lru);
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 07e198c77888..53086eda7942 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -10,6 +10,33 @@
 #include <linux/list_lru.h>
 #include <linux/slab.h>
 
+#ifdef CONFIG_MEMCG_KMEM
+static LIST_HEAD(list_lrus);
+static DEFINE_SPINLOCK(list_lrus_lock);
+
+static void list_lru_register(struct list_lru *lru)
+{
+	spin_lock(&list_lrus_lock);
+	list_add(&lru->list, &list_lrus);
+	spin_unlock(&list_lrus_lock);
+}
+
+static void list_lru_unregister(struct list_lru *lru)
+{
+	spin_lock(&list_lrus_lock);
+	list_del(&lru->list);
+	spin_unlock(&list_lrus_lock);
+}
+#else
+static void list_lru_register(struct list_lru *lru)
+{
+}
+
+static void list_lru_unregister(struct list_lru *lru)
+{
+}
+#endif /* CONFIG_MEMCG_KMEM */
+
 bool list_lru_add(struct list_lru *lru, struct list_head *item)
 {
 	int nid = page_to_nid(virt_to_page(item));
@@ -137,12 +164,14 @@ int list_lru_init_key(struct list_lru *lru, struct lock_class_key *key)
 		INIT_LIST_HEAD(&lru->node[i].list);
 		lru->node[i].nr_items = 0;
 	}
+	list_lru_register(lru);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(list_lru_init_key);
 
 void list_lru_destroy(struct list_lru *lru)
 {
+	list_lru_unregister(lru);
 	kfree(lru->node);
 }
 EXPORT_SYMBOL_GPL(list_lru_destroy);
-- 
1.7.10.4


* [PATCH -mm 13/14] list_lru: introduce per-memcg lists
  2014-09-21 15:14 [PATCH -mm 00/14] Per memcg slab shrinkers Vladimir Davydov
                   ` (11 preceding siblings ...)
  2014-09-21 15:14 ` [PATCH -mm 12/14] list_lru: organize all list_lrus to list Vladimir Davydov
@ 2014-09-21 15:14 ` Vladimir Davydov
  2014-09-21 15:14 ` [PATCH -mm 14/14] fs: make shrinker memcg aware Vladimir Davydov
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 18+ messages in thread
From: Vladimir Davydov @ 2014-09-21 15:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Dave Chinner,
	Glauber Costa, Suleiman Souhlal, Kamezawa Hiroyuki, Tejun Heo,
	linux-kernel, linux-mm, cgroups

There are several FS shrinkers, including super_block::s_shrink, that
keep reclaimable objects in the list_lru structure. Hence to turn them
into memcg-aware shrinkers, it is enough to make list_lru per-memcg.

This patch does the trick. It adds an array of lru lists to the
list_lru_node structure (per-node part of the list_lru), one for each
kmem-active memcg, and dispatches every item addition or removal to the
list corresponding to the memcg which the item is accounted to. So now
the list_lru structure is not just per node, but per node and per memcg.

Not all list_lrus need this feature, so this patch also adds the
memcg_aware bool argument to list_lru_init. One has to pass true to
make the list_lru memcg-aware.

Just like the per-memcg cache arrays, the arrays of per-memcg lists are
indexed by memcg_cache_id, so we must grow them whenever
memcg_max_cache_ids is increased. Hence we introduce a callback,
memcg_update_all_list_lrus, which memcg_alloc_cache_id invokes when the
id space is full.

Since on memcg destruction (css offline) we release its cache id to
avoid uncontrollable per-memcg arrays growth, we must deal with
list_lrus corresponding to dead memcgs somehow. In this patch, all
elements from the lru lists corresponding to the dead memcg are moved to
its parent's lists (reparented). This is somewhat tricky, because it can
race with concurrent lru walkers and item insertions/removals. To make
this work, first, we have to drop the nr_items < 0 checks, because
nr_items can go negative for a short time during reparenting. Secondly,
reparenting imposes a limitation on the locking scheme of the list_lru:
we need a stable lock for all per-memcg lrus, which is why all per-memcg
lrus on the same node are protected by the node's list_lru_node->lock.
This is similar to how lruvec locking works.
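
From a user's point of view it boils down to the following; example_lru
and the two functions are hypothetical, the list_lru calls are the ones
introduced by this patch:

	static struct list_lru example_lru;

	static int example_lru_setup(void)
	{
		/* the second argument requests per-memcg lists */
		return list_lru_init(&example_lru, true);
	}

	static unsigned long example_count(struct shrink_control *sc)
	{
		/* items are dispatched per memcg, so count only sc->memcg's list */
		return list_lru_count_one(&example_lru, sc->nid, sc->memcg);
	}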

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 fs/gfs2/main.c             |    2 +-
 fs/super.c                 |    4 +-
 fs/xfs/xfs_buf.c           |    2 +-
 fs/xfs/xfs_qm.c            |    2 +-
 include/linux/list_lru.h   |   85 +++++-----
 include/linux/memcontrol.h |    7 +
 mm/list_lru.c              |  403 +++++++++++++++++++++++++++++++++++++++++---
 mm/memcontrol.c            |   36 ++++
 mm/workingset.c            |    3 +-
 9 files changed, 471 insertions(+), 73 deletions(-)

diff --git a/fs/gfs2/main.c b/fs/gfs2/main.c
index 82b6ac829656..fb51e99a0281 100644
--- a/fs/gfs2/main.c
+++ b/fs/gfs2/main.c
@@ -84,7 +84,7 @@ static int __init init_gfs2_fs(void)
 	if (error)
 		return error;
 
-	error = list_lru_init(&gfs2_qd_lru);
+	error = list_lru_init(&gfs2_qd_lru, false);
 	if (error)
 		goto fail_lru;
 
diff --git a/fs/super.c b/fs/super.c
index a2b735a42e74..a82e97b0b8b9 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -189,9 +189,9 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 	INIT_HLIST_BL_HEAD(&s->s_anon);
 	INIT_LIST_HEAD(&s->s_inodes);
 
-	if (list_lru_init(&s->s_dentry_lru))
+	if (list_lru_init(&s->s_dentry_lru, false))
 		goto fail;
-	if (list_lru_init(&s->s_inode_lru))
+	if (list_lru_init(&s->s_inode_lru, false))
 		goto fail;
 
 	init_rwsem(&s->s_umount);
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index e02a49a30f89..1f789f805dcc 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1681,7 +1681,7 @@ xfs_alloc_buftarg(
 	if (xfs_setsize_buftarg_early(btp, bdev))
 		goto error;
 
-	if (list_lru_init(&btp->bt_lru))
+	if (list_lru_init(&btp->bt_lru, false))
 		goto error;
 
 	btp->bt_shrinker.count_objects = xfs_buftarg_shrink_count;
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index 51167f44c408..a6a56197656f 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -576,7 +576,7 @@ xfs_qm_init_quotainfo(
 
 	qinf = mp->m_quotainfo = kmem_zalloc(sizeof(xfs_quotainfo_t), KM_SLEEP);
 
-	error = list_lru_init(&qinf->qi_lru);
+	error = list_lru_init(&qinf->qi_lru, false);
 	if (error)
 		goto out_free_qinf;
 
diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index ee9486ac0621..9fe8b09f496e 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -22,11 +22,26 @@ enum lru_status {
 				   internally, but has to return locked. */
 };
 
-struct list_lru_node {
-	spinlock_t		lock;
+struct list_lru_one {
 	struct list_head	list;
-	/* kept as signed so we can catch imbalance bugs */
+	/* may become negative during memcg reparenting */
 	long			nr_items;
+};
+
+struct list_lru_memcg {
+	/* array of per-memcg lists, indexed by memcg_cache_id */
+	struct list_lru_one	*lru[0];
+};
+
+struct list_lru_node {
+	/* protects all lists on the node, including per-memcg */
+	spinlock_t		lock;
+	/* global list, used by the root cgroup in memcg-aware lrus */
+	struct list_lru_one	lru;
+#ifdef CONFIG_MEMCG_KMEM
+	/* for memcg-aware lrus points to per-memcg lists, otherwise NULL */
+	struct list_lru_memcg	*memcg_lrus;
+#endif
 } ____cacheline_aligned_in_smp;
 
 struct list_lru {
@@ -36,11 +51,17 @@ struct list_lru {
 #endif
 };
 
+#ifdef CONFIG_MEMCG_KMEM
+int memcg_update_all_list_lrus(int num_memcgs);
+void memcg_reparent_all_list_lrus(int from_idx, int to_idx);
+#endif
+
 void list_lru_destroy(struct list_lru *lru);
-int list_lru_init_key(struct list_lru *lru, struct lock_class_key *key);
-static inline int list_lru_init(struct list_lru *lru)
+int list_lru_init_key(struct list_lru *lru, bool memcg_aware,
+		      struct lock_class_key *key);
+static inline int list_lru_init(struct list_lru *lru, bool memcg_aware)
 {
-	return list_lru_init_key(lru, NULL);
+	return list_lru_init_key(lru, memcg_aware, NULL);
 }
 
 /**
@@ -75,39 +96,32 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item);
 bool list_lru_del(struct list_lru *lru, struct list_head *item);
 
 /**
- * list_lru_count_node: return the number of objects currently held by @lru
+ * list_lru_count_one: return the number of objects currently held by @lru
  * @lru: the lru pointer.
  * @nid: the node id to count from.
+ * @memcg: the memcg to count from.
  *
  * Always return a non-negative number, 0 for empty lists. There is no
  * guarantee that the list is not updated while the count is being computed.
  * Callers that want such a guarantee need to provide an outer lock.
  */
-unsigned long list_lru_count_node(struct list_lru *lru, int nid);
+unsigned long list_lru_count_one(struct list_lru *lru,
+				 int nid, struct mem_cgroup *memcg);
+unsigned long list_lru_count(struct list_lru *lru);
 
 static inline unsigned long list_lru_shrink_count(struct list_lru *lru,
 						  struct shrink_control *sc)
 {
-	return list_lru_count_node(lru, sc->nid);
-}
-
-static inline unsigned long list_lru_count(struct list_lru *lru)
-{
-	long count = 0;
-	int nid;
-
-	for_each_node_state(nid, N_NORMAL_MEMORY)
-		count += list_lru_count_node(lru, nid);
-
-	return count;
+	return list_lru_count_one(lru, sc->nid, sc->memcg);
 }
 
 typedef enum lru_status
 (*list_lru_walk_cb)(struct list_head *item, spinlock_t *lock, void *cb_arg);
 /**
- * list_lru_walk_node: walk a list_lru, isolating and disposing freeable items.
+ * list_lru_walk_one: walk a list_lru, isolating and disposing freeable items.
  * @lru: the lru pointer.
  * @nid: the node id to scan from.
+ * @memcg: the memcg to scan from.
  * @isolate: callback function that is resposible for deciding what to do with
  *  the item currently being scanned
  * @cb_arg: opaque type that will be passed to @isolate
@@ -125,31 +139,18 @@ typedef enum lru_status
  *
  * Return value: the number of objects effectively removed from the LRU.
  */
-unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
-				 list_lru_walk_cb isolate, void *cb_arg,
-				 unsigned long *nr_to_walk);
+unsigned long list_lru_walk_one(struct list_lru *lru,
+				int nid, struct mem_cgroup *memcg,
+				list_lru_walk_cb isolate, void *cb_arg,
+				unsigned long *nr_to_walk);
+unsigned long list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
+			    void *cb_arg, unsigned long nr_to_walk);
 
 static inline unsigned long
 list_lru_shrink_walk(struct list_lru *lru, struct shrink_control *sc,
 		     list_lru_walk_cb isolate, void *cb_arg)
 {
-	return list_lru_walk_node(lru, sc->nid, isolate, cb_arg,
-				  &sc->nr_to_scan);
-}
-
-static inline unsigned long
-list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
-	      void *cb_arg, unsigned long nr_to_walk)
-{
-	long isolated = 0;
-	int nid;
-
-	for_each_node_state(nid, N_NORMAL_MEMORY) {
-		isolated += list_lru_walk_node(lru, nid, isolate,
-					       cb_arg, &nr_to_walk);
-		if (nr_to_walk <= 0)
-			break;
-	}
-	return isolated;
+	return list_lru_walk_one(lru, sc->nid, sc->memcg, isolate, cb_arg,
+				 &sc->nr_to_scan);
 }
 #endif /* _LRU_LIST_H */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f2cd342d6544..d1ab65b4ce02 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -435,6 +435,8 @@ static inline bool memcg_kmem_enabled(void)
 bool memcg_kmem_is_active(struct mem_cgroup *memcg);
 bool memcg_kmem_is_active_subtree(struct mem_cgroup *memcg);
 
+struct mem_cgroup *mem_cgroup_from_kmem(void *ptr);
+
 /*
  * In general, we'll do everything in our power to not incur in any overhead
  * for non-memcg users for the kmem functions. Not even a function call, if we
@@ -570,6 +572,11 @@ static inline bool memcg_kmem_is_active_subtree(struct mem_cgroup *memcg)
 	return false;
 }
 
+static inline struct mem_cgroup *mem_cgroup_from_kmem(void *ptr)
+{
+	return NULL;
+}
+
 static inline bool
 memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
 {
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 53086eda7942..f10529e47788 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -9,11 +9,28 @@
 #include <linux/mm.h>
 #include <linux/list_lru.h>
 #include <linux/slab.h>
+#include <linux/memcontrol.h>
 
 #ifdef CONFIG_MEMCG_KMEM
 static LIST_HEAD(list_lrus);
 static DEFINE_SPINLOCK(list_lrus_lock);
 
+/*
+ * Insertion/deletion to the list_lrus list must be atomic (nobody expects
+ * list_lru_destroy to block), but we still want to sleep while iterating over
+ * the list (e.g. to allocate).
+ *
+ * To make it possible we employ the fact that list removals don't require the
+ * caller to know the list to delete the item from. As a result, we can move
+ * list_lrus which we walked over to a temporary list and iterate over the
+ * list_lrus list releasing the lock whenever necessary until it empties. When
+ * we are done, we put all the elements we removed from the list during the
+ * walk back to the list_lrus list.
+ *
+ * The list_lrus_walk_mutex is used to synchronize concurrent walkers.
+ */
+static DEFINE_MUTEX(list_lrus_walk_mutex);
+
 static void list_lru_register(struct list_lru *lru)
 {
 	spin_lock(&list_lrus_lock);
@@ -37,16 +54,44 @@ static void list_lru_unregister(struct list_lru *lru)
 }
 #endif /* CONFIG_MEMCG_KMEM */
 
+static inline bool list_lru_memcg_aware(struct list_lru *lru)
+{
+#ifdef CONFIG_MEMCG_KMEM
+	return !!lru->node[0].memcg_lrus;
+#else
+	return false;
+#endif
+}
+
+static inline struct list_lru_one *
+lru_from_memcg_idx(struct list_lru_node *nlru, int idx)
+{
+	struct list_lru_one *l = &nlru->lru;
+
+#ifdef CONFIG_MEMCG_KMEM
+	/*
+	 * The lock protects memcg_lrus array from relocation
+	 * (see update_memcg_lru).
+	 */
+	lockdep_assert_held(&nlru->lock);
+	if (nlru->memcg_lrus && idx >= 0)
+		l = nlru->memcg_lrus->lru[idx];
+#endif
+	return l;
+}
+
 bool list_lru_add(struct list_lru *lru, struct list_head *item)
 {
 	int nid = page_to_nid(virt_to_page(item));
 	struct list_lru_node *nlru = &lru->node[nid];
+	struct mem_cgroup *memcg = mem_cgroup_from_kmem(item);
+	struct list_lru_one *l;
 
 	spin_lock(&nlru->lock);
-	WARN_ON_ONCE(nlru->nr_items < 0);
+	l = lru_from_memcg_idx(nlru, memcg_cache_id(memcg));
 	if (list_empty(item)) {
-		list_add_tail(item, &nlru->list);
-		nlru->nr_items++;
+		list_add_tail(item, &l->list);
+		l->nr_items++;
 		spin_unlock(&nlru->lock);
 		return true;
 	}
@@ -59,12 +104,14 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item)
 {
 	int nid = page_to_nid(virt_to_page(item));
 	struct list_lru_node *nlru = &lru->node[nid];
+	struct mem_cgroup *memcg = mem_cgroup_from_kmem(item);
+	struct list_lru_one *l;
 
 	spin_lock(&nlru->lock);
+	l = lru_from_memcg_idx(nlru, memcg_cache_id(memcg));
 	if (!list_empty(item)) {
 		list_del_init(item);
-		nlru->nr_items--;
-		WARN_ON_ONCE(nlru->nr_items < 0);
+		l->nr_items--;
 		spin_unlock(&nlru->lock);
 		return true;
 	}
@@ -73,33 +120,60 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item)
 }
 EXPORT_SYMBOL_GPL(list_lru_del);
 
-unsigned long
-list_lru_count_node(struct list_lru *lru, int nid)
+static unsigned long __list_lru_count_one(struct list_lru *lru,
+					  int nid, int memcg_idx)
 {
-	unsigned long count = 0;
 	struct list_lru_node *nlru = &lru->node[nid];
+	struct list_lru_one *l;
+	long count;
 
 	spin_lock(&nlru->lock);
-	WARN_ON_ONCE(nlru->nr_items < 0);
-	count += nlru->nr_items;
+	l = lru_from_memcg_idx(nlru, memcg_idx);
+	count = l->nr_items;
 	spin_unlock(&nlru->lock);
 
+	return count > 0 ? count : 0;
+}
+
+unsigned long list_lru_count_one(struct list_lru *lru,
+				 int nid, struct mem_cgroup *memcg)
+{
+	return __list_lru_count_one(lru, nid, memcg_cache_id(memcg));
+}
+EXPORT_SYMBOL_GPL(list_lru_count_one);
+
+unsigned long list_lru_count(struct list_lru *lru)
+{
+	long count = 0;
+	int nid, memcg_idx;
+
+	for_each_node_state(nid, N_NORMAL_MEMORY) {
+		count += __list_lru_count_one(lru, nid, -1);
+		if (!list_lru_memcg_aware(lru))
+			continue;
+		for (memcg_idx = 0;
+		     memcg_idx < memcg_max_cache_ids; memcg_idx++)
+			count += __list_lru_count_one(lru, nid, memcg_idx);
+	}
 	return count;
 }
-EXPORT_SYMBOL_GPL(list_lru_count_node);
+EXPORT_SYMBOL_GPL(list_lru_count);
 
-unsigned long
-list_lru_walk_node(struct list_lru *lru, int nid, list_lru_walk_cb isolate,
-		   void *cb_arg, unsigned long *nr_to_walk)
+static unsigned long
+__list_lru_walk_one(struct list_lru *lru, int nid, int memcg_idx,
+		    list_lru_walk_cb isolate, void *cb_arg,
+		    unsigned long *nr_to_walk)
 {
 
-	struct list_lru_node	*nlru = &lru->node[nid];
+	struct list_lru_node *nlru = &lru->node[nid];
+	struct list_lru_one *l;
 	struct list_head *item, *n;
 	unsigned long isolated = 0;
 
 	spin_lock(&nlru->lock);
+	l = lru_from_memcg_idx(nlru, memcg_idx);
 restart:
-	list_for_each_safe(item, n, &nlru->list) {
+	list_for_each_safe(item, n, &l->list) {
 		enum lru_status ret;
 
 		/*
@@ -115,8 +189,7 @@ restart:
 		case LRU_REMOVED_RETRY:
 			assert_spin_locked(&nlru->lock);
 		case LRU_REMOVED:
-			nlru->nr_items--;
-			WARN_ON_ONCE(nlru->nr_items < 0);
+			l->nr_items--;
 			isolated++;
 			/*
 			 * If the lru lock has been dropped, our list
@@ -127,7 +200,7 @@ restart:
 				goto restart;
 			break;
 		case LRU_ROTATE:
-			list_move_tail(item, &nlru->list);
+			list_move_tail(item, &l->list);
 			break;
 		case LRU_SKIP:
 			break;
@@ -146,12 +219,279 @@ restart:
 	spin_unlock(&nlru->lock);
 	return isolated;
 }
-EXPORT_SYMBOL_GPL(list_lru_walk_node);
 
-int list_lru_init_key(struct list_lru *lru, struct lock_class_key *key)
+unsigned long
+list_lru_walk_one(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
+		  list_lru_walk_cb isolate, void *cb_arg,
+		  unsigned long *nr_to_walk)
+{
+	return __list_lru_walk_one(lru, nid, memcg_cache_id(memcg),
+				   isolate, cb_arg, nr_to_walk);
+}
+EXPORT_SYMBOL_GPL(list_lru_walk_one);
+
+unsigned long list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
+			    void *cb_arg, unsigned long nr_to_walk)
+{
+	long isolated = 0;
+	int nid, memcg_idx;
+
+	for_each_node_state(nid, N_NORMAL_MEMORY) {
+		isolated += __list_lru_walk_one(lru, nid, -1,
+					isolate, cb_arg, &nr_to_walk);
+		if (nr_to_walk <= 0)
+			goto out;
+		if (!list_lru_memcg_aware(lru))
+			continue;
+		for (memcg_idx = 0;
+		     memcg_idx < memcg_max_cache_ids; memcg_idx++) {
+			isolated += __list_lru_walk_one(lru, nid, memcg_idx,
+						isolate, cb_arg, &nr_to_walk);
+			if (nr_to_walk <= 0)
+				goto out;
+		}
+	}
+out:
+	return isolated;
+}
+EXPORT_SYMBOL_GPL(list_lru_walk);
+
+static void init_one_lru(struct list_lru_one *l)
+{
+	INIT_LIST_HEAD(&l->list);
+	l->nr_items = 0;
+}
+
+#ifdef CONFIG_MEMCG_KMEM
+static void free_list_lru_memcg(struct list_lru_memcg *memcg_lrus)
+{
+	int i, nr;
+
+	if (memcg_lrus) {
+		nr = ksize(memcg_lrus) / sizeof(void *);
+		for (i = 0; i < nr; i++)
+			kfree(memcg_lrus->lru[i]);
+		kfree(memcg_lrus);
+	}
+}
+
+static struct list_lru_memcg *alloc_list_lru_memcg(int nr, int init_from)
+{
+	struct list_lru_memcg *memcg_lrus;
+	struct list_lru_one *l;
+	int i;
+
+	memcg_lrus = kmalloc(nr * sizeof(void *), GFP_KERNEL);
+	if (!memcg_lrus)
+		return NULL;
+
+	/* make sure free_list_lru_memcg won't dereference crap */
+	memset(memcg_lrus, 0, ksize(memcg_lrus));
+
+	for (i = init_from; i < nr; i++) {
+		l = kmalloc(sizeof(struct list_lru_one), GFP_KERNEL);
+		if (!l) {
+			free_list_lru_memcg(memcg_lrus);
+			return NULL;
+		}
+		init_one_lru(l);
+		memcg_lrus->lru[i] = l;
+	}
+	return memcg_lrus;
+}
+
+static void list_lru_destroy_memcg(struct list_lru *lru)
+{
+	int i;
+
+	for (i = 0; i < nr_node_ids; i++)
+		free_list_lru_memcg(lru->node[i].memcg_lrus);
+}
+
+static int list_lru_init_memcg(struct list_lru *lru)
+{
+	int i;
+
+	for (i = 0; i < nr_node_ids; i++) {
+		/*
+		 * memcg_max_cache_ids can be 0, but memcg_lrus won't be NULL
+		 * then, it will be equal to ZERO_SIZE_PTR.
+		 */
+		lru->node[i].memcg_lrus =
+			alloc_list_lru_memcg(memcg_max_cache_ids, 0);
+		if (!lru->node[i].memcg_lrus) {
+			list_lru_destroy_memcg(lru);
+			return -ENOMEM;
+		}
+	}
+	return 0;
+}
+
+static void update_memcg_lru(struct list_lru_node *nlru,
+			      struct list_lru_memcg *new)
+{
+	struct list_lru_memcg *old = nlru->memcg_lrus;
+
+	memcpy(new, old, memcg_max_cache_ids * sizeof(void *));
+
+	spin_lock(&nlru->lock);
+	nlru->memcg_lrus = new;
+	spin_unlock(&nlru->lock);
+
+	kfree(old);
+}
+
+/*
+ * This function is called from the memory cgroup core before increasing
+ * memcg_max_cache_ids. We must update all lrus to conform to the new size.
+ * The memcg_cache_id_space_sem must be held for writing.
+ */
+int memcg_update_all_list_lrus(int num_memcgs)
+{
+	LIST_HEAD(updated);
+	struct list_lru_memcg **memcg_lrus;
+	bool memcg_lrus_allocated = false;
+	int i, ret = 0;
+
+	memcg_lrus = kmalloc(nr_node_ids * sizeof(void *), GFP_KERNEL);
+	if (!memcg_lrus)
+		return -ENOMEM;
+
+	mutex_lock(&list_lrus_walk_mutex);
+	spin_lock(&list_lrus_lock);
+	while (!list_empty(&list_lrus)) {
+		struct list_lru *lru;
+
+		lru = list_first_entry(&list_lrus, struct list_lru, list);
+		if (!list_lru_memcg_aware(lru))
+			goto next;
+
+		if (memcg_lrus_allocated)
+			goto update;
+
+		spin_unlock(&list_lrus_lock);
+		for (i = 0; i < nr_node_ids; i++) {
+			memcg_lrus[i] = alloc_list_lru_memcg(num_memcgs,
+							memcg_max_cache_ids);
+			if (!memcg_lrus[i]) {
+				ret = -ENOMEM;
+				break;
+			}
+		}
+		/* even if failed, we need to free what was allocated */
+		memset(memcg_lrus + i, 0, (nr_node_ids - i) * sizeof(void *));
+		memcg_lrus_allocated = true;
+		spin_lock(&list_lrus_lock);
+
+		if (ret)
+			break;
+		/*
+		 * We released the lock so we must check if there are still
+		 * memcg-aware list_lrus left on the list.
+		 */
+		continue;
+update:
+		for (i = 0; i < nr_node_ids; i++)
+			update_memcg_lru(&lru->node[i], memcg_lrus[i]);
+		memcg_lrus_allocated = false;
+next:
+		list_move(&lru->list, &updated);
+	}
+	list_splice(&updated, &list_lrus);
+	spin_unlock(&list_lrus_lock);
+	mutex_unlock(&list_lrus_walk_mutex);
+
+	if (memcg_lrus_allocated) {
+		for (i = 0; i < nr_node_ids; i++)
+			free_list_lru_memcg(memcg_lrus[i]);
+	}
+	kfree(memcg_lrus);
+	return ret;
+}
+
+static bool reparent_memcg_lru(struct list_lru_node *nlru,
+			       int from_idx, int to_idx)
+{
+	const int max_batch = 32;
+	int batch = 0;
+	struct list_lru_one *from, *to;
+	bool done;
+
+	/*
+	 * We can't just splice the lists, because walkers can drop the lock
+	 * after removing an element from the list, but before decreasing
+	 * nr_items. Splicing could therefore result in permanent divergence
+	 * between nr_items and the actual number of elements on the list. So
+	 * we iterate over all elements and move them one by one accounting
+	 * nr_items accordingly. This way the race with a walker is still
+	 * possible, but nr_items will be fixed once the walker reacquires the
+	 * lock.
+	 */
+	spin_lock(&nlru->lock);
+	from = lru_from_memcg_idx(nlru, from_idx);
+	to = lru_from_memcg_idx(nlru, to_idx);
+	while (!list_empty(&from->list)) {
+		list_move(from->list.next, &to->list);
+		from->nr_items--;
+		to->nr_items++;
+		if (++batch >= max_batch)
+			break;
+	}
+	done = list_empty(&from->list);
+	spin_unlock(&nlru->lock);
+	return done;
+}
+
+/*
+ * When a memcg dies, there still might be elements on its list_lrus. We can't
+ * just leave them there, because we want to release its cache id. So we move
+ * them to its parent's lrus.
+ */
+void memcg_reparent_all_list_lrus(int from_idx, int to_idx)
+{
+	LIST_HEAD(reparented);
+	int i;
+
+	mutex_lock(&list_lrus_walk_mutex);
+	spin_lock(&list_lrus_lock);
+	while (!list_empty(&list_lrus)) {
+		struct list_lru *lru;
+		bool done = true;
+
+		lru = list_first_entry(&list_lrus, struct list_lru, list);
+		if (!list_lru_memcg_aware(lru))
+			goto next;
+
+		for (i = 0; i < nr_node_ids; i++)
+			if (!reparent_memcg_lru(&lru->node[i],
+						from_idx, to_idx))
+				done = false;
+next:
+		if (done)
+			list_move(&lru->list, &reparented);
+		cond_resched_lock(&list_lrus_lock);
+	}
+	list_splice(&reparented, &list_lrus);
+	spin_unlock(&list_lrus_lock);
+	mutex_unlock(&list_lrus_walk_mutex);
+}
+#else
+static int list_lru_init_memcg(struct list_lru *lru)
+{
+	return 0;
+}
+
+static void list_lru_destroy_memcg(struct list_lru *lru)
+{
+}
+#endif /* CONFIG_MEMCG_KMEM */
+
+int list_lru_init_key(struct list_lru *lru, bool memcg_aware,
+		      struct lock_class_key *key)
 {
 	int i;
 	size_t size = sizeof(*lru->node) * nr_node_ids;
+	int err = 0;
 
 	lru->node = kzalloc(size, GFP_KERNEL);
 	if (!lru->node)
@@ -161,17 +501,30 @@ int list_lru_init_key(struct list_lru *lru, struct lock_class_key *key)
 		spin_lock_init(&lru->node[i].lock);
 		if (key)
 			lockdep_set_class(&lru->node[i].lock, key);
-		INIT_LIST_HEAD(&lru->node[i].list);
-		lru->node[i].nr_items = 0;
+		init_one_lru(&lru->node[i].lru);
 	}
-	list_lru_register(lru);
-	return 0;
+
+	/*
+	 * Note, memcg_max_cache_ids must remain stable while we are
+	 * allocating per-memcg lrus *and* registering the list_lru,
+	 * otherwise memcg_update_all_list_lrus can skip our list_lru.
+	 */
+	memcg_lock_cache_id_space();
+	if (memcg_aware)
+		err = list_lru_init_memcg(lru);
+
+	if (!err)
+		list_lru_register(lru);
+
+	memcg_unlock_cache_id_space();
+	return err;
 }
 EXPORT_SYMBOL_GPL(list_lru_init_key);
 
 void list_lru_destroy(struct list_lru *lru)
 {
 	list_lru_unregister(lru);
+	list_lru_destroy_memcg(lru);
 	kfree(lru->node);
 }
 EXPORT_SYMBOL_GPL(list_lru_destroy);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0c6d412ae5a3..b82a6ea32ead 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2963,6 +2963,9 @@ static int memcg_alloc_cache_id(void)
 	mutex_unlock(&memcg_slab_mutex);
 
 	if (!err)
+		err = memcg_update_all_list_lrus(size);
+
+	if (!err)
 		memcg_max_cache_ids = size;
 	up_write(&memcg_cache_id_space_sem);
 
@@ -3150,6 +3153,15 @@ static void memcg_unregister_all_caches(struct mem_cgroup *memcg)
 	css_for_each_descendant_post(iter, &memcg->css)
 		mem_cgroup_from_css(iter)->kmemcg_id = parent_id;
 
+	/*
+	 * Move all elements from the dead cgroup's list_lrus to its parent's
+	 * so that we could release the id.
+	 *
+	 * This must be done strictly after we updated the cgroup's id in order
+	 * to guarantee no new elements will be added there afterwards.
+	 */
+	memcg_reparent_all_list_lrus(id, parent_id);
+
 	/* The id is not used anymore, free it so that it could be reused. */
 	memcg_free_cache_id(id);
 }
@@ -3411,6 +3423,30 @@ void __memcg_kmem_uncharge_pages(struct page *page, int order)
 	VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
 	memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
 }
+
+struct mem_cgroup *mem_cgroup_from_kmem(void *ptr)
+{
+	struct mem_cgroup *memcg = NULL;
+	struct page_cgroup *pc;
+	struct kmem_cache *cachep;
+	struct page *page;
+
+	if (!memcg_kmem_enabled())
+		return NULL;
+
+	page = virt_to_head_page(ptr);
+	if (PageSlab(page)) {
+		cachep = page->slab_cache;
+		if (!is_root_cache(cachep))
+			memcg = cachep->memcg_params->memcg;
+	} else {
+		/* page allocated with alloc_kmem_pages */
+		pc = lookup_page_cgroup(page);
+		if (PageCgroupUsed(pc))
+			memcg = pc->mem_cgroup;
+	}
+	return memcg;
+}
 #else
 static inline void memcg_unregister_all_caches(struct mem_cgroup *memcg)
 {
diff --git a/mm/workingset.c b/mm/workingset.c
index d4fa7fb10a52..f8aae7497723 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -399,7 +399,8 @@ static int __init workingset_init(void)
 {
 	int ret;
 
-	ret = list_lru_init_key(&workingset_shadow_nodes, &shadow_nodes_key);
+	ret = list_lru_init_key(&workingset_shadow_nodes, false,
+				&shadow_nodes_key);
 	if (ret)
 		goto err;
 	ret = register_shrinker(&workingset_shadow_shrinker);
-- 
1.7.10.4


* [PATCH -mm 14/14] fs: make shrinker memcg aware
  2014-09-21 15:14 [PATCH -mm 00/14] Per memcg slab shrinkers Vladimir Davydov
                   ` (12 preceding siblings ...)
  2014-09-21 15:14 ` [PATCH -mm 13/14] list_lru: introduce per-memcg lists Vladimir Davydov
@ 2014-09-21 15:14 ` Vladimir Davydov
  2014-09-21 16:00 ` [PATCH -mm 00/14] Per memcg slab shrinkers Tejun Heo
  2014-09-29  7:02 ` Vladimir Davydov
  15 siblings, 0 replies; 18+ messages in thread
From: Vladimir Davydov @ 2014-09-21 15:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Dave Chinner,
	Glauber Costa, Suleiman Souhlal, Kamezawa Hiroyuki, Tejun Heo,
	linux-kernel, linux-mm, cgroups

Now, to make any list_lru-based shrinker memcg-aware, it is enough to
initialize its list_lru as memcg-enabled. Let's do that for the generic
FS shrinker (super_block::s_shrink) and mark the shrinker memcg-aware.

There are other FS-specific shrinkers that use list_lru for storing
objects, such as the XFS and GFS2 dquot cache shrinkers, but since they
reclaim objects that are shared among different cgroups, there is no
point in making them memcg-aware.
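
To show why the shrinker callbacks need no further changes, here is a
simplified sketch; the real super_cache_count in fs/super.c does more,
e.g. it also accounts fs-specific objects:

	static unsigned long example_super_cache_count(struct shrinker *shrink,
						       struct shrink_control *sc)
	{
		struct super_block *sb = container_of(shrink, struct super_block,
						      s_shrink);

		/*
		 * sc carries both the node and the memcg to scan; the
		 * memcg-aware list_lru picks the right per-memcg list
		 * internally.
		 */
		return list_lru_shrink_count(&sb->s_dentry_lru, sc) +
		       list_lru_shrink_count(&sb->s_inode_lru, sc);
	}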

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 fs/super.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index a82e97b0b8b9..9c765d3a14f3 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -189,9 +189,9 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 	INIT_HLIST_BL_HEAD(&s->s_anon);
 	INIT_LIST_HEAD(&s->s_inodes);
 
-	if (list_lru_init(&s->s_dentry_lru, false))
+	if (list_lru_init(&s->s_dentry_lru, true))
 		goto fail;
-	if (list_lru_init(&s->s_inode_lru, false))
+	if (list_lru_init(&s->s_inode_lru, true))
 		goto fail;
 
 	init_rwsem(&s->s_umount);
@@ -227,7 +227,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 	s->s_shrink.scan_objects = super_cache_scan;
 	s->s_shrink.count_objects = super_cache_count;
 	s->s_shrink.batch = 1024;
-	s->s_shrink.flags = SHRINKER_NUMA_AWARE;
+	s->s_shrink.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE;
 	return s;
 
 fail:
-- 
1.7.10.4


* Re: [PATCH -mm 00/14] Per memcg slab shrinkers
  2014-09-21 15:14 [PATCH -mm 00/14] Per memcg slab shrinkers Vladimir Davydov
                   ` (13 preceding siblings ...)
  2014-09-21 15:14 ` [PATCH -mm 14/14] fs: make shrinker memcg aware Vladimir Davydov
@ 2014-09-21 16:00 ` Tejun Heo
  2014-09-22  7:04   ` Vladimir Davydov
  2014-09-29  7:02 ` Vladimir Davydov
  15 siblings, 1 reply; 18+ messages in thread
From: Tejun Heo @ 2014-09-21 16:00 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Greg Thelen,
	Dave Chinner, Glauber Costa, Suleiman Souhlal, Kamezawa Hiroyuki,
	linux-kernel, linux-mm, cgroups

Hello,

On Sun, Sep 21, 2014 at 07:14:32PM +0400, Vladimir Davydov wrote:
...
> list. This is really important, because this allows us to release
> memcg_cache_id used for indexing in per-memcg arrays. If we don't do
> this, the arrays will grow uncontrollably, which is really bad. Note, in
> comparison to user memory reparenting, which Johannes is going to get

I don't know the code well and haven't read the patches and could
easily be completely off the mark, but, if the size of the slab array is
the only issue, wouldn't it be easier to separate that part out?  The
indexing is only necessary for allocating new items, right?  Can't
that part be shut down and the index freed on offline and the rest stay
till release?  Things like reparenting tend to add a fair amount of
complexity and hot path overheads which aren't necessary otherwise.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH -mm 00/14] Per memcg slab shrinkers
  2014-09-21 16:00 ` [PATCH -mm 00/14] Per memcg slab shrinkers Tejun Heo
@ 2014-09-22  7:04   ` Vladimir Davydov
  0 siblings, 0 replies; 18+ messages in thread
From: Vladimir Davydov @ 2014-09-22  7:04 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Greg Thelen,
	Dave Chinner, Glauber Costa, Suleiman Souhlal, Kamezawa Hiroyuki,
	linux-kernel, linux-mm, cgroups

Hi Tejun,

On Sun, Sep 21, 2014 at 12:00:12PM -0400, Tejun Heo wrote:
> On Sun, Sep 21, 2014 at 07:14:32PM +0400, Vladimir Davydov wrote:
> ...
> > list. This is really important, because this allows us to release
> > memcg_cache_id used for indexing in per-memcg arrays. If we don't do
> > this, the arrays will grow uncontrollably, which is really bad. Note, in
> > comparison to user memory reparenting, which Johannes is going to get
> 
> I don't know the code well and haven't read the patches, so I could
> easily be completely off the mark, but if the size of the slab array is
> the only issue, wouldn't it be easier to separate that part out?  The
> indexing is only necessary for allocating new items, right?  Can't
> that part be shut down and the index freed on offline, with the rest
> staying around until release?

That's exactly what I did in this set. I release the cache index on css
offline, but I don't reparent kmem charges or merge kmem slabs, because
we only need to index caches on allocation. So kmem caches corresponding
to a dead memory cgroup will hang around until their last object is
freed, holding a css reference - just like swap charges do, and like
user memory charges will after Johannes' rework.
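
In other words (a hypothetical sketch, not the actual memcg code): a
dead memcg's kmem cache keeps the css pinned, and the pin is dropped
only when the cache - and hence its last object - is gone:

#include <linux/cgroup.h>
#include <linux/slab.h>

struct my_memcg_cache {
	struct kmem_cache		*cache;
	struct cgroup_subsys_state	*owner_css;	/* css of the owning memcg */
};

static void my_memcg_cache_create(struct my_memcg_cache *mc,
				  struct cgroup_subsys_state *css)
{
	mc->owner_css = css;
	css_get(css);			/* keeps the css dangling past offline */
}

static void my_memcg_cache_destroy(struct my_memcg_cache *mc)
{
	kmem_cache_destroy(mc->cache);	/* only called once the cache is empty */
	css_put(mc->owner_css);		/* may drop the last reference */
}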

However, we still need the cache index for list_lru, which is made
per-memcg in this patch set. The point is that objects accounted to a
memory cgroup can be added to or removed from lru lists even after the
cgroup's death. If we simply redirected the cache index of a dead cgroup
to its parent's, new operations would hit an active list_lru, but there
might still be objects left on the list_lru of the dead cgroup. We have
to move them in order to release the cache index.
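
To illustrate with deliberately simplified, hypothetical structures
(this is not the list_lru layout of patch 13): additions and removals
pick the per-memcg list by the cache index of the memcg the object is
accounted to, so a dead memcg's index stays busy as long as anything is
left on its list:

#include <linux/list.h>
#include <linux/spinlock.h>

struct my_memcg_lru {
	struct list_head	list;	/* objects accounted to one memcg */
	long			nr_items;
};

struct my_lru {
	spinlock_t		lock;
	struct my_memcg_lru	**memcg;	/* indexed by memcg_cache_id */
};

static void my_lru_add(struct my_lru *lru, int memcg_id,
		       struct list_head *item)
{
	spin_lock(&lru->lock);
	/* memcg_id must stay valid while objects of that memcg exist */
	list_add_tail(item, &lru->memcg[memcg_id]->list);
	lru->memcg[memcg_id]->nr_items++;
	spin_unlock(&lru->lock);
}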

> Things like reparenting tend to add a fair amount of complexity and
> hot-path overhead which isn't necessary otherwise.

AFAIU, there is no overhead added to hot paths *due to reparenting* in
this set. And the code is way simpler than that of user charge
reparenting, because the usage scenario of list_lru is much simpler than
that of lruvec: e.g. we don't have to retry, we are guaranteed to
succeed after the first scan. Just look at the two functions doing the
work - memcg_reparent_all_list_lrus and reparent_memcg_lru in patch 13 -
they are not that complex, are they?

I'd like to emphasize this once again: the reparenting I'm talking
about is not about charges - we are not waiting until the kmem res
counter drops to 0. I just move list_lru items from one list to another
on css offline, without retries, waiting, or odd checks here and there.
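
With the simplified structures (and headers) from the sketch above, the
whole operation amounts to something like this (hypothetical code, not
reparent_memcg_lru from patch 13):

static void my_lru_reparent(struct my_lru *lru, int dead_id, int parent_id)
{
	struct my_memcg_lru *src = lru->memcg[dead_id];
	struct my_memcg_lru *dst = lru->memcg[parent_id];

	spin_lock(&lru->lock);
	/* one pass, no retries: everything moves to the parent's list */
	list_splice_tail_init(&src->list, &dst->list);
	dst->nr_items += src->nr_items;
	src->nr_items = 0;
	lru->memcg[dead_id] = NULL;	/* dead_id can now be reused (freeing src omitted) */
	spin_unlock(&lru->lock);
}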

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH -mm 00/14] Per memcg slab shrinkers
  2014-09-21 15:14 [PATCH -mm 00/14] Per memcg slab shrinkers Vladimir Davydov
                   ` (14 preceding siblings ...)
  2014-09-21 16:00 ` [PATCH -mm 00/14] Per memcg slab shrinkers Tejun Heo
@ 2014-09-29  7:02 ` Vladimir Davydov
  15 siblings, 0 replies; 18+ messages in thread
From: Vladimir Davydov @ 2014-09-29  7:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Greg Thelen, Dave Chinner,
	Glauber Costa, Suleiman Souhlal, Kamezawa Hiroyuki, Tejun Heo,
	linux-kernel, linux-mm, cgroups

ping

On Sun, Sep 21, 2014 at 07:14:32PM +0400, Vladimir Davydov wrote:
> Hi,
> 
> Kmem accounting of memcg is unusable now, because it lacks slab shrinker
> support. That means when we hit the limit we will get ENOMEM w/o any
> chance to recover. What we should do then is to call shrink_slab, which
> would reclaim old inode/dentry caches from this cgroup. This is what
> this patch set is intended to do.
> 
> Basically, it does two things. First, it introduces the notion of
> per-memcg slab shrinker. A shrinker that wants to reclaim objects per
> cgroup should mark itself as SHRINKER_MEMCG_AWARE. Then it will be
> passed the memory cgroup to scan from in shrink_control->memcg. For such
> shrinkers shrink_slab iterates over the whole cgroup subtree under the
> target cgroup and calls the shrinker for each kmem-active memory cgroup.
> 
> Secondly, this patch set makes the list_lru structure per-memcg. It's
> done transparently to list_lru users - everything they have to do is to
> tell list_lru_init that they want memcg-aware list_lru. Then the
> list_lru will automatically distribute objects among per-memcg lists
> basing on which cgroup the object is accounted to. This way to make FS
> shrinkers (icache, dcache) memcg-aware we only need to make them use
> memcg-aware list_lru, and this is what this patch set does.
> 
> The main difference of this patch set from my previous attempts to push
> memcg aware shrinkers is in how it handles css offline. Now we don't let
> list_lrus corresponding to dead memory cgroups hang around till all
> objects are freed. Instead we move lru items to the parent cgroup's lru
> list. This is really important, because this allows us to release
> memcg_cache_id used for indexing in per-memcg arrays. If we don't do
> this, the arrays will grow uncontrollably, which is really bad. Note, in
> comparison to user memory reparenting, which Johannes is going to get
> rid of, it's not racy and much easier to implement although it does
> impose some limitations on how list_lru locking can be implemented.
> Another difference is that it doesn't reparent charges, only list_lru
> entries - the css will be dangling until the last kmem object is freed.
> 
> As before, this patch set only enables per-memcg kmem reclaim when the
> pressure goes from memory.limit, not from memory.kmem.limit. Handling
> memory.kmem.limit is going to be tricky due to GFP_NOFS allocations, it
> will probably require a sort of soft limit to work properly. I'm leaving
> this for future work.
> 
> The patch set basically consists of three main parts and organized as
> follows:
> 
>  - Patches 1-3 implement per-memcg shrinker core with patches 1 and 2
>    preparing list_lru users for upcoming changes and patch 3 tuning
>    shrink_slab.
> 
>  - Patches 4-10 make memcg core release cache ids on offline doing a bit
>    of cleanup in the meanwhile. This is easy, because kmem_caches don't
>    need the cache id after css offline since there can't be allocations
>    going from a dead memcg. Note that most of these patches (namely 4-6,
>    and 8) were once merged, but then I decided to drop them, because I
>    didn't know how to deal with list_lrus at that time (see
>    https://lkml.org/lkml/2014/7/23/218).
> 
>  - Finally patches 11-14 make list_lru per-memcg and mark FS shrinkers
>    as memcg-aware. This is the most difficult part of this patch set
>    with patch 13 (unlucky :-) doing the most important work.
> 
> Reviews are more than welcome.

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2014-09-29  7:03 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-21 15:14 [PATCH -mm 00/14] Per memcg slab shrinkers Vladimir Davydov
2014-09-21 15:14 ` [PATCH -mm 01/14] list_lru: introduce list_lru_shrink_{count,walk} Vladimir Davydov
2014-09-21 15:14 ` [PATCH -mm 02/14] fs: consolidate {nr,free}_cached_objects args in shrink_control Vladimir Davydov
2014-09-21 15:14 ` [PATCH -mm 03/14] vmscan: shrink slab on memcg pressure Vladimir Davydov
2014-09-21 15:14 ` [PATCH -mm 04/14] memcg: use mem_cgroup_id for per memcg cache naming Vladimir Davydov
2014-09-21 15:14 ` [PATCH -mm 05/14] memcg: add pointer to owner cache to memcg_cache_params Vladimir Davydov
2014-09-21 15:14 ` [PATCH -mm 06/14] memcg: keep all children of each root cache on a list Vladimir Davydov
2014-09-21 15:14 ` [PATCH -mm 07/14] memcg: update memcg_caches array entries on the slab side Vladimir Davydov
2014-09-21 15:14 ` [PATCH -mm 08/14] memcg: release memcg_cache_id on css offline Vladimir Davydov
2014-09-21 15:14 ` [PATCH -mm 09/14] memcg: rename some cache id related variables Vladimir Davydov
2014-09-21 15:14 ` [PATCH -mm 10/14] memcg: add rwsem to sync against memcg_caches arrays relocation Vladimir Davydov
2014-09-21 15:14 ` [PATCH -mm 11/14] list_lru: get rid of ->active_nodes Vladimir Davydov
2014-09-21 15:14 ` [PATCH -mm 12/14] list_lru: organize all list_lrus to list Vladimir Davydov
2014-09-21 15:14 ` [PATCH -mm 13/14] list_lru: introduce per-memcg lists Vladimir Davydov
2014-09-21 15:14 ` [PATCH -mm 14/14] fs: make shrinker memcg aware Vladimir Davydov
2014-09-21 16:00 ` [PATCH -mm 00/14] Per memcg slab shrinkers Tejun Heo
2014-09-22  7:04   ` Vladimir Davydov
2014-09-29  7:02 ` Vladimir Davydov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).