All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/7] memcg targeted shrinking
@ 2013-02-08 13:07 ` Glauber Costa
  0 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-08 13:07 UTC (permalink / raw)
  To: linux-mm
  Cc: cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel

This patchset implements targeted shrinking for memcg when kmem limits are
present. So far, we've been accounting kernel objects but failing allocations
when short of memory. This is because our only option would be to call the
global shrinker, depleting objects from all caches and breaking isolation.

This patchset builds upon the recent work from David Chinner
(http://oss.sgi.com/archives/xfs/2012-11/msg00643.html) to implement NUMA
aware per-node LRUs. I build heavily on its API, and its presence is implied.

The main idea is to associate per-memcg lists with each of the LRUs. The main
LRU still provides a single entry point and when adding or removing an element
from the LRU, we use the page information to figure out which memcg it belongs
to and relay it to the right list.

This patchset is still not perfect, and some uses cases still need to be
dealt with. But I wanted to get this out in the open sooner rather than
later. In particular, I have the following (noncomprehensive) todo list:

TODO:
* shrink dead memcgs when global pressure kicks in.
* balance global reclaim among memcgs.
* improve testing and reliability (I am still seeing some stalls in some cases)

Glauber Costa (7):
  vmscan: also shrink slab in memcg pressure
  memcg,list_lru: duplicate LRUs upon kmemcg creation
  lru: add an element to a memcg list
  list_lru: also include memcg lists in counts and scans
  list_lru: per-memcg walks
  super: targeted memcg reclaim
  memcg: per-memcg kmem shrinking

 fs/dcache.c                |   7 +-
 fs/inode.c                 |   6 +-
 fs/internal.h              |   5 +-
 fs/super.c                 |  37 ++++--
 include/linux/list_lru.h   |  81 +++++++++++-
 include/linux/memcontrol.h |  34 +++++
 include/linux/shrinker.h   |   4 +
 include/linux/swap.h       |   2 +
 lib/list_lru.c             | 301 ++++++++++++++++++++++++++++++++++++++-------
 mm/memcontrol.c            | 271 ++++++++++++++++++++++++++++++++++++++--
 mm/slab_common.c           |   1 -
 mm/vmscan.c                |  78 +++++++++++-
 12 files changed, 747 insertions(+), 80 deletions(-)

-- 
1.8.1


^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 0/7] memcg targeted shrinking
@ 2013-02-08 13:07 ` Glauber Costa
  0 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-08 13:07 UTC (permalink / raw)
  To: linux-mm
  Cc: cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel

This patchset implements targeted shrinking for memcg when kmem limits are
present. So far, we've been accounting kernel objects but failing allocations
when short of memory. This is because our only option would be to call the
global shrinker, depleting objects from all caches and breaking isolation.

This patchset builds upon the recent work from David Chinner
(http://oss.sgi.com/archives/xfs/2012-11/msg00643.html) to implement NUMA
aware per-node LRUs. I build heavily on its API, and its presence is implied.

The main idea is to associate per-memcg lists with each of the LRUs. The main
LRU still provides a single entry point and when adding or removing an element
from the LRU, we use the page information to figure out which memcg it belongs
to and relay it to the right list.

This patchset is still not perfect, and some uses cases still need to be
dealt with. But I wanted to get this out in the open sooner rather than
later. In particular, I have the following (noncomprehensive) todo list:

TODO:
* shrink dead memcgs when global pressure kicks in.
* balance global reclaim among memcgs.
* improve testing and reliability (I am still seeing some stalls in some cases)

Glauber Costa (7):
  vmscan: also shrink slab in memcg pressure
  memcg,list_lru: duplicate LRUs upon kmemcg creation
  lru: add an element to a memcg list
  list_lru: also include memcg lists in counts and scans
  list_lru: per-memcg walks
  super: targeted memcg reclaim
  memcg: per-memcg kmem shrinking

 fs/dcache.c                |   7 +-
 fs/inode.c                 |   6 +-
 fs/internal.h              |   5 +-
 fs/super.c                 |  37 ++++--
 include/linux/list_lru.h   |  81 +++++++++++-
 include/linux/memcontrol.h |  34 +++++
 include/linux/shrinker.h   |   4 +
 include/linux/swap.h       |   2 +
 lib/list_lru.c             | 301 ++++++++++++++++++++++++++++++++++++++-------
 mm/memcontrol.c            | 271 ++++++++++++++++++++++++++++++++++++++--
 mm/slab_common.c           |   1 -
 mm/vmscan.c                |  78 +++++++++++-
 12 files changed, 747 insertions(+), 80 deletions(-)

-- 
1.8.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 0/7] memcg targeted shrinking
@ 2013-02-08 13:07 ` Glauber Costa
  0 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-08 13:07 UTC (permalink / raw)
  To: linux-mm
  Cc: cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel

This patchset implements targeted shrinking for memcg when kmem limits are
present. So far, we've been accounting kernel objects but failing allocations
when short of memory. This is because our only option would be to call the
global shrinker, depleting objects from all caches and breaking isolation.

This patchset builds upon the recent work from David Chinner
(http://oss.sgi.com/archives/xfs/2012-11/msg00643.html) to implement NUMA
aware per-node LRUs. I build heavily on its API, and its presence is implied.

The main idea is to associate per-memcg lists with each of the LRUs. The main
LRU still provides a single entry point and when adding or removing an element
from the LRU, we use the page information to figure out which memcg it belongs
to and relay it to the right list.

This patchset is still not perfect, and some uses cases still need to be
dealt with. But I wanted to get this out in the open sooner rather than
later. In particular, I have the following (noncomprehensive) todo list:

TODO:
* shrink dead memcgs when global pressure kicks in.
* balance global reclaim among memcgs.
* improve testing and reliability (I am still seeing some stalls in some cases)

Glauber Costa (7):
  vmscan: also shrink slab in memcg pressure
  memcg,list_lru: duplicate LRUs upon kmemcg creation
  lru: add an element to a memcg list
  list_lru: also include memcg lists in counts and scans
  list_lru: per-memcg walks
  super: targeted memcg reclaim
  memcg: per-memcg kmem shrinking

 fs/dcache.c                |   7 +-
 fs/inode.c                 |   6 +-
 fs/internal.h              |   5 +-
 fs/super.c                 |  37 ++++--
 include/linux/list_lru.h   |  81 +++++++++++-
 include/linux/memcontrol.h |  34 +++++
 include/linux/shrinker.h   |   4 +
 include/linux/swap.h       |   2 +
 lib/list_lru.c             | 301 ++++++++++++++++++++++++++++++++++++++-------
 mm/memcontrol.c            | 271 ++++++++++++++++++++++++++++++++++++++--
 mm/slab_common.c           |   1 -
 mm/vmscan.c                |  78 +++++++++++-
 12 files changed, 747 insertions(+), 80 deletions(-)

-- 
1.8.1


^ permalink raw reply	[flat|nested] 60+ messages in thread

* [PATCH 1/7] vmscan: also shrink slab in memcg pressure
  2013-02-08 13:07 ` Glauber Costa
  (?)
@ 2013-02-08 13:07   ` Glauber Costa
  -1 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-08 13:07 UTC (permalink / raw)
  To: linux-mm
  Cc: cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel, Glauber Costa,
	Dave Chinner, Mel Gorman, Rik van Riel, Hugh Dickins

Without the surrounding infrastructure, this patch is a bit of a hammer:
it will basically shrink objects from all memcgs under memcg pressure.
At least, however, we will keep the scan limited to the shrinkers marked
as per-memcg.

Future patches will implement the in-shrinker logic to filter objects
based on its memcg association.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/memcontrol.h | 16 ++++++++++++++++
 include/linux/shrinker.h   |  4 ++++
 mm/memcontrol.c            | 11 ++++++++++-
 mm/vmscan.c                | 41 ++++++++++++++++++++++++++++++++++++++---
 4 files changed, 68 insertions(+), 4 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0108a56..b7de557 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -200,6 +200,9 @@ void mem_cgroup_split_huge_fixup(struct page *head);
 bool mem_cgroup_bad_page_check(struct page *page);
 void mem_cgroup_print_bad_page(struct page *page);
 #endif
+
+unsigned long
+memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone);
 #else /* CONFIG_MEMCG */
 struct mem_cgroup;
 
@@ -384,6 +387,11 @@ static inline void mem_cgroup_replace_page_cache(struct page *oldpage,
 				struct page *newpage)
 {
 }
+
+static inline unsigned long
+memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone)
+{
+}
 #endif /* CONFIG_MEMCG */
 
 #if !defined(CONFIG_MEMCG) || !defined(CONFIG_DEBUG_VM)
@@ -436,6 +444,8 @@ static inline bool memcg_kmem_enabled(void)
 	return static_key_false(&memcg_kmem_enabled_key);
 }
 
+bool memcg_kmem_is_active(struct mem_cgroup *memcg);
+
 /*
  * In general, we'll do everything in our power to not incur in any overhead
  * for non-memcg users for the kmem functions. Not even a function call, if we
@@ -569,6 +579,12 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
 	return __memcg_kmem_get_cache(cachep, gfp);
 }
 #else
+
+static inline bool memcg_kmem_is_active(struct mem_cgroup *memcg)
+{
+	return false;
+}
+
 #define for_each_memcg_cache_index(_idx)	\
 	for (; NULL; )
 
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index d4636a0..a767f2e 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -20,6 +20,9 @@ struct shrink_control {
 
 	/* shrink from these nodes */
 	nodemask_t nodes_to_scan;
+
+	/* reclaim from this memcg only (if not NULL) */
+	struct mem_cgroup *target_mem_cgroup;
 };
 
 /*
@@ -45,6 +48,7 @@ struct shrinker {
 
 	int seeks;	/* seeks to recreate an obj */
 	long batch;	/* reclaim batch size, 0 = default */
+	bool memcg_shrinker;
 
 	/* These are for internal use */
 	struct list_head list;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3817460..b1d4dfa 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -442,7 +442,7 @@ static inline void memcg_kmem_set_active(struct mem_cgroup *memcg)
 	set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
 }
 
-static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
+bool memcg_kmem_is_active(struct mem_cgroup *memcg)
 {
 	return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
 }
@@ -991,6 +991,15 @@ mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg, int nid, int zid,
 	return ret;
 }
 
+unsigned long
+memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone)
+{
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+
+	return mem_cgroup_zone_nr_lru_pages(memcg, nid, zid, LRU_ALL);
+}
+
 static unsigned long
 mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
 			int nid, unsigned int lru_mask)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6d96280..8af0e2b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -138,11 +138,42 @@ static bool global_reclaim(struct scan_control *sc)
 {
 	return !sc->target_mem_cgroup;
 }
+
+/*
+ * kmem reclaim should usually not be triggered when we are doing targetted
+ * reclaim. It is only valid when global reclaim is triggered, or when the
+ * underlying memcg has kmem objects.
+ */
+static bool has_kmem_reclaim(struct scan_control *sc)
+{
+	return !sc->target_mem_cgroup ||
+	(sc->target_mem_cgroup && memcg_kmem_is_active(sc->target_mem_cgroup));
+}
+
+static unsigned long
+zone_nr_reclaimable_pages(struct scan_control *sc, struct zone *zone)
+{
+	if (global_reclaim(sc))
+		return zone_reclaimable_pages(zone);
+	return memcg_zone_reclaimable_pages(sc->target_mem_cgroup, zone);
+}
+
 #else
 static bool global_reclaim(struct scan_control *sc)
 {
 	return true;
 }
+
+static bool has_kmem_reclaim(struct scan_control *sc)
+{
+	return true;
+}
+
+static unsigned long
+zone_nr_reclaimable_pages(struct scan_control *sc, struct zone *zone)
+{
+	return zone_reclaimable_pages(zone);
+}
 #endif
 
 static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru)
@@ -221,6 +252,9 @@ unsigned long shrink_slab(struct shrink_control *sc,
 		long batch_size = shrinker->batch ? shrinker->batch
 						  : SHRINK_BATCH;
 
+		if (!shrinker->memcg_shrinker && sc->target_mem_cgroup)
+			continue;
+
 		max_pass = shrinker->count_objects(shrinker, sc);
 		WARN_ON(max_pass < 0);
 		if (max_pass <= 0)
@@ -2170,9 +2204,9 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 
 		/*
 		 * Don't shrink slabs when reclaiming memory from
-		 * over limit cgroups
+		 * over limit cgroups, unless we know they have kmem objects
 		 */
-		if (global_reclaim(sc)) {
+		if (has_kmem_reclaim(sc)) {
 			unsigned long lru_pages = 0;
 
 			nodes_clear(shrink->nodes_to_scan);
@@ -2181,7 +2215,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 				if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 					continue;
 
-				lru_pages += zone_reclaimable_pages(zone);
+				lru_pages += zone_nr_reclaimable_pages(sc, zone);
 				node_set(zone_to_nid(zone),
 					 shrink->nodes_to_scan);
 			}
@@ -2443,6 +2477,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 	};
 	struct shrink_control shrink = {
 		.gfp_mask = sc.gfp_mask,
+		.target_mem_cgroup = memcg,
 	};
 
 	/*
-- 
1.8.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 1/7] vmscan: also shrink slab in memcg pressure
@ 2013-02-08 13:07   ` Glauber Costa
  0 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-08 13:07 UTC (permalink / raw)
  To: linux-mm
  Cc: cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel, Glauber Costa,
	Dave Chinner, Mel Gorman, Rik van Riel, Hugh Dickins

Without the surrounding infrastructure, this patch is a bit of a hammer:
it will basically shrink objects from all memcgs under memcg pressure.
At least, however, we will keep the scan limited to the shrinkers marked
as per-memcg.

Future patches will implement the in-shrinker logic to filter objects
based on its memcg association.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/memcontrol.h | 16 ++++++++++++++++
 include/linux/shrinker.h   |  4 ++++
 mm/memcontrol.c            | 11 ++++++++++-
 mm/vmscan.c                | 41 ++++++++++++++++++++++++++++++++++++++---
 4 files changed, 68 insertions(+), 4 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0108a56..b7de557 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -200,6 +200,9 @@ void mem_cgroup_split_huge_fixup(struct page *head);
 bool mem_cgroup_bad_page_check(struct page *page);
 void mem_cgroup_print_bad_page(struct page *page);
 #endif
+
+unsigned long
+memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone);
 #else /* CONFIG_MEMCG */
 struct mem_cgroup;
 
@@ -384,6 +387,11 @@ static inline void mem_cgroup_replace_page_cache(struct page *oldpage,
 				struct page *newpage)
 {
 }
+
+static inline unsigned long
+memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone)
+{
+}
 #endif /* CONFIG_MEMCG */
 
 #if !defined(CONFIG_MEMCG) || !defined(CONFIG_DEBUG_VM)
@@ -436,6 +444,8 @@ static inline bool memcg_kmem_enabled(void)
 	return static_key_false(&memcg_kmem_enabled_key);
 }
 
+bool memcg_kmem_is_active(struct mem_cgroup *memcg);
+
 /*
  * In general, we'll do everything in our power to not incur in any overhead
  * for non-memcg users for the kmem functions. Not even a function call, if we
@@ -569,6 +579,12 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
 	return __memcg_kmem_get_cache(cachep, gfp);
 }
 #else
+
+static inline bool memcg_kmem_is_active(struct mem_cgroup *memcg)
+{
+	return false;
+}
+
 #define for_each_memcg_cache_index(_idx)	\
 	for (; NULL; )
 
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index d4636a0..a767f2e 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -20,6 +20,9 @@ struct shrink_control {
 
 	/* shrink from these nodes */
 	nodemask_t nodes_to_scan;
+
+	/* reclaim from this memcg only (if not NULL) */
+	struct mem_cgroup *target_mem_cgroup;
 };
 
 /*
@@ -45,6 +48,7 @@ struct shrinker {
 
 	int seeks;	/* seeks to recreate an obj */
 	long batch;	/* reclaim batch size, 0 = default */
+	bool memcg_shrinker;
 
 	/* These are for internal use */
 	struct list_head list;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3817460..b1d4dfa 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -442,7 +442,7 @@ static inline void memcg_kmem_set_active(struct mem_cgroup *memcg)
 	set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
 }
 
-static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
+bool memcg_kmem_is_active(struct mem_cgroup *memcg)
 {
 	return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
 }
@@ -991,6 +991,15 @@ mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg, int nid, int zid,
 	return ret;
 }
 
+unsigned long
+memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone)
+{
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+
+	return mem_cgroup_zone_nr_lru_pages(memcg, nid, zid, LRU_ALL);
+}
+
 static unsigned long
 mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
 			int nid, unsigned int lru_mask)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6d96280..8af0e2b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -138,11 +138,42 @@ static bool global_reclaim(struct scan_control *sc)
 {
 	return !sc->target_mem_cgroup;
 }
+
+/*
+ * kmem reclaim should usually not be triggered when we are doing targetted
+ * reclaim. It is only valid when global reclaim is triggered, or when the
+ * underlying memcg has kmem objects.
+ */
+static bool has_kmem_reclaim(struct scan_control *sc)
+{
+	return !sc->target_mem_cgroup ||
+	(sc->target_mem_cgroup && memcg_kmem_is_active(sc->target_mem_cgroup));
+}
+
+static unsigned long
+zone_nr_reclaimable_pages(struct scan_control *sc, struct zone *zone)
+{
+	if (global_reclaim(sc))
+		return zone_reclaimable_pages(zone);
+	return memcg_zone_reclaimable_pages(sc->target_mem_cgroup, zone);
+}
+
 #else
 static bool global_reclaim(struct scan_control *sc)
 {
 	return true;
 }
+
+static bool has_kmem_reclaim(struct scan_control *sc)
+{
+	return true;
+}
+
+static unsigned long
+zone_nr_reclaimable_pages(struct scan_control *sc, struct zone *zone)
+{
+	return zone_reclaimable_pages(zone);
+}
 #endif
 
 static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru)
@@ -221,6 +252,9 @@ unsigned long shrink_slab(struct shrink_control *sc,
 		long batch_size = shrinker->batch ? shrinker->batch
 						  : SHRINK_BATCH;
 
+		if (!shrinker->memcg_shrinker && sc->target_mem_cgroup)
+			continue;
+
 		max_pass = shrinker->count_objects(shrinker, sc);
 		WARN_ON(max_pass < 0);
 		if (max_pass <= 0)
@@ -2170,9 +2204,9 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 
 		/*
 		 * Don't shrink slabs when reclaiming memory from
-		 * over limit cgroups
+		 * over limit cgroups, unless we know they have kmem objects
 		 */
-		if (global_reclaim(sc)) {
+		if (has_kmem_reclaim(sc)) {
 			unsigned long lru_pages = 0;
 
 			nodes_clear(shrink->nodes_to_scan);
@@ -2181,7 +2215,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 				if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 					continue;
 
-				lru_pages += zone_reclaimable_pages(zone);
+				lru_pages += zone_nr_reclaimable_pages(sc, zone);
 				node_set(zone_to_nid(zone),
 					 shrink->nodes_to_scan);
 			}
@@ -2443,6 +2477,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 	};
 	struct shrink_control shrink = {
 		.gfp_mask = sc.gfp_mask,
+		.target_mem_cgroup = memcg,
 	};
 
 	/*
-- 
1.8.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 1/7] vmscan: also shrink slab in memcg pressure
@ 2013-02-08 13:07   ` Glauber Costa
  0 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-08 13:07 UTC (permalink / raw)
  To: linux-mm
  Cc: cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel, Glauber Costa,
	Dave Chinner, Mel Gorman, Rik van Riel, Hugh Dickins

Without the surrounding infrastructure, this patch is a bit of a hammer:
it will basically shrink objects from all memcgs under memcg pressure.
At least, however, we will keep the scan limited to the shrinkers marked
as per-memcg.

Future patches will implement the in-shrinker logic to filter objects
based on its memcg association.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/memcontrol.h | 16 ++++++++++++++++
 include/linux/shrinker.h   |  4 ++++
 mm/memcontrol.c            | 11 ++++++++++-
 mm/vmscan.c                | 41 ++++++++++++++++++++++++++++++++++++++---
 4 files changed, 68 insertions(+), 4 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0108a56..b7de557 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -200,6 +200,9 @@ void mem_cgroup_split_huge_fixup(struct page *head);
 bool mem_cgroup_bad_page_check(struct page *page);
 void mem_cgroup_print_bad_page(struct page *page);
 #endif
+
+unsigned long
+memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone);
 #else /* CONFIG_MEMCG */
 struct mem_cgroup;
 
@@ -384,6 +387,11 @@ static inline void mem_cgroup_replace_page_cache(struct page *oldpage,
 				struct page *newpage)
 {
 }
+
+static inline unsigned long
+memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone)
+{
+}
 #endif /* CONFIG_MEMCG */
 
 #if !defined(CONFIG_MEMCG) || !defined(CONFIG_DEBUG_VM)
@@ -436,6 +444,8 @@ static inline bool memcg_kmem_enabled(void)
 	return static_key_false(&memcg_kmem_enabled_key);
 }
 
+bool memcg_kmem_is_active(struct mem_cgroup *memcg);
+
 /*
  * In general, we'll do everything in our power to not incur in any overhead
  * for non-memcg users for the kmem functions. Not even a function call, if we
@@ -569,6 +579,12 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
 	return __memcg_kmem_get_cache(cachep, gfp);
 }
 #else
+
+static inline bool memcg_kmem_is_active(struct mem_cgroup *memcg)
+{
+	return false;
+}
+
 #define for_each_memcg_cache_index(_idx)	\
 	for (; NULL; )
 
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index d4636a0..a767f2e 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -20,6 +20,9 @@ struct shrink_control {
 
 	/* shrink from these nodes */
 	nodemask_t nodes_to_scan;
+
+	/* reclaim from this memcg only (if not NULL) */
+	struct mem_cgroup *target_mem_cgroup;
 };
 
 /*
@@ -45,6 +48,7 @@ struct shrinker {
 
 	int seeks;	/* seeks to recreate an obj */
 	long batch;	/* reclaim batch size, 0 = default */
+	bool memcg_shrinker;
 
 	/* These are for internal use */
 	struct list_head list;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3817460..b1d4dfa 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -442,7 +442,7 @@ static inline void memcg_kmem_set_active(struct mem_cgroup *memcg)
 	set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
 }
 
-static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
+bool memcg_kmem_is_active(struct mem_cgroup *memcg)
 {
 	return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
 }
@@ -991,6 +991,15 @@ mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg, int nid, int zid,
 	return ret;
 }
 
+unsigned long
+memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone)
+{
+	int nid = zone_to_nid(zone);
+	int zid = zone_idx(zone);
+
+	return mem_cgroup_zone_nr_lru_pages(memcg, nid, zid, LRU_ALL);
+}
+
 static unsigned long
 mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
 			int nid, unsigned int lru_mask)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6d96280..8af0e2b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -138,11 +138,42 @@ static bool global_reclaim(struct scan_control *sc)
 {
 	return !sc->target_mem_cgroup;
 }
+
+/*
+ * kmem reclaim should usually not be triggered when we are doing targetted
+ * reclaim. It is only valid when global reclaim is triggered, or when the
+ * underlying memcg has kmem objects.
+ */
+static bool has_kmem_reclaim(struct scan_control *sc)
+{
+	return !sc->target_mem_cgroup ||
+	(sc->target_mem_cgroup && memcg_kmem_is_active(sc->target_mem_cgroup));
+}
+
+static unsigned long
+zone_nr_reclaimable_pages(struct scan_control *sc, struct zone *zone)
+{
+	if (global_reclaim(sc))
+		return zone_reclaimable_pages(zone);
+	return memcg_zone_reclaimable_pages(sc->target_mem_cgroup, zone);
+}
+
 #else
 static bool global_reclaim(struct scan_control *sc)
 {
 	return true;
 }
+
+static bool has_kmem_reclaim(struct scan_control *sc)
+{
+	return true;
+}
+
+static unsigned long
+zone_nr_reclaimable_pages(struct scan_control *sc, struct zone *zone)
+{
+	return zone_reclaimable_pages(zone);
+}
 #endif
 
 static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru)
@@ -221,6 +252,9 @@ unsigned long shrink_slab(struct shrink_control *sc,
 		long batch_size = shrinker->batch ? shrinker->batch
 						  : SHRINK_BATCH;
 
+		if (!shrinker->memcg_shrinker && sc->target_mem_cgroup)
+			continue;
+
 		max_pass = shrinker->count_objects(shrinker, sc);
 		WARN_ON(max_pass < 0);
 		if (max_pass <= 0)
@@ -2170,9 +2204,9 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 
 		/*
 		 * Don't shrink slabs when reclaiming memory from
-		 * over limit cgroups
+		 * over limit cgroups, unless we know they have kmem objects
 		 */
-		if (global_reclaim(sc)) {
+		if (has_kmem_reclaim(sc)) {
 			unsigned long lru_pages = 0;
 
 			nodes_clear(shrink->nodes_to_scan);
@@ -2181,7 +2215,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 				if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 					continue;
 
-				lru_pages += zone_reclaimable_pages(zone);
+				lru_pages += zone_nr_reclaimable_pages(sc, zone);
 				node_set(zone_to_nid(zone),
 					 shrink->nodes_to_scan);
 			}
@@ -2443,6 +2477,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 	};
 	struct shrink_control shrink = {
 		.gfp_mask = sc.gfp_mask,
+		.target_mem_cgroup = memcg,
 	};
 
 	/*
-- 
1.8.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 2/7] memcg,list_lru: duplicate LRUs upon kmemcg creation
  2013-02-08 13:07 ` Glauber Costa
  (?)
@ 2013-02-08 13:07   ` Glauber Costa
  -1 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-08 13:07 UTC (permalink / raw)
  To: linux-mm
  Cc: cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel, Glauber Costa,
	Dave Chinner, Mel Gorman, Rik van Riel, Hugh Dickins

When a new memcg is created, we need to open up room for its descriptors
in all of the list_lrus that are marked per-memcg. The process is quite
similar to the one we are using for the kmem caches: we initialize the
new structures in an array indexed by kmemcg_id, and grow the array if
needed. Key data like the size of the array will be shared between the
kmem cache code and the list_lru code (they basically describe the same
thing)

Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/list_lru.h   |  47 +++++++++++++++++
 include/linux/memcontrol.h |   6 +++
 lib/list_lru.c             | 115 +++++++++++++++++++++++++++++++++++++---
 mm/memcontrol.c            | 128 ++++++++++++++++++++++++++++++++++++++++++---
 mm/slab_common.c           |   1 -
 5 files changed, 283 insertions(+), 14 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 02796da..370b989 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -16,11 +16,58 @@ struct list_lru_node {
 	long			nr_items;
 } ____cacheline_aligned_in_smp;
 
+struct list_lru_array {
+	struct list_lru_node node[1];
+};
+
 struct list_lru {
+	struct list_head	lrus;
 	struct list_lru_node	node[MAX_NUMNODES];
 	nodemask_t		active_nodes;
+#ifdef CONFIG_MEMCG_KMEM
+	struct list_lru_array	**memcg_lrus;
+#endif
 };
 
+struct mem_cgroup;
+#ifdef CONFIG_MEMCG_KMEM
+/*
+ * We will reuse the last bit of the pointer to tell the lru subsystem that
+ * this particular lru should be replicated when a memcg comes in.
+ */
+static inline void lru_memcg_enable(struct list_lru *lru)
+{
+	lru->memcg_lrus = (void *)0x1ULL;
+}
+
+/*
+ * This will return true if we have already allocated and assignment a memcg
+ * pointer set to the LRU. Therefore, we need to mask the first bit out
+ */
+static inline bool lru_memcg_is_assigned(struct list_lru *lru)
+{
+	return (unsigned long)lru->memcg_lrus & ~0x1ULL;
+}
+
+struct list_lru_array *lru_alloc_array(void);
+int memcg_update_all_lrus(unsigned long num);
+void list_lru_destroy(struct list_lru *lru);
+void list_lru_destroy_memcg(struct mem_cgroup *memcg);
+#else
+static inline void lru_memcg_enable(struct list_lru *lru)
+{
+}
+
+static inline bool lru_memcg_is_assigned(struct list_lru *lru)
+{
+	return false;
+}
+
+static inline void list_lru_destroy(struct list_lru *lru)
+{
+}
+#endif
+
 int list_lru_init(struct list_lru *lru);
 int list_lru_add(struct list_lru *lru, struct list_head *item);
 int list_lru_del(struct list_lru *lru, struct list_head *item);
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b7de557..f9558d0 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -23,6 +23,7 @@
 #include <linux/vm_event_item.h>
 #include <linux/hardirq.h>
 #include <linux/jump_label.h>
+#include <linux/list_lru.h>
 
 struct mem_cgroup;
 struct page_cgroup;
@@ -475,6 +476,11 @@ void memcg_update_array_size(int num_groups);
 struct kmem_cache *
 __memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);
 
+int memcg_new_lru(struct list_lru *lru);
+
+int memcg_kmem_update_lru_size(struct list_lru *lru, int num_groups,
+			       bool new_lru);
+
 void mem_cgroup_destroy_cache(struct kmem_cache *cachep);
 void kmem_cache_destroy_memcg_children(struct kmem_cache *s);
 
diff --git a/lib/list_lru.c b/lib/list_lru.c
index 0f08ed6..3b0e89d 100644
--- a/lib/list_lru.c
+++ b/lib/list_lru.c
@@ -8,6 +8,7 @@
 #include <linux/module.h>
 #include <linux/mm.h>
 #include <linux/list_lru.h>
+#include <linux/memcontrol.h>
 
 int
 list_lru_add(
@@ -184,18 +185,118 @@ list_lru_dispose_all(
 	return total;
 }
 
-int
-list_lru_init(
-	struct list_lru	*lru)
+/*
+ * This protects the list of all LRU in the system. One only needs
+ * to take when registering an LRU, or when duplicating the list of lrus.
+ * Transversing an LRU can and should be done outside the lock
+ */
+static DEFINE_MUTEX(all_lrus_mutex);
+static LIST_HEAD(all_lrus);
+
+static void list_lru_init_one(struct list_lru_node *lru)
+{
+	spin_lock_init(&lru->lock);
+	INIT_LIST_HEAD(&lru->list);
+	lru->nr_items = 0;
+}
+
+struct list_lru_array *lru_alloc_array(void)
+{
+	struct list_lru_array *lru_array;
+	int i;
+
+	lru_array = kzalloc(nr_node_ids * sizeof(struct list_lru_node),
+				GFP_KERNEL);
+	if (!lru_array)
+		return NULL;
+
+	for (i = 0; i < nr_node_ids ; i++)
+		list_lru_init_one(&lru_array->node[i]);
+
+	return lru_array;
+}
+
+int __list_lru_init(struct list_lru *lru)
 {
 	int i;
 
 	nodes_clear(lru->active_nodes);
-	for (i = 0; i < MAX_NUMNODES; i++) {
-		spin_lock_init(&lru->node[i].lock);
-		INIT_LIST_HEAD(&lru->node[i].list);
-		lru->node[i].nr_items = 0;
+	for (i = 0; i < MAX_NUMNODES; i++)
+		list_lru_init_one(&lru->node[i]);
+
+	return 0;
+}
+
+#ifdef CONFIG_MEMCG_KMEM
+static int memcg_init_lru(struct list_lru *lru)
+{
+	int ret;
+
+	if (!lru->memcg_lrus)
+		return 0;
+
+	INIT_LIST_HEAD(&lru->lrus);
+	mutex_lock(&all_lrus_mutex);
+	list_add(&lru->lrus, &all_lrus);
+	ret = memcg_new_lru(lru);
+	mutex_unlock(&all_lrus_mutex);
+	return ret;
+}
+
+int memcg_update_all_lrus(unsigned long num)
+{
+	int ret = 0;
+	struct list_lru *lru;
+
+	mutex_lock(&all_lrus_mutex);
+	list_for_each_entry(lru, &all_lrus, lrus) {
+		if (!lru->memcg_lrus)
+			continue;
+
+		ret = memcg_kmem_update_lru_size(lru, num, false);
+		if (ret)
+			goto out;
+	}
+out:
+	mutex_unlock(&all_lrus_mutex);
+	return ret;
+}
+
+void list_lru_destroy(struct list_lru *lru)
+{
+	if (!lru->memcg_lrus)
+		return;
+
+	mutex_lock(&all_lrus_mutex);
+	list_del(&lru->lrus);
+	mutex_unlock(&all_lrus_mutex);
+}
+
+void list_lru_destroy_memcg(struct mem_cgroup *memcg)
+{
+	struct list_lru *lru;
+	mutex_lock(&all_lrus_mutex);
+	list_for_each_entry(lru, &all_lrus, lrus) {
+		lru->memcg_lrus[memcg_cache_id(memcg)] = NULL;
+		/* everybody must beaware that this memcg is no longer valid */
+		wmb();
 	}
+	mutex_unlock(&all_lrus_mutex);
+}
+#else
+static int memcg_init_lru(struct list_lru *lru)
+{
 	return 0;
 }
+#endif
+
+int list_lru_init(struct list_lru *lru)
+{
+	int ret;
+	ret = __list_lru_init(lru);
+	if (ret)
+		return ret;
+
+	return memcg_init_lru(lru);
+}
 EXPORT_SYMBOL_GPL(list_lru_init);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b1d4dfa..b9e1941 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3032,16 +3032,30 @@ int memcg_update_cache_sizes(struct mem_cgroup *memcg)
 	memcg_kmem_set_activated(memcg);
 
 	ret = memcg_update_all_caches(num+1);
-	if (ret) {
-		ida_simple_remove(&kmem_limited_groups, num);
-		memcg_kmem_clear_activated(memcg);
-		return ret;
-	}
+	if (ret)
+		goto out;
+
+	/*
+	 * We should make sure that the array size is not updated until we are
+	 * done; otherwise we have no easy way to know whether or not we should
+	 * grow the array.
+	 */
+	ret = memcg_update_all_lrus(num + 1);
+	if (ret)
+		goto out;
 
 	memcg->kmemcg_id = num;
+
+	memcg_update_array_size(num + 1);
+
 	INIT_LIST_HEAD(&memcg->memcg_slab_caches);
 	mutex_init(&memcg->slab_caches_mutex);
+
 	return 0;
+out:
+	ida_simple_remove(&kmem_limited_groups, num);
+	memcg_kmem_clear_activated(memcg);
+	return ret;
 }
 
 static size_t memcg_caches_array_size(int num_groups)
@@ -3121,6 +3135,106 @@ int memcg_update_cache_size(struct kmem_cache *s, int num_groups)
 	return 0;
 }
 
+/*
+ * memcg_kmem_update_lru_size - fill in kmemcg info into a list_lru
+ *
+ * @lru: the lru we are operating with
+ * @num_groups: how many kmem-limited cgroups we have
+ * @new_lru: true if this is a new_lru being created, false if this
+ * was triggered from the memcg side
+ *
+ * Returns 0 on success, and an error code otherwise.
+ *
+ * This function can be called either when a new kmem-limited memcg appears,
+ * or when a new list_lru is created. The work is roughly the same in two cases,
+ * but in the later we never have to expand the array size.
+ *
+ * This is always protected by the all_lrus_mutex from the list_lru side.
+ */
+int memcg_kmem_update_lru_size(struct list_lru *lru, int num_groups,
+			       bool new_lru)
+{
+	struct list_lru_array **new_lru_array;
+	struct list_lru_array *lru_array;
+
+	lru_array = lru_alloc_array();
+	if (!lru_array)
+		return -ENOMEM;
+
+	/* need some fucked up locking around the list acquisition */
+	if ((num_groups > memcg_limited_groups_array_size) || new_lru) {
+		int i;
+		struct list_lru_array **old_array;
+		size_t size = memcg_caches_array_size(num_groups);
+
+		new_lru_array = kzalloc(size * sizeof(void *), GFP_KERNEL);
+		if (!new_lru_array) {
+			kfree(lru_array);
+			return -ENOMEM;
+		}
+
+		for (i = 0; i < memcg_limited_groups_array_size; i++) {
+			if (!lru_memcg_is_assigned(lru) || lru->memcg_lrus[i])
+				continue;
+			new_lru_array[i] =  lru->memcg_lrus[i];
+		}
+
+		old_array = lru->memcg_lrus;
+		lru->memcg_lrus = new_lru_array;
+		/*
+		 * We don't need a barrier here because we are just copying
+		 * information over. Anybody operating in memcg_lrus will
+		 * either follow the new array or the old one and they contain
+		 * exactly the same information. The new space in the end is
+		 * always empty anyway.
+		 *
+		 * We do have to make sure that no more users of the old
+		 * memcg_lrus array exist before we free, and this is achieved
+		 * by the synchronize_lru below.
+		 */
+		if (lru_memcg_is_assigned(lru)) {
+			synchronize_rcu();
+			kfree(old_array);
+		}
+
+	}
+
+	if (lru_memcg_is_assigned(lru)) {
+		lru->memcg_lrus[num_groups - 1] = lru_array;
+		/*
+		 * Here we do need the barrier, because of the state transition
+		 * implied by the assignment of the array. All users should be
+		 * able to see it
+		 */
+		wmb();
+	}
+
+	return 0;
+
+}
+
+int memcg_new_lru(struct list_lru *lru)
+{
+	struct mem_cgroup *iter;
+
+	if (!memcg_kmem_enabled())
+		return 0;
+
+	for_each_mem_cgroup(iter) {
+		int ret;
+		int memcg_id = memcg_cache_id(iter);
+		if (memcg_id < 0)
+			continue;
+
+		ret = memcg_kmem_update_lru_size(lru, memcg_id + 1, true);
+		if (ret) {
+			mem_cgroup_iter_break(root_mem_cgroup, iter);
+			return ret;
+		}
+	}
+	return 0;
+}
+
 int memcg_register_cache(struct mem_cgroup *memcg, struct kmem_cache *s,
 			 struct kmem_cache *root_cache)
 {
@@ -5914,8 +6028,10 @@ static void kmem_cgroup_destroy(struct mem_cgroup *memcg)
 	 * possible that the charges went down to 0 between mark_dead and the
 	 * res_counter read, so in that case, we don't need the put
 	 */
-	if (memcg_kmem_test_and_clear_dead(memcg))
+	if (memcg_kmem_test_and_clear_dead(memcg)) {
+		list_lru_destroy_memcg(memcg);
 		mem_cgroup_put(memcg);
+	}
 }
 #else
 static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 3f3cd97..2470d11 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -102,7 +102,6 @@ int memcg_update_all_caches(int num_memcgs)
 			goto out;
 	}
 
-	memcg_update_array_size(num_memcgs);
 out:
 	mutex_unlock(&slab_mutex);
 	return ret;
-- 
1.8.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 2/7] memcg,list_lru: duplicate LRUs upon kmemcg creation
@ 2013-02-08 13:07   ` Glauber Costa
  0 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-08 13:07 UTC (permalink / raw)
  To: linux-mm
  Cc: cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel, Glauber Costa,
	Dave Chinner, Mel Gorman, Rik van Riel, Hugh Dickins

When a new memcg is created, we need to open up room for its descriptors
in all of the list_lrus that are marked per-memcg. The process is quite
similar to the one we are using for the kmem caches: we initialize the
new structures in an array indexed by kmemcg_id, and grow the array if
needed. Key data like the size of the array will be shared between the
kmem cache code and the list_lru code (they basically describe the same
thing)

Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/list_lru.h   |  47 +++++++++++++++++
 include/linux/memcontrol.h |   6 +++
 lib/list_lru.c             | 115 +++++++++++++++++++++++++++++++++++++---
 mm/memcontrol.c            | 128 ++++++++++++++++++++++++++++++++++++++++++---
 mm/slab_common.c           |   1 -
 5 files changed, 283 insertions(+), 14 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 02796da..370b989 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -16,11 +16,58 @@ struct list_lru_node {
 	long			nr_items;
 } ____cacheline_aligned_in_smp;
 
+struct list_lru_array {
+	struct list_lru_node node[1];
+};
+
 struct list_lru {
+	struct list_head	lrus;
 	struct list_lru_node	node[MAX_NUMNODES];
 	nodemask_t		active_nodes;
+#ifdef CONFIG_MEMCG_KMEM
+	struct list_lru_array	**memcg_lrus;
+#endif
 };
 
+struct mem_cgroup;
+#ifdef CONFIG_MEMCG_KMEM
+/*
+ * We will reuse the last bit of the pointer to tell the lru subsystem that
+ * this particular lru should be replicated when a memcg comes in.
+ */
+static inline void lru_memcg_enable(struct list_lru *lru)
+{
+	lru->memcg_lrus = (void *)0x1ULL;
+}
+
+/*
+ * This will return true if we have already allocated and assignment a memcg
+ * pointer set to the LRU. Therefore, we need to mask the first bit out
+ */
+static inline bool lru_memcg_is_assigned(struct list_lru *lru)
+{
+	return (unsigned long)lru->memcg_lrus & ~0x1ULL;
+}
+
+struct list_lru_array *lru_alloc_array(void);
+int memcg_update_all_lrus(unsigned long num);
+void list_lru_destroy(struct list_lru *lru);
+void list_lru_destroy_memcg(struct mem_cgroup *memcg);
+#else
+static inline void lru_memcg_enable(struct list_lru *lru)
+{
+}
+
+static inline bool lru_memcg_is_assigned(struct list_lru *lru)
+{
+	return false;
+}
+
+static inline void list_lru_destroy(struct list_lru *lru)
+{
+}
+#endif
+
 int list_lru_init(struct list_lru *lru);
 int list_lru_add(struct list_lru *lru, struct list_head *item);
 int list_lru_del(struct list_lru *lru, struct list_head *item);
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b7de557..f9558d0 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -23,6 +23,7 @@
 #include <linux/vm_event_item.h>
 #include <linux/hardirq.h>
 #include <linux/jump_label.h>
+#include <linux/list_lru.h>
 
 struct mem_cgroup;
 struct page_cgroup;
@@ -475,6 +476,11 @@ void memcg_update_array_size(int num_groups);
 struct kmem_cache *
 __memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);
 
+int memcg_new_lru(struct list_lru *lru);
+
+int memcg_kmem_update_lru_size(struct list_lru *lru, int num_groups,
+			       bool new_lru);
+
 void mem_cgroup_destroy_cache(struct kmem_cache *cachep);
 void kmem_cache_destroy_memcg_children(struct kmem_cache *s);
 
diff --git a/lib/list_lru.c b/lib/list_lru.c
index 0f08ed6..3b0e89d 100644
--- a/lib/list_lru.c
+++ b/lib/list_lru.c
@@ -8,6 +8,7 @@
 #include <linux/module.h>
 #include <linux/mm.h>
 #include <linux/list_lru.h>
+#include <linux/memcontrol.h>
 
 int
 list_lru_add(
@@ -184,18 +185,118 @@ list_lru_dispose_all(
 	return total;
 }
 
-int
-list_lru_init(
-	struct list_lru	*lru)
+/*
+ * This protects the list of all LRU in the system. One only needs
+ * to take when registering an LRU, or when duplicating the list of lrus.
+ * Transversing an LRU can and should be done outside the lock
+ */
+static DEFINE_MUTEX(all_lrus_mutex);
+static LIST_HEAD(all_lrus);
+
+static void list_lru_init_one(struct list_lru_node *lru)
+{
+	spin_lock_init(&lru->lock);
+	INIT_LIST_HEAD(&lru->list);
+	lru->nr_items = 0;
+}
+
+struct list_lru_array *lru_alloc_array(void)
+{
+	struct list_lru_array *lru_array;
+	int i;
+
+	lru_array = kzalloc(nr_node_ids * sizeof(struct list_lru_node),
+				GFP_KERNEL);
+	if (!lru_array)
+		return NULL;
+
+	for (i = 0; i < nr_node_ids ; i++)
+		list_lru_init_one(&lru_array->node[i]);
+
+	return lru_array;
+}
+
+int __list_lru_init(struct list_lru *lru)
 {
 	int i;
 
 	nodes_clear(lru->active_nodes);
-	for (i = 0; i < MAX_NUMNODES; i++) {
-		spin_lock_init(&lru->node[i].lock);
-		INIT_LIST_HEAD(&lru->node[i].list);
-		lru->node[i].nr_items = 0;
+	for (i = 0; i < MAX_NUMNODES; i++)
+		list_lru_init_one(&lru->node[i]);
+
+	return 0;
+}
+
+#ifdef CONFIG_MEMCG_KMEM
+static int memcg_init_lru(struct list_lru *lru)
+{
+	int ret;
+
+	if (!lru->memcg_lrus)
+		return 0;
+
+	INIT_LIST_HEAD(&lru->lrus);
+	mutex_lock(&all_lrus_mutex);
+	list_add(&lru->lrus, &all_lrus);
+	ret = memcg_new_lru(lru);
+	mutex_unlock(&all_lrus_mutex);
+	return ret;
+}
+
+int memcg_update_all_lrus(unsigned long num)
+{
+	int ret = 0;
+	struct list_lru *lru;
+
+	mutex_lock(&all_lrus_mutex);
+	list_for_each_entry(lru, &all_lrus, lrus) {
+		if (!lru->memcg_lrus)
+			continue;
+
+		ret = memcg_kmem_update_lru_size(lru, num, false);
+		if (ret)
+			goto out;
+	}
+out:
+	mutex_unlock(&all_lrus_mutex);
+	return ret;
+}
+
+void list_lru_destroy(struct list_lru *lru)
+{
+	if (!lru->memcg_lrus)
+		return;
+
+	mutex_lock(&all_lrus_mutex);
+	list_del(&lru->lrus);
+	mutex_unlock(&all_lrus_mutex);
+}
+
+void list_lru_destroy_memcg(struct mem_cgroup *memcg)
+{
+	struct list_lru *lru;
+	mutex_lock(&all_lrus_mutex);
+	list_for_each_entry(lru, &all_lrus, lrus) {
+		lru->memcg_lrus[memcg_cache_id(memcg)] = NULL;
+		/* everybody must beaware that this memcg is no longer valid */
+		wmb();
 	}
+	mutex_unlock(&all_lrus_mutex);
+}
+#else
+static int memcg_init_lru(struct list_lru *lru)
+{
 	return 0;
 }
+#endif
+
+int list_lru_init(struct list_lru *lru)
+{
+	int ret;
+	ret = __list_lru_init(lru);
+	if (ret)
+		return ret;
+
+	return memcg_init_lru(lru);
+}
 EXPORT_SYMBOL_GPL(list_lru_init);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b1d4dfa..b9e1941 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3032,16 +3032,30 @@ int memcg_update_cache_sizes(struct mem_cgroup *memcg)
 	memcg_kmem_set_activated(memcg);
 
 	ret = memcg_update_all_caches(num+1);
-	if (ret) {
-		ida_simple_remove(&kmem_limited_groups, num);
-		memcg_kmem_clear_activated(memcg);
-		return ret;
-	}
+	if (ret)
+		goto out;
+
+	/*
+	 * We should make sure that the array size is not updated until we are
+	 * done; otherwise we have no easy way to know whether or not we should
+	 * grow the array.
+	 */
+	ret = memcg_update_all_lrus(num + 1);
+	if (ret)
+		goto out;
 
 	memcg->kmemcg_id = num;
+
+	memcg_update_array_size(num + 1);
+
 	INIT_LIST_HEAD(&memcg->memcg_slab_caches);
 	mutex_init(&memcg->slab_caches_mutex);
+
 	return 0;
+out:
+	ida_simple_remove(&kmem_limited_groups, num);
+	memcg_kmem_clear_activated(memcg);
+	return ret;
 }
 
 static size_t memcg_caches_array_size(int num_groups)
@@ -3121,6 +3135,106 @@ int memcg_update_cache_size(struct kmem_cache *s, int num_groups)
 	return 0;
 }
 
+/*
+ * memcg_kmem_update_lru_size - fill in kmemcg info into a list_lru
+ *
+ * @lru: the lru we are operating with
+ * @num_groups: how many kmem-limited cgroups we have
+ * @new_lru: true if this is a new_lru being created, false if this
+ * was triggered from the memcg side
+ *
+ * Returns 0 on success, and an error code otherwise.
+ *
+ * This function can be called either when a new kmem-limited memcg appears,
+ * or when a new list_lru is created. The work is roughly the same in two cases,
+ * but in the later we never have to expand the array size.
+ *
+ * This is always protected by the all_lrus_mutex from the list_lru side.
+ */
+int memcg_kmem_update_lru_size(struct list_lru *lru, int num_groups,
+			       bool new_lru)
+{
+	struct list_lru_array **new_lru_array;
+	struct list_lru_array *lru_array;
+
+	lru_array = lru_alloc_array();
+	if (!lru_array)
+		return -ENOMEM;
+
+	/* need some fucked up locking around the list acquisition */
+	if ((num_groups > memcg_limited_groups_array_size) || new_lru) {
+		int i;
+		struct list_lru_array **old_array;
+		size_t size = memcg_caches_array_size(num_groups);
+
+		new_lru_array = kzalloc(size * sizeof(void *), GFP_KERNEL);
+		if (!new_lru_array) {
+			kfree(lru_array);
+			return -ENOMEM;
+		}
+
+		for (i = 0; i < memcg_limited_groups_array_size; i++) {
+			if (!lru_memcg_is_assigned(lru) || lru->memcg_lrus[i])
+				continue;
+			new_lru_array[i] =  lru->memcg_lrus[i];
+		}
+
+		old_array = lru->memcg_lrus;
+		lru->memcg_lrus = new_lru_array;
+		/*
+		 * We don't need a barrier here because we are just copying
+		 * information over. Anybody operating in memcg_lrus will
+		 * either follow the new array or the old one and they contain
+		 * exactly the same information. The new space in the end is
+		 * always empty anyway.
+		 *
+		 * We do have to make sure that no more users of the old
+		 * memcg_lrus array exist before we free, and this is achieved
+		 * by the synchronize_lru below.
+		 */
+		if (lru_memcg_is_assigned(lru)) {
+			synchronize_rcu();
+			kfree(old_array);
+		}
+
+	}
+
+	if (lru_memcg_is_assigned(lru)) {
+		lru->memcg_lrus[num_groups - 1] = lru_array;
+		/*
+		 * Here we do need the barrier, because of the state transition
+		 * implied by the assignment of the array. All users should be
+		 * able to see it
+		 */
+		wmb();
+	}
+
+	return 0;
+
+}
+
+int memcg_new_lru(struct list_lru *lru)
+{
+	struct mem_cgroup *iter;
+
+	if (!memcg_kmem_enabled())
+		return 0;
+
+	for_each_mem_cgroup(iter) {
+		int ret;
+		int memcg_id = memcg_cache_id(iter);
+		if (memcg_id < 0)
+			continue;
+
+		ret = memcg_kmem_update_lru_size(lru, memcg_id + 1, true);
+		if (ret) {
+			mem_cgroup_iter_break(root_mem_cgroup, iter);
+			return ret;
+		}
+	}
+	return 0;
+}
+
 int memcg_register_cache(struct mem_cgroup *memcg, struct kmem_cache *s,
 			 struct kmem_cache *root_cache)
 {
@@ -5914,8 +6028,10 @@ static void kmem_cgroup_destroy(struct mem_cgroup *memcg)
 	 * possible that the charges went down to 0 between mark_dead and the
 	 * res_counter read, so in that case, we don't need the put
 	 */
-	if (memcg_kmem_test_and_clear_dead(memcg))
+	if (memcg_kmem_test_and_clear_dead(memcg)) {
+		list_lru_destroy_memcg(memcg);
 		mem_cgroup_put(memcg);
+	}
 }
 #else
 static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 3f3cd97..2470d11 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -102,7 +102,6 @@ int memcg_update_all_caches(int num_memcgs)
 			goto out;
 	}
 
-	memcg_update_array_size(num_memcgs);
 out:
 	mutex_unlock(&slab_mutex);
 	return ret;
-- 
1.8.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 2/7] memcg,list_lru: duplicate LRUs upon kmemcg creation
@ 2013-02-08 13:07   ` Glauber Costa
  0 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-08 13:07 UTC (permalink / raw)
  To: linux-mm
  Cc: cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel, Glauber Costa,
	Dave Chinner, Mel Gorman, Rik van Riel, Hugh Dickins

When a new memcg is created, we need to open up room for its descriptors
in all of the list_lrus that are marked per-memcg. The process is quite
similar to the one we are using for the kmem caches: we initialize the
new structures in an array indexed by kmemcg_id, and grow the array if
needed. Key data like the size of the array will be shared between the
kmem cache code and the list_lru code (they basically describe the same
thing)

Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/list_lru.h   |  47 +++++++++++++++++
 include/linux/memcontrol.h |   6 +++
 lib/list_lru.c             | 115 +++++++++++++++++++++++++++++++++++++---
 mm/memcontrol.c            | 128 ++++++++++++++++++++++++++++++++++++++++++---
 mm/slab_common.c           |   1 -
 5 files changed, 283 insertions(+), 14 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 02796da..370b989 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -16,11 +16,58 @@ struct list_lru_node {
 	long			nr_items;
 } ____cacheline_aligned_in_smp;
 
+struct list_lru_array {
+	struct list_lru_node node[1];
+};
+
 struct list_lru {
+	struct list_head	lrus;
 	struct list_lru_node	node[MAX_NUMNODES];
 	nodemask_t		active_nodes;
+#ifdef CONFIG_MEMCG_KMEM
+	struct list_lru_array	**memcg_lrus;
+#endif
 };
 
+struct mem_cgroup;
+#ifdef CONFIG_MEMCG_KMEM
+/*
+ * We will reuse the last bit of the pointer to tell the lru subsystem that
+ * this particular lru should be replicated when a memcg comes in.
+ */
+static inline void lru_memcg_enable(struct list_lru *lru)
+{
+	lru->memcg_lrus = (void *)0x1ULL;
+}
+
+/*
+ * This will return true if we have already allocated and assignment a memcg
+ * pointer set to the LRU. Therefore, we need to mask the first bit out
+ */
+static inline bool lru_memcg_is_assigned(struct list_lru *lru)
+{
+	return (unsigned long)lru->memcg_lrus & ~0x1ULL;
+}
+
+struct list_lru_array *lru_alloc_array(void);
+int memcg_update_all_lrus(unsigned long num);
+void list_lru_destroy(struct list_lru *lru);
+void list_lru_destroy_memcg(struct mem_cgroup *memcg);
+#else
+static inline void lru_memcg_enable(struct list_lru *lru)
+{
+}
+
+static inline bool lru_memcg_is_assigned(struct list_lru *lru)
+{
+	return false;
+}
+
+static inline void list_lru_destroy(struct list_lru *lru)
+{
+}
+#endif
+
 int list_lru_init(struct list_lru *lru);
 int list_lru_add(struct list_lru *lru, struct list_head *item);
 int list_lru_del(struct list_lru *lru, struct list_head *item);
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b7de557..f9558d0 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -23,6 +23,7 @@
 #include <linux/vm_event_item.h>
 #include <linux/hardirq.h>
 #include <linux/jump_label.h>
+#include <linux/list_lru.h>
 
 struct mem_cgroup;
 struct page_cgroup;
@@ -475,6 +476,11 @@ void memcg_update_array_size(int num_groups);
 struct kmem_cache *
 __memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);
 
+int memcg_new_lru(struct list_lru *lru);
+
+int memcg_kmem_update_lru_size(struct list_lru *lru, int num_groups,
+			       bool new_lru);
+
 void mem_cgroup_destroy_cache(struct kmem_cache *cachep);
 void kmem_cache_destroy_memcg_children(struct kmem_cache *s);
 
diff --git a/lib/list_lru.c b/lib/list_lru.c
index 0f08ed6..3b0e89d 100644
--- a/lib/list_lru.c
+++ b/lib/list_lru.c
@@ -8,6 +8,7 @@
 #include <linux/module.h>
 #include <linux/mm.h>
 #include <linux/list_lru.h>
+#include <linux/memcontrol.h>
 
 int
 list_lru_add(
@@ -184,18 +185,118 @@ list_lru_dispose_all(
 	return total;
 }
 
-int
-list_lru_init(
-	struct list_lru	*lru)
+/*
+ * This protects the list of all LRU in the system. One only needs
+ * to take when registering an LRU, or when duplicating the list of lrus.
+ * Transversing an LRU can and should be done outside the lock
+ */
+static DEFINE_MUTEX(all_lrus_mutex);
+static LIST_HEAD(all_lrus);
+
+static void list_lru_init_one(struct list_lru_node *lru)
+{
+	spin_lock_init(&lru->lock);
+	INIT_LIST_HEAD(&lru->list);
+	lru->nr_items = 0;
+}
+
+struct list_lru_array *lru_alloc_array(void)
+{
+	struct list_lru_array *lru_array;
+	int i;
+
+	lru_array = kzalloc(nr_node_ids * sizeof(struct list_lru_node),
+				GFP_KERNEL);
+	if (!lru_array)
+		return NULL;
+
+	for (i = 0; i < nr_node_ids ; i++)
+		list_lru_init_one(&lru_array->node[i]);
+
+	return lru_array;
+}
+
+int __list_lru_init(struct list_lru *lru)
 {
 	int i;
 
 	nodes_clear(lru->active_nodes);
-	for (i = 0; i < MAX_NUMNODES; i++) {
-		spin_lock_init(&lru->node[i].lock);
-		INIT_LIST_HEAD(&lru->node[i].list);
-		lru->node[i].nr_items = 0;
+	for (i = 0; i < MAX_NUMNODES; i++)
+		list_lru_init_one(&lru->node[i]);
+
+	return 0;
+}
+
+#ifdef CONFIG_MEMCG_KMEM
+static int memcg_init_lru(struct list_lru *lru)
+{
+	int ret;
+
+	if (!lru->memcg_lrus)
+		return 0;
+
+	INIT_LIST_HEAD(&lru->lrus);
+	mutex_lock(&all_lrus_mutex);
+	list_add(&lru->lrus, &all_lrus);
+	ret = memcg_new_lru(lru);
+	mutex_unlock(&all_lrus_mutex);
+	return ret;
+}
+
+int memcg_update_all_lrus(unsigned long num)
+{
+	int ret = 0;
+	struct list_lru *lru;
+
+	mutex_lock(&all_lrus_mutex);
+	list_for_each_entry(lru, &all_lrus, lrus) {
+		if (!lru->memcg_lrus)
+			continue;
+
+		ret = memcg_kmem_update_lru_size(lru, num, false);
+		if (ret)
+			goto out;
+	}
+out:
+	mutex_unlock(&all_lrus_mutex);
+	return ret;
+}
+
+void list_lru_destroy(struct list_lru *lru)
+{
+	if (!lru->memcg_lrus)
+		return;
+
+	mutex_lock(&all_lrus_mutex);
+	list_del(&lru->lrus);
+	mutex_unlock(&all_lrus_mutex);
+}
+
+void list_lru_destroy_memcg(struct mem_cgroup *memcg)
+{
+	struct list_lru *lru;
+	mutex_lock(&all_lrus_mutex);
+	list_for_each_entry(lru, &all_lrus, lrus) {
+		lru->memcg_lrus[memcg_cache_id(memcg)] = NULL;
+		/* everybody must beaware that this memcg is no longer valid */
+		wmb();
 	}
+	mutex_unlock(&all_lrus_mutex);
+}
+#else
+static int memcg_init_lru(struct list_lru *lru)
+{
 	return 0;
 }
+#endif
+
+int list_lru_init(struct list_lru *lru)
+{
+	int ret;
+	ret = __list_lru_init(lru);
+	if (ret)
+		return ret;
+
+	return memcg_init_lru(lru);
+}
 EXPORT_SYMBOL_GPL(list_lru_init);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b1d4dfa..b9e1941 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3032,16 +3032,30 @@ int memcg_update_cache_sizes(struct mem_cgroup *memcg)
 	memcg_kmem_set_activated(memcg);
 
 	ret = memcg_update_all_caches(num+1);
-	if (ret) {
-		ida_simple_remove(&kmem_limited_groups, num);
-		memcg_kmem_clear_activated(memcg);
-		return ret;
-	}
+	if (ret)
+		goto out;
+
+	/*
+	 * We should make sure that the array size is not updated until we are
+	 * done; otherwise we have no easy way to know whether or not we should
+	 * grow the array.
+	 */
+	ret = memcg_update_all_lrus(num + 1);
+	if (ret)
+		goto out;
 
 	memcg->kmemcg_id = num;
+
+	memcg_update_array_size(num + 1);
+
 	INIT_LIST_HEAD(&memcg->memcg_slab_caches);
 	mutex_init(&memcg->slab_caches_mutex);
+
 	return 0;
+out:
+	ida_simple_remove(&kmem_limited_groups, num);
+	memcg_kmem_clear_activated(memcg);
+	return ret;
 }
 
 static size_t memcg_caches_array_size(int num_groups)
@@ -3121,6 +3135,106 @@ int memcg_update_cache_size(struct kmem_cache *s, int num_groups)
 	return 0;
 }
 
+/*
+ * memcg_kmem_update_lru_size - fill in kmemcg info into a list_lru
+ *
+ * @lru: the lru we are operating with
+ * @num_groups: how many kmem-limited cgroups we have
+ * @new_lru: true if this is a new_lru being created, false if this
+ * was triggered from the memcg side
+ *
+ * Returns 0 on success, and an error code otherwise.
+ *
+ * This function can be called either when a new kmem-limited memcg appears,
+ * or when a new list_lru is created. The work is roughly the same in two cases,
+ * but in the later we never have to expand the array size.
+ *
+ * This is always protected by the all_lrus_mutex from the list_lru side.
+ */
+int memcg_kmem_update_lru_size(struct list_lru *lru, int num_groups,
+			       bool new_lru)
+{
+	struct list_lru_array **new_lru_array;
+	struct list_lru_array *lru_array;
+
+	lru_array = lru_alloc_array();
+	if (!lru_array)
+		return -ENOMEM;
+
+	/* need some fucked up locking around the list acquisition */
+	if ((num_groups > memcg_limited_groups_array_size) || new_lru) {
+		int i;
+		struct list_lru_array **old_array;
+		size_t size = memcg_caches_array_size(num_groups);
+
+		new_lru_array = kzalloc(size * sizeof(void *), GFP_KERNEL);
+		if (!new_lru_array) {
+			kfree(lru_array);
+			return -ENOMEM;
+		}
+
+		for (i = 0; i < memcg_limited_groups_array_size; i++) {
+			if (!lru_memcg_is_assigned(lru) || lru->memcg_lrus[i])
+				continue;
+			new_lru_array[i] =  lru->memcg_lrus[i];
+		}
+
+		old_array = lru->memcg_lrus;
+		lru->memcg_lrus = new_lru_array;
+		/*
+		 * We don't need a barrier here because we are just copying
+		 * information over. Anybody operating in memcg_lrus will
+		 * either follow the new array or the old one and they contain
+		 * exactly the same information. The new space in the end is
+		 * always empty anyway.
+		 *
+		 * We do have to make sure that no more users of the old
+		 * memcg_lrus array exist before we free, and this is achieved
+		 * by the synchronize_lru below.
+		 */
+		if (lru_memcg_is_assigned(lru)) {
+			synchronize_rcu();
+			kfree(old_array);
+		}
+
+	}
+
+	if (lru_memcg_is_assigned(lru)) {
+		lru->memcg_lrus[num_groups - 1] = lru_array;
+		/*
+		 * Here we do need the barrier, because of the state transition
+		 * implied by the assignment of the array. All users should be
+		 * able to see it
+		 */
+		wmb();
+	}
+
+	return 0;
+
+}
+
+int memcg_new_lru(struct list_lru *lru)
+{
+	struct mem_cgroup *iter;
+
+	if (!memcg_kmem_enabled())
+		return 0;
+
+	for_each_mem_cgroup(iter) {
+		int ret;
+		int memcg_id = memcg_cache_id(iter);
+		if (memcg_id < 0)
+			continue;
+
+		ret = memcg_kmem_update_lru_size(lru, memcg_id + 1, true);
+		if (ret) {
+			mem_cgroup_iter_break(root_mem_cgroup, iter);
+			return ret;
+		}
+	}
+	return 0;
+}
+
 int memcg_register_cache(struct mem_cgroup *memcg, struct kmem_cache *s,
 			 struct kmem_cache *root_cache)
 {
@@ -5914,8 +6028,10 @@ static void kmem_cgroup_destroy(struct mem_cgroup *memcg)
 	 * possible that the charges went down to 0 between mark_dead and the
 	 * res_counter read, so in that case, we don't need the put
 	 */
-	if (memcg_kmem_test_and_clear_dead(memcg))
+	if (memcg_kmem_test_and_clear_dead(memcg)) {
+		list_lru_destroy_memcg(memcg);
 		mem_cgroup_put(memcg);
+	}
 }
 #else
 static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 3f3cd97..2470d11 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -102,7 +102,6 @@ int memcg_update_all_caches(int num_memcgs)
 			goto out;
 	}
 
-	memcg_update_array_size(num_memcgs);
 out:
 	mutex_unlock(&slab_mutex);
 	return ret;
-- 
1.8.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 3/7] lru: add an element to a memcg list
  2013-02-08 13:07 ` Glauber Costa
  (?)
@ 2013-02-08 13:07   ` Glauber Costa
  -1 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-08 13:07 UTC (permalink / raw)
  To: linux-mm
  Cc: cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel, Glauber Costa,
	Dave Chinner, Mel Gorman, Rik van Riel, Hugh Dickins

With the infrastructure we now have, we can add an element to a memcg
LRU list instead of the global list. The memcg lists are still
per-node.

Technically, we will never trigger per-node shrinking in the memcg is
short of memory. Therefore an alternative to this would be to add the
element to *both* a single-node memcg array and a per-node global array.

There are two main reasons for this design choice:

1) adding an extra list_head to each of the objects would waste 16-bytes
per object, always remembering that we are talking about 1 dentry + 1
inode in the common case. This means a close to 10 % increase in the
dentry size, and a lower yet significant increase in the inode size. In
terms of total memory, this design pays 32-byte per-superblock-per-node
(size of struct list_lru_node), which means that in any scenario where
we have more than 10 dentries + inodes, we would already be paying more
memory in the two-list-heads approach than we will here with 1 node x 10
superblocks. The turning point of course depends on the workload, but I
hope the figures above would convince you that the memory footprint is
in my side in any workload that matters.

2) The main drawback of this, namely, that we loose global LRU order, is
not really seen by me as a disadvantage: if we are using memcg to
isolate the workloads, global pressure should try to balance the amount
reclaimed from all memcgs the same way the shrinkers will already
naturally balance the amount reclaimed from each superblock. (This
patchset needs some love in this regard, btw).

To help us easily tracking down which nodes have and which nodes doesn't
have elements in the list, we will count on an auxiliary node bitmap in
the global level.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/list_lru.h   | 10 ++++++++
 include/linux/memcontrol.h | 10 ++++++++
 lib/list_lru.c             | 63 ++++++++++++++++++++++++++++++++++++++++------
 mm/memcontrol.c            | 30 ++++++++++++++++++++++
 4 files changed, 105 insertions(+), 8 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 370b989..5d8f7ab 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -23,6 +23,7 @@ struct list_lru_array {
 struct list_lru {
 	struct list_head	lrus;
 	struct list_lru_node	node[MAX_NUMNODES];
+	atomic_long_t		node_totals[MAX_NUMNODES];
 	nodemask_t		active_nodes;
 #ifdef CONFIG_MEMCG_KMEM
 	struct list_lru_array	**memcg_lrus;
@@ -53,6 +54,8 @@ struct list_lru_array *lru_alloc_array(void);
 int memcg_update_all_lrus(unsigned long num);
 void list_lru_destroy(struct list_lru *lru);
 void list_lru_destroy_memcg(struct mem_cgroup *memcg);
+struct list_lru_node *
+lru_node_of_index(struct list_lru *lru, int index, int nid);
 #else
 static inline void lru_memcg_enable(struct list_lru *lru)
 {
@@ -66,6 +69,13 @@ static inline bool lru_memcg_is_assigned(struct list_lru *lru)
 static inline void list_lru_destroy(struct list_lru *lru)
 {
 }
+
+static inline struct list_lru_node *
+lru_node_of_index(struct list_lru *lru, int index, int nid);
+{
+	BUG_ON(index < 0); /* index != -1 with !MEMCG_KMEM. Impossible */
+	return &lru->node[nid];
+}
 #endif
 
 int list_lru_init(struct list_lru *lru);
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f9558d0..3510730 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -24,6 +24,7 @@
 #include <linux/hardirq.h>
 #include <linux/jump_label.h>
 #include <linux/list_lru.h>
+#include <linux/mm.h>
 
 struct mem_cgroup;
 struct page_cgroup;
@@ -478,6 +479,9 @@ __memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);
 
 int memcg_new_lru(struct list_lru *lru);
 
+struct list_lru_node *
+memcg_kmem_lru_of_page(struct list_lru *lru, struct page *page);
+
 int memcg_kmem_update_lru_size(struct list_lru *lru, int num_groups,
 			       bool new_lru);
 
@@ -644,6 +648,12 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
 static inline void kmem_cache_destroy_memcg_children(struct kmem_cache *s)
 {
 }
+
+static inline struct list_lru_node *
+memcg_kmem_lru_of_page(struct list_lru *lru, struct page *page)
+{
+	return &lru->node[page_to_nid(page)];
+}
 #endif /* CONFIG_MEMCG_KMEM */
 #endif /* _LINUX_MEMCONTROL_H */
 
diff --git a/lib/list_lru.c b/lib/list_lru.c
index 3b0e89d..8c36579 100644
--- a/lib/list_lru.c
+++ b/lib/list_lru.c
@@ -15,14 +15,22 @@ list_lru_add(
 	struct list_lru	*lru,
 	struct list_head *item)
 {
-	int nid = page_to_nid(virt_to_page(item));
-	struct list_lru_node *nlru = &lru->node[nid];
+	struct page *page = virt_to_page(item);
+	struct list_lru_node *nlru;
+	int nid = page_to_nid(page);
+
+	nlru = memcg_kmem_lru_of_page(lru, page);
 
 	spin_lock(&nlru->lock);
 	BUG_ON(nlru->nr_items < 0);
 	if (list_empty(item)) {
 		list_add_tail(item, &nlru->list);
-		if (nlru->nr_items++ == 0)
+		nlru->nr_items++;
+		/*
+		 * We only consider a node active or inactive based on the
+		 * total figure for all involved children.
+		 */
+		if (atomic_long_inc_and_test(&lru->node_totals[nid]))
 			node_set(nid, lru->active_nodes);
 		spin_unlock(&nlru->lock);
 		return 1;
@@ -37,14 +45,20 @@ list_lru_del(
 	struct list_lru	*lru,
 	struct list_head *item)
 {
-	int nid = page_to_nid(virt_to_page(item));
-	struct list_lru_node *nlru = &lru->node[nid];
+	struct page *page = virt_to_page(item);
+	struct list_lru_node *nlru;
+	int nid = page_to_nid(page);
+
+	nlru = memcg_kmem_lru_of_page(lru, page);
 
 	spin_lock(&nlru->lock);
 	if (!list_empty(item)) {
 		list_del_init(item);
-		if (--nlru->nr_items == 0)
+		nlru->nr_items--;
+
+		if (atomic_long_dec_and_test(&lru->node_totals[nid]))
 			node_clear(nid, lru->active_nodes);
+
 		BUG_ON(nlru->nr_items < 0);
 		spin_unlock(&nlru->lock);
 		return 1;
@@ -97,7 +111,9 @@ restart:
 		ret = isolate(item, &nlru->lock, cb_arg);
 		switch (ret) {
 		case 0:	/* item removed from list */
-			if (--nlru->nr_items == 0)
+			nlru->nr_items--;
+
+			if (atomic_long_dec_and_test(&lru->node_totals[nid]))
 				node_clear(nid, lru->active_nodes);
 			BUG_ON(nlru->nr_items < 0);
 			isolated++;
@@ -221,8 +237,10 @@ int __list_lru_init(struct list_lru *lru)
 	int i;
 
 	nodes_clear(lru->active_nodes);
-	for (i = 0; i < MAX_NUMNODES; i++)
+	for (i = 0; i < MAX_NUMNODES; i++) {
 		list_lru_init_one(&lru->node[i]);
+		atomic_long_set(&lru->node_totals[i], 0);
+	}
 
 	return 0;
 }
@@ -262,6 +280,35 @@ out:
 	return ret;
 }
 
+struct list_lru_node *
+lru_node_of_index(struct list_lru *lru, int index, int nid)
+{
+	struct list_lru_node *nlru;
+
+	if (index < 0)
+		return &lru->node[nid];
+
+	if (!lru_memcg_is_assigned(lru))
+		return NULL;
+
+	/*
+	 * because we will only ever free the memcg_lrus after synchronize_rcu,
+	 * we are safe with the rcu lock here: even if we are operating in the
+	 * stale version of the array, the data is still valid and we are not
+	 * risking anything.
+	 *
+	 * The read barrier is needed to make sure that we see the pointer
+	 * assigment for the specific memcg
+	 */
+	rcu_read_lock();
+	rmb();
+	nlru = &lru->memcg_lrus[index]->node[nid];
+	rcu_read_unlock();
+	return nlru;
+
+	BUG(); /* index != -1 with !MEMCG_KMEM. Impossible */
+}
+
 void list_lru_destroy(struct list_lru *lru)
 {
 	if (!lru->memcg_lrus)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b9e1941..bfb4b5b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3319,6 +3319,36 @@ static inline void memcg_resume_kmem_account(void)
 	current->memcg_kmem_skip_account--;
 }
 
+static struct mem_cgroup *mem_cgroup_from_kmem_page(struct page *page)
+{
+	struct page_cgroup *pc;
+	struct mem_cgroup *memcg = NULL;
+
+	pc = lookup_page_cgroup(page);
+	if (!PageCgroupUsed(pc))
+		return NULL;
+
+	lock_page_cgroup(pc);
+	if (PageCgroupUsed(pc))
+		memcg = pc->mem_cgroup;
+	unlock_page_cgroup(pc);
+	return memcg;
+}
+
+struct list_lru_node *
+memcg_kmem_lru_of_page(struct list_lru *lru, struct page *page)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_kmem_page(page);
+	int nid = page_to_nid(page);
+	int memcg_id;
+
+	if (!memcg_kmem_enabled())
+		return &lru->node[nid];
+
+	memcg_id = memcg_cache_id(memcg);
+	return lru_node_of_index(lru, memcg_id, nid);
+}
+
 static void kmem_cache_destroy_work_func(struct work_struct *w)
 {
 	struct kmem_cache *cachep;
-- 
1.8.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 3/7] lru: add an element to a memcg list
@ 2013-02-08 13:07   ` Glauber Costa
  0 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-08 13:07 UTC (permalink / raw)
  To: linux-mm
  Cc: cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel, Glauber Costa,
	Dave Chinner, Mel Gorman, Rik van Riel, Hugh Dickins

With the infrastructure we now have, we can add an element to a memcg
LRU list instead of the global list. The memcg lists are still
per-node.

Technically, we will never trigger per-node shrinking in the memcg is
short of memory. Therefore an alternative to this would be to add the
element to *both* a single-node memcg array and a per-node global array.

There are two main reasons for this design choice:

1) adding an extra list_head to each of the objects would waste 16-bytes
per object, always remembering that we are talking about 1 dentry + 1
inode in the common case. This means a close to 10 % increase in the
dentry size, and a lower yet significant increase in the inode size. In
terms of total memory, this design pays 32-byte per-superblock-per-node
(size of struct list_lru_node), which means that in any scenario where
we have more than 10 dentries + inodes, we would already be paying more
memory in the two-list-heads approach than we will here with 1 node x 10
superblocks. The turning point of course depends on the workload, but I
hope the figures above would convince you that the memory footprint is
in my side in any workload that matters.

2) The main drawback of this, namely, that we loose global LRU order, is
not really seen by me as a disadvantage: if we are using memcg to
isolate the workloads, global pressure should try to balance the amount
reclaimed from all memcgs the same way the shrinkers will already
naturally balance the amount reclaimed from each superblock. (This
patchset needs some love in this regard, btw).

To help us easily tracking down which nodes have and which nodes doesn't
have elements in the list, we will count on an auxiliary node bitmap in
the global level.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/list_lru.h   | 10 ++++++++
 include/linux/memcontrol.h | 10 ++++++++
 lib/list_lru.c             | 63 ++++++++++++++++++++++++++++++++++++++++------
 mm/memcontrol.c            | 30 ++++++++++++++++++++++
 4 files changed, 105 insertions(+), 8 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 370b989..5d8f7ab 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -23,6 +23,7 @@ struct list_lru_array {
 struct list_lru {
 	struct list_head	lrus;
 	struct list_lru_node	node[MAX_NUMNODES];
+	atomic_long_t		node_totals[MAX_NUMNODES];
 	nodemask_t		active_nodes;
 #ifdef CONFIG_MEMCG_KMEM
 	struct list_lru_array	**memcg_lrus;
@@ -53,6 +54,8 @@ struct list_lru_array *lru_alloc_array(void);
 int memcg_update_all_lrus(unsigned long num);
 void list_lru_destroy(struct list_lru *lru);
 void list_lru_destroy_memcg(struct mem_cgroup *memcg);
+struct list_lru_node *
+lru_node_of_index(struct list_lru *lru, int index, int nid);
 #else
 static inline void lru_memcg_enable(struct list_lru *lru)
 {
@@ -66,6 +69,13 @@ static inline bool lru_memcg_is_assigned(struct list_lru *lru)
 static inline void list_lru_destroy(struct list_lru *lru)
 {
 }
+
+static inline struct list_lru_node *
+lru_node_of_index(struct list_lru *lru, int index, int nid);
+{
+	BUG_ON(index < 0); /* index != -1 with !MEMCG_KMEM. Impossible */
+	return &lru->node[nid];
+}
 #endif
 
 int list_lru_init(struct list_lru *lru);
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f9558d0..3510730 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -24,6 +24,7 @@
 #include <linux/hardirq.h>
 #include <linux/jump_label.h>
 #include <linux/list_lru.h>
+#include <linux/mm.h>
 
 struct mem_cgroup;
 struct page_cgroup;
@@ -478,6 +479,9 @@ __memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);
 
 int memcg_new_lru(struct list_lru *lru);
 
+struct list_lru_node *
+memcg_kmem_lru_of_page(struct list_lru *lru, struct page *page);
+
 int memcg_kmem_update_lru_size(struct list_lru *lru, int num_groups,
 			       bool new_lru);
 
@@ -644,6 +648,12 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
 static inline void kmem_cache_destroy_memcg_children(struct kmem_cache *s)
 {
 }
+
+static inline struct list_lru_node *
+memcg_kmem_lru_of_page(struct list_lru *lru, struct page *page)
+{
+	return &lru->node[page_to_nid(page)];
+}
 #endif /* CONFIG_MEMCG_KMEM */
 #endif /* _LINUX_MEMCONTROL_H */
 
diff --git a/lib/list_lru.c b/lib/list_lru.c
index 3b0e89d..8c36579 100644
--- a/lib/list_lru.c
+++ b/lib/list_lru.c
@@ -15,14 +15,22 @@ list_lru_add(
 	struct list_lru	*lru,
 	struct list_head *item)
 {
-	int nid = page_to_nid(virt_to_page(item));
-	struct list_lru_node *nlru = &lru->node[nid];
+	struct page *page = virt_to_page(item);
+	struct list_lru_node *nlru;
+	int nid = page_to_nid(page);
+
+	nlru = memcg_kmem_lru_of_page(lru, page);
 
 	spin_lock(&nlru->lock);
 	BUG_ON(nlru->nr_items < 0);
 	if (list_empty(item)) {
 		list_add_tail(item, &nlru->list);
-		if (nlru->nr_items++ == 0)
+		nlru->nr_items++;
+		/*
+		 * We only consider a node active or inactive based on the
+		 * total figure for all involved children.
+		 */
+		if (atomic_long_inc_and_test(&lru->node_totals[nid]))
 			node_set(nid, lru->active_nodes);
 		spin_unlock(&nlru->lock);
 		return 1;
@@ -37,14 +45,20 @@ list_lru_del(
 	struct list_lru	*lru,
 	struct list_head *item)
 {
-	int nid = page_to_nid(virt_to_page(item));
-	struct list_lru_node *nlru = &lru->node[nid];
+	struct page *page = virt_to_page(item);
+	struct list_lru_node *nlru;
+	int nid = page_to_nid(page);
+
+	nlru = memcg_kmem_lru_of_page(lru, page);
 
 	spin_lock(&nlru->lock);
 	if (!list_empty(item)) {
 		list_del_init(item);
-		if (--nlru->nr_items == 0)
+		nlru->nr_items--;
+
+		if (atomic_long_dec_and_test(&lru->node_totals[nid]))
 			node_clear(nid, lru->active_nodes);
+
 		BUG_ON(nlru->nr_items < 0);
 		spin_unlock(&nlru->lock);
 		return 1;
@@ -97,7 +111,9 @@ restart:
 		ret = isolate(item, &nlru->lock, cb_arg);
 		switch (ret) {
 		case 0:	/* item removed from list */
-			if (--nlru->nr_items == 0)
+			nlru->nr_items--;
+
+			if (atomic_long_dec_and_test(&lru->node_totals[nid]))
 				node_clear(nid, lru->active_nodes);
 			BUG_ON(nlru->nr_items < 0);
 			isolated++;
@@ -221,8 +237,10 @@ int __list_lru_init(struct list_lru *lru)
 	int i;
 
 	nodes_clear(lru->active_nodes);
-	for (i = 0; i < MAX_NUMNODES; i++)
+	for (i = 0; i < MAX_NUMNODES; i++) {
 		list_lru_init_one(&lru->node[i]);
+		atomic_long_set(&lru->node_totals[i], 0);
+	}
 
 	return 0;
 }
@@ -262,6 +280,35 @@ out:
 	return ret;
 }
 
+struct list_lru_node *
+lru_node_of_index(struct list_lru *lru, int index, int nid)
+{
+	struct list_lru_node *nlru;
+
+	if (index < 0)
+		return &lru->node[nid];
+
+	if (!lru_memcg_is_assigned(lru))
+		return NULL;
+
+	/*
+	 * because we will only ever free the memcg_lrus after synchronize_rcu,
+	 * we are safe with the rcu lock here: even if we are operating in the
+	 * stale version of the array, the data is still valid and we are not
+	 * risking anything.
+	 *
+	 * The read barrier is needed to make sure that we see the pointer
+	 * assigment for the specific memcg
+	 */
+	rcu_read_lock();
+	rmb();
+	nlru = &lru->memcg_lrus[index]->node[nid];
+	rcu_read_unlock();
+	return nlru;
+
+	BUG(); /* index != -1 with !MEMCG_KMEM. Impossible */
+}
+
 void list_lru_destroy(struct list_lru *lru)
 {
 	if (!lru->memcg_lrus)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b9e1941..bfb4b5b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3319,6 +3319,36 @@ static inline void memcg_resume_kmem_account(void)
 	current->memcg_kmem_skip_account--;
 }
 
+static struct mem_cgroup *mem_cgroup_from_kmem_page(struct page *page)
+{
+	struct page_cgroup *pc;
+	struct mem_cgroup *memcg = NULL;
+
+	pc = lookup_page_cgroup(page);
+	if (!PageCgroupUsed(pc))
+		return NULL;
+
+	lock_page_cgroup(pc);
+	if (PageCgroupUsed(pc))
+		memcg = pc->mem_cgroup;
+	unlock_page_cgroup(pc);
+	return memcg;
+}
+
+struct list_lru_node *
+memcg_kmem_lru_of_page(struct list_lru *lru, struct page *page)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_kmem_page(page);
+	int nid = page_to_nid(page);
+	int memcg_id;
+
+	if (!memcg_kmem_enabled())
+		return &lru->node[nid];
+
+	memcg_id = memcg_cache_id(memcg);
+	return lru_node_of_index(lru, memcg_id, nid);
+}
+
 static void kmem_cache_destroy_work_func(struct work_struct *w)
 {
 	struct kmem_cache *cachep;
-- 
1.8.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 3/7] lru: add an element to a memcg list
@ 2013-02-08 13:07   ` Glauber Costa
  0 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-08 13:07 UTC (permalink / raw)
  To: linux-mm
  Cc: cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel, Glauber Costa,
	Dave Chinner, Mel Gorman, Rik van Riel, Hugh Dickins

With the infrastructure we now have, we can add an element to a memcg
LRU list instead of the global list. The memcg lists are still
per-node.

Technically, we will never trigger per-node shrinking in the memcg is
short of memory. Therefore an alternative to this would be to add the
element to *both* a single-node memcg array and a per-node global array.

There are two main reasons for this design choice:

1) adding an extra list_head to each of the objects would waste 16-bytes
per object, always remembering that we are talking about 1 dentry + 1
inode in the common case. This means a close to 10 % increase in the
dentry size, and a lower yet significant increase in the inode size. In
terms of total memory, this design pays 32-byte per-superblock-per-node
(size of struct list_lru_node), which means that in any scenario where
we have more than 10 dentries + inodes, we would already be paying more
memory in the two-list-heads approach than we will here with 1 node x 10
superblocks. The turning point of course depends on the workload, but I
hope the figures above would convince you that the memory footprint is
in my side in any workload that matters.

2) The main drawback of this, namely, that we loose global LRU order, is
not really seen by me as a disadvantage: if we are using memcg to
isolate the workloads, global pressure should try to balance the amount
reclaimed from all memcgs the same way the shrinkers will already
naturally balance the amount reclaimed from each superblock. (This
patchset needs some love in this regard, btw).

To help us easily tracking down which nodes have and which nodes doesn't
have elements in the list, we will count on an auxiliary node bitmap in
the global level.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/list_lru.h   | 10 ++++++++
 include/linux/memcontrol.h | 10 ++++++++
 lib/list_lru.c             | 63 ++++++++++++++++++++++++++++++++++++++++------
 mm/memcontrol.c            | 30 ++++++++++++++++++++++
 4 files changed, 105 insertions(+), 8 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 370b989..5d8f7ab 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -23,6 +23,7 @@ struct list_lru_array {
 struct list_lru {
 	struct list_head	lrus;
 	struct list_lru_node	node[MAX_NUMNODES];
+	atomic_long_t		node_totals[MAX_NUMNODES];
 	nodemask_t		active_nodes;
 #ifdef CONFIG_MEMCG_KMEM
 	struct list_lru_array	**memcg_lrus;
@@ -53,6 +54,8 @@ struct list_lru_array *lru_alloc_array(void);
 int memcg_update_all_lrus(unsigned long num);
 void list_lru_destroy(struct list_lru *lru);
 void list_lru_destroy_memcg(struct mem_cgroup *memcg);
+struct list_lru_node *
+lru_node_of_index(struct list_lru *lru, int index, int nid);
 #else
 static inline void lru_memcg_enable(struct list_lru *lru)
 {
@@ -66,6 +69,13 @@ static inline bool lru_memcg_is_assigned(struct list_lru *lru)
 static inline void list_lru_destroy(struct list_lru *lru)
 {
 }
+
+static inline struct list_lru_node *
+lru_node_of_index(struct list_lru *lru, int index, int nid);
+{
+	BUG_ON(index < 0); /* index != -1 with !MEMCG_KMEM. Impossible */
+	return &lru->node[nid];
+}
 #endif
 
 int list_lru_init(struct list_lru *lru);
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f9558d0..3510730 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -24,6 +24,7 @@
 #include <linux/hardirq.h>
 #include <linux/jump_label.h>
 #include <linux/list_lru.h>
+#include <linux/mm.h>
 
 struct mem_cgroup;
 struct page_cgroup;
@@ -478,6 +479,9 @@ __memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);
 
 int memcg_new_lru(struct list_lru *lru);
 
+struct list_lru_node *
+memcg_kmem_lru_of_page(struct list_lru *lru, struct page *page);
+
 int memcg_kmem_update_lru_size(struct list_lru *lru, int num_groups,
 			       bool new_lru);
 
@@ -644,6 +648,12 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
 static inline void kmem_cache_destroy_memcg_children(struct kmem_cache *s)
 {
 }
+
+static inline struct list_lru_node *
+memcg_kmem_lru_of_page(struct list_lru *lru, struct page *page)
+{
+	return &lru->node[page_to_nid(page)];
+}
 #endif /* CONFIG_MEMCG_KMEM */
 #endif /* _LINUX_MEMCONTROL_H */
 
diff --git a/lib/list_lru.c b/lib/list_lru.c
index 3b0e89d..8c36579 100644
--- a/lib/list_lru.c
+++ b/lib/list_lru.c
@@ -15,14 +15,22 @@ list_lru_add(
 	struct list_lru	*lru,
 	struct list_head *item)
 {
-	int nid = page_to_nid(virt_to_page(item));
-	struct list_lru_node *nlru = &lru->node[nid];
+	struct page *page = virt_to_page(item);
+	struct list_lru_node *nlru;
+	int nid = page_to_nid(page);
+
+	nlru = memcg_kmem_lru_of_page(lru, page);
 
 	spin_lock(&nlru->lock);
 	BUG_ON(nlru->nr_items < 0);
 	if (list_empty(item)) {
 		list_add_tail(item, &nlru->list);
-		if (nlru->nr_items++ == 0)
+		nlru->nr_items++;
+		/*
+		 * We only consider a node active or inactive based on the
+		 * total figure for all involved children.
+		 */
+		if (atomic_long_inc_and_test(&lru->node_totals[nid]))
 			node_set(nid, lru->active_nodes);
 		spin_unlock(&nlru->lock);
 		return 1;
@@ -37,14 +45,20 @@ list_lru_del(
 	struct list_lru	*lru,
 	struct list_head *item)
 {
-	int nid = page_to_nid(virt_to_page(item));
-	struct list_lru_node *nlru = &lru->node[nid];
+	struct page *page = virt_to_page(item);
+	struct list_lru_node *nlru;
+	int nid = page_to_nid(page);
+
+	nlru = memcg_kmem_lru_of_page(lru, page);
 
 	spin_lock(&nlru->lock);
 	if (!list_empty(item)) {
 		list_del_init(item);
-		if (--nlru->nr_items == 0)
+		nlru->nr_items--;
+
+		if (atomic_long_dec_and_test(&lru->node_totals[nid]))
 			node_clear(nid, lru->active_nodes);
+
 		BUG_ON(nlru->nr_items < 0);
 		spin_unlock(&nlru->lock);
 		return 1;
@@ -97,7 +111,9 @@ restart:
 		ret = isolate(item, &nlru->lock, cb_arg);
 		switch (ret) {
 		case 0:	/* item removed from list */
-			if (--nlru->nr_items == 0)
+			nlru->nr_items--;
+
+			if (atomic_long_dec_and_test(&lru->node_totals[nid]))
 				node_clear(nid, lru->active_nodes);
 			BUG_ON(nlru->nr_items < 0);
 			isolated++;
@@ -221,8 +237,10 @@ int __list_lru_init(struct list_lru *lru)
 	int i;
 
 	nodes_clear(lru->active_nodes);
-	for (i = 0; i < MAX_NUMNODES; i++)
+	for (i = 0; i < MAX_NUMNODES; i++) {
 		list_lru_init_one(&lru->node[i]);
+		atomic_long_set(&lru->node_totals[i], 0);
+	}
 
 	return 0;
 }
@@ -262,6 +280,35 @@ out:
 	return ret;
 }
 
+struct list_lru_node *
+lru_node_of_index(struct list_lru *lru, int index, int nid)
+{
+	struct list_lru_node *nlru;
+
+	if (index < 0)
+		return &lru->node[nid];
+
+	if (!lru_memcg_is_assigned(lru))
+		return NULL;
+
+	/*
+	 * because we will only ever free the memcg_lrus after synchronize_rcu,
+	 * we are safe with the rcu lock here: even if we are operating in the
+	 * stale version of the array, the data is still valid and we are not
+	 * risking anything.
+	 *
+	 * The read barrier is needed to make sure that we see the pointer
+	 * assigment for the specific memcg
+	 */
+	rcu_read_lock();
+	rmb();
+	nlru = &lru->memcg_lrus[index]->node[nid];
+	rcu_read_unlock();
+	return nlru;
+
+	BUG(); /* index != -1 with !MEMCG_KMEM. Impossible */
+}
+
 void list_lru_destroy(struct list_lru *lru)
 {
 	if (!lru->memcg_lrus)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b9e1941..bfb4b5b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3319,6 +3319,36 @@ static inline void memcg_resume_kmem_account(void)
 	current->memcg_kmem_skip_account--;
 }
 
+static struct mem_cgroup *mem_cgroup_from_kmem_page(struct page *page)
+{
+	struct page_cgroup *pc;
+	struct mem_cgroup *memcg = NULL;
+
+	pc = lookup_page_cgroup(page);
+	if (!PageCgroupUsed(pc))
+		return NULL;
+
+	lock_page_cgroup(pc);
+	if (PageCgroupUsed(pc))
+		memcg = pc->mem_cgroup;
+	unlock_page_cgroup(pc);
+	return memcg;
+}
+
+struct list_lru_node *
+memcg_kmem_lru_of_page(struct list_lru *lru, struct page *page)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_kmem_page(page);
+	int nid = page_to_nid(page);
+	int memcg_id;
+
+	if (!memcg_kmem_enabled())
+		return &lru->node[nid];
+
+	memcg_id = memcg_cache_id(memcg);
+	return lru_node_of_index(lru, memcg_id, nid);
+}
+
 static void kmem_cache_destroy_work_func(struct work_struct *w)
 {
 	struct kmem_cache *cachep;
-- 
1.8.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 4/7] list_lru: also include memcg lists in counts and scans
  2013-02-08 13:07 ` Glauber Costa
  (?)
@ 2013-02-08 13:07   ` Glauber Costa
  -1 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-08 13:07 UTC (permalink / raw)
  To: linux-mm
  Cc: cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel, Glauber Costa,
	Dave Chinner, Mel Gorman, Rik van Riel, Hugh Dickins

As elements are added to per-memcg lists, they will be invisible to
global reclaimers. This patch mainly modifies list_lru walk and count
functions to take that into account.

Counting is very simple: since we already have total figures for the
node, which we use to figure out when to set or clear the node in the
bitmap, we can just use that.

For walking, we need to walk the memcg lists as well as the global list.
To achieve that, this patch introduces the helper macro
for_each_memcg_lru_index. Locking semantics are simple, since
introducing a new LRU in the list does not influence the memcg walkers.

The only operation we race against is memcg creation and teardown.  For
those, barriers should be enough to guarantee that we are seeing
up-to-date information and not accessing invalid pointers.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/memcontrol.h |  2 ++
 lib/list_lru.c             | 90 ++++++++++++++++++++++++++++++++++------------
 2 files changed, 69 insertions(+), 23 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3510730..e723202 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -598,6 +598,8 @@ static inline bool memcg_kmem_is_active(struct mem_cgroup *memcg)
 #define for_each_memcg_cache_index(_idx)	\
 	for (; NULL; )
 
+#define memcg_limited_groups_array_size 0
+
 static inline bool memcg_kmem_enabled(void)
 {
 	return false;
diff --git a/lib/list_lru.c b/lib/list_lru.c
index 8c36579..1d16404 100644
--- a/lib/list_lru.c
+++ b/lib/list_lru.c
@@ -10,6 +10,23 @@
 #include <linux/list_lru.h>
 #include <linux/memcontrol.h>
 
+/*
+ * This helper will loop through all node-data in the LRU, either global or
+ * per-memcg.  If memcg is either not present or not used,
+ * memcg_limited_groups_array_size will be 0. _idx starts at -1, and it will
+ * still be allowed to execute once.
+ *
+ * We convention that for _idx = -1, the global node info should be used.
+ * After that, we will go through each of the memcgs, starting at 0.
+ *
+ * We don't need any kind of locking for the loop because
+ * memcg_limited_groups_array_size can only grow, gaining new fields at the
+ * end. The old ones are just copied, and any interesting manipulation happen
+ * in the node list itself, and we already lock the list.
+ */
+#define for_each_memcg_lru_index(lru, _idx, _node)		\
+	for ((_idx) = -1; (_idx) < memcg_limited_groups_array_size; (_idx)++)
+
 int
 list_lru_add(
 	struct list_lru	*lru,
@@ -77,12 +94,12 @@ list_lru_count_nodemask(
 	int nid;
 
 	for_each_node_mask(nid, *nodes_to_count) {
-		struct list_lru_node *nlru = &lru->node[nid];
-
-		spin_lock(&nlru->lock);
-		BUG_ON(nlru->nr_items < 0);
-		count += nlru->nr_items;
-		spin_unlock(&nlru->lock);
+		/*
+		 * We don't need to loop through all memcgs here, because we
+		 * have the node_totals information for the node. If we hadn't,
+		 * this would still be achieavable by a loop-over-all-groups
+		 */
+		count += atomic_long_read(&lru->node_totals[nid]);
 	}
 
 	return count;
@@ -92,12 +109,12 @@ EXPORT_SYMBOL_GPL(list_lru_count_nodemask);
 static long
 list_lru_walk_node(
 	struct list_lru		*lru,
+	struct list_lru_node	*nlru,
 	int			nid,
 	list_lru_walk_cb	isolate,
 	void			*cb_arg,
 	long			*nr_to_walk)
 {
-	struct list_lru_node	*nlru = &lru->node[nid];
 	struct list_head *item, *n;
 	long isolated = 0;
 restart:
@@ -143,12 +160,28 @@ list_lru_walk_nodemask(
 {
 	long isolated = 0;
 	int nid;
+	nodemask_t nodes;
+	int idx;
+	struct list_lru_node *nlru;
 
-	for_each_node_mask(nid, *nodes_to_walk) {
-		isolated += list_lru_walk_node(lru, nid, isolate,
-					       cb_arg, &nr_to_walk);
-		if (nr_to_walk <= 0)
-			break;
+	/*
+	 * Conservative code can call this setting nodes with node_setall.
+	 * This will generate an out of bound access for memcg.
+	 */
+	nodes_and(nodes, *nodes_to_walk, node_online_map);
+
+	for_each_node_mask(nid, nodes) {
+		for_each_memcg_lru_index(lru, idx, nid) {
+
+			nlru = lru_node_of_index(lru, idx, nid);
+			if (!nlru)
+				continue;
+
+			isolated += list_lru_walk_node(lru, nlru, nid, isolate,
+						       cb_arg, &nr_to_walk);
+			if (nr_to_walk <= 0)
+				break;
+		}
 	}
 	return isolated;
 }
@@ -160,23 +193,34 @@ list_lru_dispose_all_node(
 	int			nid,
 	list_lru_dispose_cb	dispose)
 {
-	struct list_lru_node	*nlru = &lru->node[nid];
+	struct list_lru_node *nlru;
 	LIST_HEAD(dispose_list);
 	long disposed = 0;
+	int idx;
 
-	spin_lock(&nlru->lock);
-	while (!list_empty(&nlru->list)) {
-		list_splice_init(&nlru->list, &dispose_list);
-		disposed += nlru->nr_items;
-		nlru->nr_items = 0;
-		node_clear(nid, lru->active_nodes);
-		spin_unlock(&nlru->lock);
-
-		dispose(&dispose_list);
+	for_each_memcg_lru_index(lru, idx, nid) {
+		nlru = lru_node_of_index(lru, idx, nid);
+		if (!nlru)
+			continue;
 
 		spin_lock(&nlru->lock);
+		while (!list_empty(&nlru->list)) {
+			list_splice_init(&nlru->list, &dispose_list);
+
+			if (atomic_long_sub_and_test(nlru->nr_items,
+							&lru->node_totals[nid]))
+				node_clear(nid, lru->active_nodes);
+			disposed += nlru->nr_items;
+			nlru->nr_items = 0;
+			spin_unlock(&nlru->lock);
+
+			dispose(&dispose_list);
+
+			spin_lock(&nlru->lock);
+		}
+		spin_unlock(&nlru->lock);
 	}
-	spin_unlock(&nlru->lock);
+
 	return disposed;
 }
 
-- 
1.8.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 4/7] list_lru: also include memcg lists in counts and scans
@ 2013-02-08 13:07   ` Glauber Costa
  0 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-08 13:07 UTC (permalink / raw)
  To: linux-mm
  Cc: cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel, Glauber Costa,
	Dave Chinner, Mel Gorman, Rik van Riel, Hugh Dickins

As elements are added to per-memcg lists, they will be invisible to
global reclaimers. This patch mainly modifies list_lru walk and count
functions to take that into account.

Counting is very simple: since we already have total figures for the
node, which we use to figure out when to set or clear the node in the
bitmap, we can just use that.

For walking, we need to walk the memcg lists as well as the global list.
To achieve that, this patch introduces the helper macro
for_each_memcg_lru_index. Locking semantics are simple, since
introducing a new LRU in the list does not influence the memcg walkers.

The only operation we race against is memcg creation and teardown.  For
those, barriers should be enough to guarantee that we are seeing
up-to-date information and not accessing invalid pointers.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/memcontrol.h |  2 ++
 lib/list_lru.c             | 90 ++++++++++++++++++++++++++++++++++------------
 2 files changed, 69 insertions(+), 23 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3510730..e723202 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -598,6 +598,8 @@ static inline bool memcg_kmem_is_active(struct mem_cgroup *memcg)
 #define for_each_memcg_cache_index(_idx)	\
 	for (; NULL; )
 
+#define memcg_limited_groups_array_size 0
+
 static inline bool memcg_kmem_enabled(void)
 {
 	return false;
diff --git a/lib/list_lru.c b/lib/list_lru.c
index 8c36579..1d16404 100644
--- a/lib/list_lru.c
+++ b/lib/list_lru.c
@@ -10,6 +10,23 @@
 #include <linux/list_lru.h>
 #include <linux/memcontrol.h>
 
+/*
+ * This helper will loop through all node-data in the LRU, either global or
+ * per-memcg.  If memcg is either not present or not used,
+ * memcg_limited_groups_array_size will be 0. _idx starts at -1, and it will
+ * still be allowed to execute once.
+ *
+ * We convention that for _idx = -1, the global node info should be used.
+ * After that, we will go through each of the memcgs, starting at 0.
+ *
+ * We don't need any kind of locking for the loop because
+ * memcg_limited_groups_array_size can only grow, gaining new fields at the
+ * end. The old ones are just copied, and any interesting manipulation happen
+ * in the node list itself, and we already lock the list.
+ */
+#define for_each_memcg_lru_index(lru, _idx, _node)		\
+	for ((_idx) = -1; (_idx) < memcg_limited_groups_array_size; (_idx)++)
+
 int
 list_lru_add(
 	struct list_lru	*lru,
@@ -77,12 +94,12 @@ list_lru_count_nodemask(
 	int nid;
 
 	for_each_node_mask(nid, *nodes_to_count) {
-		struct list_lru_node *nlru = &lru->node[nid];
-
-		spin_lock(&nlru->lock);
-		BUG_ON(nlru->nr_items < 0);
-		count += nlru->nr_items;
-		spin_unlock(&nlru->lock);
+		/*
+		 * We don't need to loop through all memcgs here, because we
+		 * have the node_totals information for the node. If we hadn't,
+		 * this would still be achieavable by a loop-over-all-groups
+		 */
+		count += atomic_long_read(&lru->node_totals[nid]);
 	}
 
 	return count;
@@ -92,12 +109,12 @@ EXPORT_SYMBOL_GPL(list_lru_count_nodemask);
 static long
 list_lru_walk_node(
 	struct list_lru		*lru,
+	struct list_lru_node	*nlru,
 	int			nid,
 	list_lru_walk_cb	isolate,
 	void			*cb_arg,
 	long			*nr_to_walk)
 {
-	struct list_lru_node	*nlru = &lru->node[nid];
 	struct list_head *item, *n;
 	long isolated = 0;
 restart:
@@ -143,12 +160,28 @@ list_lru_walk_nodemask(
 {
 	long isolated = 0;
 	int nid;
+	nodemask_t nodes;
+	int idx;
+	struct list_lru_node *nlru;
 
-	for_each_node_mask(nid, *nodes_to_walk) {
-		isolated += list_lru_walk_node(lru, nid, isolate,
-					       cb_arg, &nr_to_walk);
-		if (nr_to_walk <= 0)
-			break;
+	/*
+	 * Conservative code can call this setting nodes with node_setall.
+	 * This will generate an out of bound access for memcg.
+	 */
+	nodes_and(nodes, *nodes_to_walk, node_online_map);
+
+	for_each_node_mask(nid, nodes) {
+		for_each_memcg_lru_index(lru, idx, nid) {
+
+			nlru = lru_node_of_index(lru, idx, nid);
+			if (!nlru)
+				continue;
+
+			isolated += list_lru_walk_node(lru, nlru, nid, isolate,
+						       cb_arg, &nr_to_walk);
+			if (nr_to_walk <= 0)
+				break;
+		}
 	}
 	return isolated;
 }
@@ -160,23 +193,34 @@ list_lru_dispose_all_node(
 	int			nid,
 	list_lru_dispose_cb	dispose)
 {
-	struct list_lru_node	*nlru = &lru->node[nid];
+	struct list_lru_node *nlru;
 	LIST_HEAD(dispose_list);
 	long disposed = 0;
+	int idx;
 
-	spin_lock(&nlru->lock);
-	while (!list_empty(&nlru->list)) {
-		list_splice_init(&nlru->list, &dispose_list);
-		disposed += nlru->nr_items;
-		nlru->nr_items = 0;
-		node_clear(nid, lru->active_nodes);
-		spin_unlock(&nlru->lock);
-
-		dispose(&dispose_list);
+	for_each_memcg_lru_index(lru, idx, nid) {
+		nlru = lru_node_of_index(lru, idx, nid);
+		if (!nlru)
+			continue;
 
 		spin_lock(&nlru->lock);
+		while (!list_empty(&nlru->list)) {
+			list_splice_init(&nlru->list, &dispose_list);
+
+			if (atomic_long_sub_and_test(nlru->nr_items,
+							&lru->node_totals[nid]))
+				node_clear(nid, lru->active_nodes);
+			disposed += nlru->nr_items;
+			nlru->nr_items = 0;
+			spin_unlock(&nlru->lock);
+
+			dispose(&dispose_list);
+
+			spin_lock(&nlru->lock);
+		}
+		spin_unlock(&nlru->lock);
 	}
-	spin_unlock(&nlru->lock);
+
 	return disposed;
 }
 
-- 
1.8.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 4/7] list_lru: also include memcg lists in counts and scans
@ 2013-02-08 13:07   ` Glauber Costa
  0 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-08 13:07 UTC (permalink / raw)
  To: linux-mm
  Cc: cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel, Glauber Costa,
	Dave Chinner, Mel Gorman, Rik van Riel, Hugh Dickins

As elements are added to per-memcg lists, they will be invisible to
global reclaimers. This patch mainly modifies list_lru walk and count
functions to take that into account.

Counting is very simple: since we already have total figures for the
node, which we use to figure out when to set or clear the node in the
bitmap, we can just use that.

For walking, we need to walk the memcg lists as well as the global list.
To achieve that, this patch introduces the helper macro
for_each_memcg_lru_index. Locking semantics are simple, since
introducing a new LRU in the list does not influence the memcg walkers.

The only operation we race against is memcg creation and teardown.  For
those, barriers should be enough to guarantee that we are seeing
up-to-date information and not accessing invalid pointers.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/memcontrol.h |  2 ++
 lib/list_lru.c             | 90 ++++++++++++++++++++++++++++++++++------------
 2 files changed, 69 insertions(+), 23 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3510730..e723202 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -598,6 +598,8 @@ static inline bool memcg_kmem_is_active(struct mem_cgroup *memcg)
 #define for_each_memcg_cache_index(_idx)	\
 	for (; NULL; )
 
+#define memcg_limited_groups_array_size 0
+
 static inline bool memcg_kmem_enabled(void)
 {
 	return false;
diff --git a/lib/list_lru.c b/lib/list_lru.c
index 8c36579..1d16404 100644
--- a/lib/list_lru.c
+++ b/lib/list_lru.c
@@ -10,6 +10,23 @@
 #include <linux/list_lru.h>
 #include <linux/memcontrol.h>
 
+/*
+ * This helper will loop through all node-data in the LRU, either global or
+ * per-memcg.  If memcg is either not present or not used,
+ * memcg_limited_groups_array_size will be 0. _idx starts at -1, and it will
+ * still be allowed to execute once.
+ *
+ * We convention that for _idx = -1, the global node info should be used.
+ * After that, we will go through each of the memcgs, starting at 0.
+ *
+ * We don't need any kind of locking for the loop because
+ * memcg_limited_groups_array_size can only grow, gaining new fields at the
+ * end. The old ones are just copied, and any interesting manipulation happen
+ * in the node list itself, and we already lock the list.
+ */
+#define for_each_memcg_lru_index(lru, _idx, _node)		\
+	for ((_idx) = -1; (_idx) < memcg_limited_groups_array_size; (_idx)++)
+
 int
 list_lru_add(
 	struct list_lru	*lru,
@@ -77,12 +94,12 @@ list_lru_count_nodemask(
 	int nid;
 
 	for_each_node_mask(nid, *nodes_to_count) {
-		struct list_lru_node *nlru = &lru->node[nid];
-
-		spin_lock(&nlru->lock);
-		BUG_ON(nlru->nr_items < 0);
-		count += nlru->nr_items;
-		spin_unlock(&nlru->lock);
+		/*
+		 * We don't need to loop through all memcgs here, because we
+		 * have the node_totals information for the node. If we hadn't,
+		 * this would still be achieavable by a loop-over-all-groups
+		 */
+		count += atomic_long_read(&lru->node_totals[nid]);
 	}
 
 	return count;
@@ -92,12 +109,12 @@ EXPORT_SYMBOL_GPL(list_lru_count_nodemask);
 static long
 list_lru_walk_node(
 	struct list_lru		*lru,
+	struct list_lru_node	*nlru,
 	int			nid,
 	list_lru_walk_cb	isolate,
 	void			*cb_arg,
 	long			*nr_to_walk)
 {
-	struct list_lru_node	*nlru = &lru->node[nid];
 	struct list_head *item, *n;
 	long isolated = 0;
 restart:
@@ -143,12 +160,28 @@ list_lru_walk_nodemask(
 {
 	long isolated = 0;
 	int nid;
+	nodemask_t nodes;
+	int idx;
+	struct list_lru_node *nlru;
 
-	for_each_node_mask(nid, *nodes_to_walk) {
-		isolated += list_lru_walk_node(lru, nid, isolate,
-					       cb_arg, &nr_to_walk);
-		if (nr_to_walk <= 0)
-			break;
+	/*
+	 * Conservative code can call this setting nodes with node_setall.
+	 * This will generate an out of bound access for memcg.
+	 */
+	nodes_and(nodes, *nodes_to_walk, node_online_map);
+
+	for_each_node_mask(nid, nodes) {
+		for_each_memcg_lru_index(lru, idx, nid) {
+
+			nlru = lru_node_of_index(lru, idx, nid);
+			if (!nlru)
+				continue;
+
+			isolated += list_lru_walk_node(lru, nlru, nid, isolate,
+						       cb_arg, &nr_to_walk);
+			if (nr_to_walk <= 0)
+				break;
+		}
 	}
 	return isolated;
 }
@@ -160,23 +193,34 @@ list_lru_dispose_all_node(
 	int			nid,
 	list_lru_dispose_cb	dispose)
 {
-	struct list_lru_node	*nlru = &lru->node[nid];
+	struct list_lru_node *nlru;
 	LIST_HEAD(dispose_list);
 	long disposed = 0;
+	int idx;
 
-	spin_lock(&nlru->lock);
-	while (!list_empty(&nlru->list)) {
-		list_splice_init(&nlru->list, &dispose_list);
-		disposed += nlru->nr_items;
-		nlru->nr_items = 0;
-		node_clear(nid, lru->active_nodes);
-		spin_unlock(&nlru->lock);
-
-		dispose(&dispose_list);
+	for_each_memcg_lru_index(lru, idx, nid) {
+		nlru = lru_node_of_index(lru, idx, nid);
+		if (!nlru)
+			continue;
 
 		spin_lock(&nlru->lock);
+		while (!list_empty(&nlru->list)) {
+			list_splice_init(&nlru->list, &dispose_list);
+
+			if (atomic_long_sub_and_test(nlru->nr_items,
+							&lru->node_totals[nid]))
+				node_clear(nid, lru->active_nodes);
+			disposed += nlru->nr_items;
+			nlru->nr_items = 0;
+			spin_unlock(&nlru->lock);
+
+			dispose(&dispose_list);
+
+			spin_lock(&nlru->lock);
+		}
+		spin_unlock(&nlru->lock);
 	}
-	spin_unlock(&nlru->lock);
+
 	return disposed;
 }
 
-- 
1.8.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 5/7] list_lru: per-memcg walks
  2013-02-08 13:07 ` Glauber Costa
  (?)
@ 2013-02-08 13:07   ` Glauber Costa
  -1 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-08 13:07 UTC (permalink / raw)
  To: linux-mm
  Cc: cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel, Glauber Costa,
	Dave Chinner, Mel Gorman, Rik van Riel, Hugh Dickins

This patch extend the list_lru interfaces to allow for a memcg
parameter. Because most of its users won't need it, instead of
modifying the function signatures we create a new set of _memcg()
functions and write the old API ontop of that.

At this point, the infrastructure is mostly in place. We already walk
the nodes using all memcg indexes, so we just need to make sure we skip
all but the one we're interested in. We could just go directly to the
memcg of interest, but I am assuming that given the gained simplicity,
spending a few cycles here won't hurt *that* much (but that can be
improved if needed, of course).

Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/list_lru.h | 24 ++++++++++++++++++++----
 lib/list_lru.c           | 41 ++++++++++++++++++++++++++++++++---------
 2 files changed, 52 insertions(+), 13 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 5d8f7ab..c7e6115 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -81,20 +81,36 @@ lru_node_of_index(struct list_lru *lru, int index, int nid);
 int list_lru_init(struct list_lru *lru);
 int list_lru_add(struct list_lru *lru, struct list_head *item);
 int list_lru_del(struct list_lru *lru, struct list_head *item);
-long list_lru_count_nodemask(struct list_lru *lru, nodemask_t *nodes_to_count);
+
+long list_lru_count_nodemask_memcg(struct list_lru *lru,
+			nodemask_t *nodes_to_count, struct mem_cgroup *memcg);
+
+static inline long
+list_lru_count_nodemask(struct list_lru *lru, nodemask_t *nodes_to_count)
+{
+	return list_lru_count_nodemask_memcg(lru, nodes_to_count, NULL);
+}
 
 static inline long list_lru_count(struct list_lru *lru)
 {
 	return list_lru_count_nodemask(lru, &lru->active_nodes);
 }
 
-
 typedef int (*list_lru_walk_cb)(struct list_head *item, spinlock_t *lock,
 				void *cb_arg);
 typedef void (*list_lru_dispose_cb)(struct list_head *dispose_list);
 
-long list_lru_walk_nodemask(struct list_lru *lru, list_lru_walk_cb isolate,
-		   void *cb_arg, long nr_to_walk, nodemask_t *nodes_to_walk);
+long list_lru_walk_nodemask_memcg(struct list_lru *lru,
+	list_lru_walk_cb isolate, void *cb_arg, long nr_to_walk,
+	nodemask_t *nodes_to_walk, struct mem_cgroup *memcg);
+
+static inline long list_lru_walk_nodemask(struct list_lru *lru,
+	list_lru_walk_cb isolate, void *cb_arg, long nr_to_walk,
+	nodemask_t *nodes_to_walk)
+{
+	return list_lru_walk_nodemask_memcg(lru, isolate, cb_arg, nr_to_walk,
+					    &lru->active_nodes, NULL);
+}
 
 static inline long list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
 				 void *cb_arg, long nr_to_walk)
diff --git a/lib/list_lru.c b/lib/list_lru.c
index 1d16404..e2bbde6 100644
--- a/lib/list_lru.c
+++ b/lib/list_lru.c
@@ -86,25 +86,44 @@ list_lru_del(
 EXPORT_SYMBOL_GPL(list_lru_del);
 
 long
-list_lru_count_nodemask(
+list_lru_count_nodemask_memcg(
 	struct list_lru *lru,
-	nodemask_t	*nodes_to_count)
+	nodemask_t	*nodes_to_count,
+	struct mem_cgroup *memcg)
 {
 	long count = 0;
 	int nid;
+	nodemask_t nodes;
+	struct list_lru_node *nlru;
+	int memcg_id = memcg_cache_id(memcg);
+
+	/*
+	 * Conservative code can call this setting nodes with node_setall.
+	 * This will generate an out of bound access for memcg.
+	 */
+	nodes_and(nodes, *nodes_to_count, node_online_map);
 
-	for_each_node_mask(nid, *nodes_to_count) {
+	for_each_node_mask(nid, nodes) {
 		/*
 		 * We don't need to loop through all memcgs here, because we
 		 * have the node_totals information for the node. If we hadn't,
 		 * this would still be achieavable by a loop-over-all-groups
 		 */
-		count += atomic_long_read(&lru->node_totals[nid]);
-	}
+		if (!memcg)
+			count += atomic_long_read(&lru->node_totals[nid]);
+		else {
+			nlru = lru_node_of_index(lru, memcg_id, nid);
+			WARN_ON(!nlru);
 
+			spin_lock(&nlru->lock);
+			BUG_ON(nlru->nr_items < 0);
+			count += nlru->nr_items;
+			spin_unlock(&nlru->lock);
+		}
+	}
 	return count;
 }
-EXPORT_SYMBOL_GPL(list_lru_count_nodemask);
+EXPORT_SYMBOL_GPL(list_lru_count_nodemask_memcg);
 
 static long
 list_lru_walk_node(
@@ -151,16 +170,18 @@ restart:
 }
 
 long
-list_lru_walk_nodemask(
+list_lru_walk_nodemask_memcg(
 	struct list_lru	*lru,
 	list_lru_walk_cb isolate,
 	void		*cb_arg,
 	long		nr_to_walk,
-	nodemask_t	*nodes_to_walk)
+	nodemask_t	*nodes_to_walk,
+	struct mem_cgroup *memcg)
 {
 	long isolated = 0;
 	int nid;
 	nodemask_t nodes;
+	int memcg_id = memcg_cache_id(memcg);
 	int idx;
 	struct list_lru_node *nlru;
 
@@ -172,6 +193,8 @@ list_lru_walk_nodemask(
 
 	for_each_node_mask(nid, nodes) {
 		for_each_memcg_lru_index(lru, idx, nid) {
+			if ((memcg_id >= 0) &&  (idx != memcg_id))
+				continue;
 
 			nlru = lru_node_of_index(lru, idx, nid);
 			if (!nlru)
@@ -185,7 +208,7 @@ list_lru_walk_nodemask(
 	}
 	return isolated;
 }
-EXPORT_SYMBOL_GPL(list_lru_walk_nodemask);
+EXPORT_SYMBOL_GPL(list_lru_walk_nodemask_memcg);
 
 long
 list_lru_dispose_all_node(
-- 
1.8.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 5/7] list_lru: per-memcg walks
@ 2013-02-08 13:07   ` Glauber Costa
  0 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-08 13:07 UTC (permalink / raw)
  To: linux-mm
  Cc: cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel, Glauber Costa,
	Dave Chinner, Mel Gorman, Rik van Riel, Hugh Dickins

This patch extend the list_lru interfaces to allow for a memcg
parameter. Because most of its users won't need it, instead of
modifying the function signatures we create a new set of _memcg()
functions and write the old API ontop of that.

At this point, the infrastructure is mostly in place. We already walk
the nodes using all memcg indexes, so we just need to make sure we skip
all but the one we're interested in. We could just go directly to the
memcg of interest, but I am assuming that given the gained simplicity,
spending a few cycles here won't hurt *that* much (but that can be
improved if needed, of course).

Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/list_lru.h | 24 ++++++++++++++++++++----
 lib/list_lru.c           | 41 ++++++++++++++++++++++++++++++++---------
 2 files changed, 52 insertions(+), 13 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 5d8f7ab..c7e6115 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -81,20 +81,36 @@ lru_node_of_index(struct list_lru *lru, int index, int nid);
 int list_lru_init(struct list_lru *lru);
 int list_lru_add(struct list_lru *lru, struct list_head *item);
 int list_lru_del(struct list_lru *lru, struct list_head *item);
-long list_lru_count_nodemask(struct list_lru *lru, nodemask_t *nodes_to_count);
+
+long list_lru_count_nodemask_memcg(struct list_lru *lru,
+			nodemask_t *nodes_to_count, struct mem_cgroup *memcg);
+
+static inline long
+list_lru_count_nodemask(struct list_lru *lru, nodemask_t *nodes_to_count)
+{
+	return list_lru_count_nodemask_memcg(lru, nodes_to_count, NULL);
+}
 
 static inline long list_lru_count(struct list_lru *lru)
 {
 	return list_lru_count_nodemask(lru, &lru->active_nodes);
 }
 
-
 typedef int (*list_lru_walk_cb)(struct list_head *item, spinlock_t *lock,
 				void *cb_arg);
 typedef void (*list_lru_dispose_cb)(struct list_head *dispose_list);
 
-long list_lru_walk_nodemask(struct list_lru *lru, list_lru_walk_cb isolate,
-		   void *cb_arg, long nr_to_walk, nodemask_t *nodes_to_walk);
+long list_lru_walk_nodemask_memcg(struct list_lru *lru,
+	list_lru_walk_cb isolate, void *cb_arg, long nr_to_walk,
+	nodemask_t *nodes_to_walk, struct mem_cgroup *memcg);
+
+static inline long list_lru_walk_nodemask(struct list_lru *lru,
+	list_lru_walk_cb isolate, void *cb_arg, long nr_to_walk,
+	nodemask_t *nodes_to_walk)
+{
+	return list_lru_walk_nodemask_memcg(lru, isolate, cb_arg, nr_to_walk,
+					    &lru->active_nodes, NULL);
+}
 
 static inline long list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
 				 void *cb_arg, long nr_to_walk)
diff --git a/lib/list_lru.c b/lib/list_lru.c
index 1d16404..e2bbde6 100644
--- a/lib/list_lru.c
+++ b/lib/list_lru.c
@@ -86,25 +86,44 @@ list_lru_del(
 EXPORT_SYMBOL_GPL(list_lru_del);
 
 long
-list_lru_count_nodemask(
+list_lru_count_nodemask_memcg(
 	struct list_lru *lru,
-	nodemask_t	*nodes_to_count)
+	nodemask_t	*nodes_to_count,
+	struct mem_cgroup *memcg)
 {
 	long count = 0;
 	int nid;
+	nodemask_t nodes;
+	struct list_lru_node *nlru;
+	int memcg_id = memcg_cache_id(memcg);
+
+	/*
+	 * Conservative code can call this setting nodes with node_setall.
+	 * This will generate an out of bound access for memcg.
+	 */
+	nodes_and(nodes, *nodes_to_count, node_online_map);
 
-	for_each_node_mask(nid, *nodes_to_count) {
+	for_each_node_mask(nid, nodes) {
 		/*
 		 * We don't need to loop through all memcgs here, because we
 		 * have the node_totals information for the node. If we hadn't,
 		 * this would still be achieavable by a loop-over-all-groups
 		 */
-		count += atomic_long_read(&lru->node_totals[nid]);
-	}
+		if (!memcg)
+			count += atomic_long_read(&lru->node_totals[nid]);
+		else {
+			nlru = lru_node_of_index(lru, memcg_id, nid);
+			WARN_ON(!nlru);
 
+			spin_lock(&nlru->lock);
+			BUG_ON(nlru->nr_items < 0);
+			count += nlru->nr_items;
+			spin_unlock(&nlru->lock);
+		}
+	}
 	return count;
 }
-EXPORT_SYMBOL_GPL(list_lru_count_nodemask);
+EXPORT_SYMBOL_GPL(list_lru_count_nodemask_memcg);
 
 static long
 list_lru_walk_node(
@@ -151,16 +170,18 @@ restart:
 }
 
 long
-list_lru_walk_nodemask(
+list_lru_walk_nodemask_memcg(
 	struct list_lru	*lru,
 	list_lru_walk_cb isolate,
 	void		*cb_arg,
 	long		nr_to_walk,
-	nodemask_t	*nodes_to_walk)
+	nodemask_t	*nodes_to_walk,
+	struct mem_cgroup *memcg)
 {
 	long isolated = 0;
 	int nid;
 	nodemask_t nodes;
+	int memcg_id = memcg_cache_id(memcg);
 	int idx;
 	struct list_lru_node *nlru;
 
@@ -172,6 +193,8 @@ list_lru_walk_nodemask(
 
 	for_each_node_mask(nid, nodes) {
 		for_each_memcg_lru_index(lru, idx, nid) {
+			if ((memcg_id >= 0) &&  (idx != memcg_id))
+				continue;
 
 			nlru = lru_node_of_index(lru, idx, nid);
 			if (!nlru)
@@ -185,7 +208,7 @@ list_lru_walk_nodemask(
 	}
 	return isolated;
 }
-EXPORT_SYMBOL_GPL(list_lru_walk_nodemask);
+EXPORT_SYMBOL_GPL(list_lru_walk_nodemask_memcg);
 
 long
 list_lru_dispose_all_node(
-- 
1.8.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 5/7] list_lru: per-memcg walks
@ 2013-02-08 13:07   ` Glauber Costa
  0 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-08 13:07 UTC (permalink / raw)
  To: linux-mm
  Cc: cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel, Glauber Costa,
	Dave Chinner, Mel Gorman, Rik van Riel, Hugh Dickins

This patch extend the list_lru interfaces to allow for a memcg
parameter. Because most of its users won't need it, instead of
modifying the function signatures we create a new set of _memcg()
functions and write the old API ontop of that.

At this point, the infrastructure is mostly in place. We already walk
the nodes using all memcg indexes, so we just need to make sure we skip
all but the one we're interested in. We could just go directly to the
memcg of interest, but I am assuming that given the gained simplicity,
spending a few cycles here won't hurt *that* much (but that can be
improved if needed, of course).

Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/list_lru.h | 24 ++++++++++++++++++++----
 lib/list_lru.c           | 41 ++++++++++++++++++++++++++++++++---------
 2 files changed, 52 insertions(+), 13 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 5d8f7ab..c7e6115 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -81,20 +81,36 @@ lru_node_of_index(struct list_lru *lru, int index, int nid);
 int list_lru_init(struct list_lru *lru);
 int list_lru_add(struct list_lru *lru, struct list_head *item);
 int list_lru_del(struct list_lru *lru, struct list_head *item);
-long list_lru_count_nodemask(struct list_lru *lru, nodemask_t *nodes_to_count);
+
+long list_lru_count_nodemask_memcg(struct list_lru *lru,
+			nodemask_t *nodes_to_count, struct mem_cgroup *memcg);
+
+static inline long
+list_lru_count_nodemask(struct list_lru *lru, nodemask_t *nodes_to_count)
+{
+	return list_lru_count_nodemask_memcg(lru, nodes_to_count, NULL);
+}
 
 static inline long list_lru_count(struct list_lru *lru)
 {
 	return list_lru_count_nodemask(lru, &lru->active_nodes);
 }
 
-
 typedef int (*list_lru_walk_cb)(struct list_head *item, spinlock_t *lock,
 				void *cb_arg);
 typedef void (*list_lru_dispose_cb)(struct list_head *dispose_list);
 
-long list_lru_walk_nodemask(struct list_lru *lru, list_lru_walk_cb isolate,
-		   void *cb_arg, long nr_to_walk, nodemask_t *nodes_to_walk);
+long list_lru_walk_nodemask_memcg(struct list_lru *lru,
+	list_lru_walk_cb isolate, void *cb_arg, long nr_to_walk,
+	nodemask_t *nodes_to_walk, struct mem_cgroup *memcg);
+
+static inline long list_lru_walk_nodemask(struct list_lru *lru,
+	list_lru_walk_cb isolate, void *cb_arg, long nr_to_walk,
+	nodemask_t *nodes_to_walk)
+{
+	return list_lru_walk_nodemask_memcg(lru, isolate, cb_arg, nr_to_walk,
+					    &lru->active_nodes, NULL);
+}
 
 static inline long list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
 				 void *cb_arg, long nr_to_walk)
diff --git a/lib/list_lru.c b/lib/list_lru.c
index 1d16404..e2bbde6 100644
--- a/lib/list_lru.c
+++ b/lib/list_lru.c
@@ -86,25 +86,44 @@ list_lru_del(
 EXPORT_SYMBOL_GPL(list_lru_del);
 
 long
-list_lru_count_nodemask(
+list_lru_count_nodemask_memcg(
 	struct list_lru *lru,
-	nodemask_t	*nodes_to_count)
+	nodemask_t	*nodes_to_count,
+	struct mem_cgroup *memcg)
 {
 	long count = 0;
 	int nid;
+	nodemask_t nodes;
+	struct list_lru_node *nlru;
+	int memcg_id = memcg_cache_id(memcg);
+
+	/*
+	 * Conservative code can call this setting nodes with node_setall.
+	 * This will generate an out of bound access for memcg.
+	 */
+	nodes_and(nodes, *nodes_to_count, node_online_map);
 
-	for_each_node_mask(nid, *nodes_to_count) {
+	for_each_node_mask(nid, nodes) {
 		/*
 		 * We don't need to loop through all memcgs here, because we
 		 * have the node_totals information for the node. If we hadn't,
 		 * this would still be achieavable by a loop-over-all-groups
 		 */
-		count += atomic_long_read(&lru->node_totals[nid]);
-	}
+		if (!memcg)
+			count += atomic_long_read(&lru->node_totals[nid]);
+		else {
+			nlru = lru_node_of_index(lru, memcg_id, nid);
+			WARN_ON(!nlru);
 
+			spin_lock(&nlru->lock);
+			BUG_ON(nlru->nr_items < 0);
+			count += nlru->nr_items;
+			spin_unlock(&nlru->lock);
+		}
+	}
 	return count;
 }
-EXPORT_SYMBOL_GPL(list_lru_count_nodemask);
+EXPORT_SYMBOL_GPL(list_lru_count_nodemask_memcg);
 
 static long
 list_lru_walk_node(
@@ -151,16 +170,18 @@ restart:
 }
 
 long
-list_lru_walk_nodemask(
+list_lru_walk_nodemask_memcg(
 	struct list_lru	*lru,
 	list_lru_walk_cb isolate,
 	void		*cb_arg,
 	long		nr_to_walk,
-	nodemask_t	*nodes_to_walk)
+	nodemask_t	*nodes_to_walk,
+	struct mem_cgroup *memcg)
 {
 	long isolated = 0;
 	int nid;
 	nodemask_t nodes;
+	int memcg_id = memcg_cache_id(memcg);
 	int idx;
 	struct list_lru_node *nlru;
 
@@ -172,6 +193,8 @@ list_lru_walk_nodemask(
 
 	for_each_node_mask(nid, nodes) {
 		for_each_memcg_lru_index(lru, idx, nid) {
+			if ((memcg_id >= 0) &&  (idx != memcg_id))
+				continue;
 
 			nlru = lru_node_of_index(lru, idx, nid);
 			if (!nlru)
@@ -185,7 +208,7 @@ list_lru_walk_nodemask(
 	}
 	return isolated;
 }
-EXPORT_SYMBOL_GPL(list_lru_walk_nodemask);
+EXPORT_SYMBOL_GPL(list_lru_walk_nodemask_memcg);
 
 long
 list_lru_dispose_all_node(
-- 
1.8.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 6/7] super: targeted memcg reclaim
  2013-02-08 13:07 ` Glauber Costa
  (?)
@ 2013-02-08 13:07   ` Glauber Costa
  -1 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-08 13:07 UTC (permalink / raw)
  To: linux-mm
  Cc: cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel, Glauber Costa,
	Dave Chinner, Mel Gorman, Rik van Riel, Hugh Dickins

We now have all our dentries and inodes placed in memcg-specific LRU
lists. All we have to do, is to restrict the reclaim to the said lists
in case of memcg pressure.

Marking the superblock shrinker and its LRUs as memcg-aware will
guarantee that the shrinkers will get invoked during targetted reclaim.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 fs/dcache.c   |  7 ++++---
 fs/inode.c    |  6 +++---
 fs/internal.h |  5 +++--
 fs/super.c    | 37 +++++++++++++++++++++++++++----------
 4 files changed, 37 insertions(+), 18 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 7f107fb..6f74887 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -907,14 +907,15 @@ static int dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock,
  * This function may fail to free any resources if all the dentries are in
  * use.
  */
+
 long prune_dcache_sb(struct super_block *sb, long nr_to_scan,
-		     nodemask_t *nodes_to_walk)
+		     nodemask_t *nodes_to_walk, struct mem_cgroup *memcg)
 {
 	LIST_HEAD(dispose);
 	long freed;
 
-	freed = list_lru_walk_nodemask(&sb->s_dentry_lru, dentry_lru_isolate,
-				       &dispose, nr_to_scan, nodes_to_walk);
+	freed = list_lru_walk_nodemask_memcg(&sb->s_dentry_lru,
+		dentry_lru_isolate, &dispose, nr_to_scan, nodes_to_walk, memcg);
 	shrink_dentry_list(&dispose);
 	return freed;
 }
diff --git a/fs/inode.c b/fs/inode.c
index 5bb1e21..61673be 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -746,13 +746,13 @@ static int inode_lru_isolate(struct list_head *item, spinlock_t *lru_lock,
  * then are freed outside inode_lock by dispose_list().
  */
 long prune_icache_sb(struct super_block *sb, long nr_to_scan,
-		     nodemask_t *nodes_to_walk)
+			nodemask_t *nodes_to_walk, struct mem_cgroup *memcg)
 {
 	LIST_HEAD(freeable);
 	long freed;
 
-	freed = list_lru_walk_nodemask(&sb->s_inode_lru, inode_lru_isolate,
-				       &freeable, nr_to_scan, nodes_to_walk);
+	freed = list_lru_walk_nodemask_memcg(&sb->s_inode_lru,
+		inode_lru_isolate, &freeable, nr_to_scan, nodes_to_walk, memcg);
 	dispose_list(&freeable);
 	return freed;
 }
diff --git a/fs/internal.h b/fs/internal.h
index 0f37896..5e2211f 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -16,6 +16,7 @@ struct file_system_type;
 struct linux_binprm;
 struct path;
 struct mount;
+struct mem_cgroup;
 
 /*
  * block_dev.c
@@ -111,7 +112,7 @@ extern int open_check_o_direct(struct file *f);
  */
 extern spinlock_t inode_sb_list_lock;
 extern long prune_icache_sb(struct super_block *sb, long nr_to_scan,
-			    nodemask_t *nodes_to_scan);
+		    nodemask_t *nodes_to_scan, struct mem_cgroup *memcg);
 extern void inode_add_lru(struct inode *inode);
 
 /*
@@ -128,4 +129,4 @@ extern int invalidate_inodes(struct super_block *, bool);
  */
 extern struct dentry *__d_alloc(struct super_block *, const struct qstr *);
 extern long prune_dcache_sb(struct super_block *sb, long nr_to_scan,
-			    nodemask_t *nodes_to_scan);
+		    nodemask_t *nodes_to_scan, struct mem_cgroup *memcg);
diff --git a/fs/super.c b/fs/super.c
index fe3aa4c..f687cc2 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -34,6 +34,7 @@
 #include <linux/cleancache.h>
 #include <linux/fsnotify.h>
 #include <linux/lockdep.h>
+#include <linux/memcontrol.h>
 #include "internal.h"
 
 
@@ -56,6 +57,7 @@ static char *sb_writers_name[SB_FREEZE_LEVELS] = {
 static long super_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
 {
 	struct super_block *sb;
+	struct mem_cgroup *memcg = sc->target_mem_cgroup;
 	long	fs_objects = 0;
 	long	total_objects;
 	long	freed = 0;
@@ -74,11 +76,13 @@ static long super_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
 	if (!grab_super_passive(sb))
 		return -1;
 
-	if (sb->s_op && sb->s_op->nr_cached_objects)
+	if (sb->s_op && sb->s_op->nr_cached_objects && !memcg)
 		fs_objects = sb->s_op->nr_cached_objects(sb, &sc->nodes_to_scan);
 
-	inodes = list_lru_count_nodemask(&sb->s_inode_lru, &sc->nodes_to_scan);
-	dentries = list_lru_count_nodemask(&sb->s_dentry_lru, &sc->nodes_to_scan);
+	inodes = list_lru_count_nodemask_memcg(&sb->s_inode_lru,
+					 &sc->nodes_to_scan, memcg);
+	dentries = list_lru_count_nodemask_memcg(&sb->s_dentry_lru,
+					   &sc->nodes_to_scan, memcg);
 	total_objects = dentries + inodes + fs_objects + 1;
 
 	/* proportion the scan between the caches */
@@ -89,8 +93,8 @@ static long super_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
 	 * prune the dcache first as the icache is pinned by it, then
 	 * prune the icache, followed by the filesystem specific caches
 	 */
-	freed = prune_dcache_sb(sb, dentries, &sc->nodes_to_scan);
-	freed += prune_icache_sb(sb, inodes, &sc->nodes_to_scan);
+	freed = prune_dcache_sb(sb, dentries, &sc->nodes_to_scan, memcg);
+	freed += prune_icache_sb(sb, inodes, &sc->nodes_to_scan, memcg);
 
 	if (fs_objects) {
 		fs_objects = (sc->nr_to_scan * fs_objects) / total_objects;
@@ -106,20 +110,26 @@ static long super_cache_count(struct shrinker *shrink, struct shrink_control *sc
 {
 	struct super_block *sb;
 	long	total_objects = 0;
+	struct mem_cgroup *memcg = sc->target_mem_cgroup;
 
 	sb = container_of(shrink, struct super_block, s_shrink);
 
 	if (!grab_super_passive(sb))
 		return 0;
 
-	if (sb->s_op && sb->s_op->nr_cached_objects)
+	/*
+	 * Ideally we would pass memcg to nr_cached_objects, and
+	 * let the underlying filesystem decide. Most likely the
+	 * path will be if (!memcg) return;, but even then.
+	 */
+	if (sb->s_op && sb->s_op->nr_cached_objects && !memcg)
 		total_objects = sb->s_op->nr_cached_objects(sb,
 						 &sc->nodes_to_scan);
 
-	total_objects += list_lru_count_nodemask(&sb->s_dentry_lru,
-						 &sc->nodes_to_scan);
-	total_objects += list_lru_count_nodemask(&sb->s_inode_lru,
-						 &sc->nodes_to_scan);
+	total_objects += list_lru_count_nodemask_memcg(&sb->s_dentry_lru,
+						 &sc->nodes_to_scan, memcg);
+	total_objects += list_lru_count_nodemask_memcg(&sb->s_inode_lru,
+						 &sc->nodes_to_scan, memcg);
 
 	total_objects = vfs_pressure_ratio(total_objects);
 	drop_super(sb);
@@ -198,8 +208,12 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 		INIT_HLIST_NODE(&s->s_instances);
 		INIT_HLIST_BL_HEAD(&s->s_anon);
 		INIT_LIST_HEAD(&s->s_inodes);
+
+		lru_memcg_enable(&s->s_dentry_lru);
 		list_lru_init(&s->s_dentry_lru);
+		lru_memcg_enable(&s->s_inode_lru);
 		list_lru_init(&s->s_inode_lru);
+
 		INIT_LIST_HEAD(&s->s_mounts);
 		init_rwsem(&s->s_umount);
 		lockdep_set_class(&s->s_umount, &type->s_umount_key);
@@ -235,6 +249,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 		s->s_shrink.scan_objects = super_cache_scan;
 		s->s_shrink.count_objects = super_cache_count;
 		s->s_shrink.batch = 1024;
+		s->s_shrink.memcg_shrinker = true;
 	}
 out:
 	return s;
@@ -317,6 +332,8 @@ void deactivate_locked_super(struct super_block *s)
 
 		/* caches are now gone, we can safely kill the shrinker now */
 		unregister_shrinker(&s->s_shrink);
+		list_lru_destroy(&s->s_dentry_lru);
+		list_lru_destroy(&s->s_inode_lru);
 		put_filesystem(fs);
 		put_super(s);
 	} else {
-- 
1.8.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 6/7] super: targeted memcg reclaim
@ 2013-02-08 13:07   ` Glauber Costa
  0 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-08 13:07 UTC (permalink / raw)
  To: linux-mm
  Cc: cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel, Glauber Costa,
	Dave Chinner, Mel Gorman, Rik van Riel, Hugh Dickins

We now have all our dentries and inodes placed in memcg-specific LRU
lists. All we have to do, is to restrict the reclaim to the said lists
in case of memcg pressure.

Marking the superblock shrinker and its LRUs as memcg-aware will
guarantee that the shrinkers will get invoked during targetted reclaim.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 fs/dcache.c   |  7 ++++---
 fs/inode.c    |  6 +++---
 fs/internal.h |  5 +++--
 fs/super.c    | 37 +++++++++++++++++++++++++++----------
 4 files changed, 37 insertions(+), 18 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 7f107fb..6f74887 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -907,14 +907,15 @@ static int dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock,
  * This function may fail to free any resources if all the dentries are in
  * use.
  */
+
 long prune_dcache_sb(struct super_block *sb, long nr_to_scan,
-		     nodemask_t *nodes_to_walk)
+		     nodemask_t *nodes_to_walk, struct mem_cgroup *memcg)
 {
 	LIST_HEAD(dispose);
 	long freed;
 
-	freed = list_lru_walk_nodemask(&sb->s_dentry_lru, dentry_lru_isolate,
-				       &dispose, nr_to_scan, nodes_to_walk);
+	freed = list_lru_walk_nodemask_memcg(&sb->s_dentry_lru,
+		dentry_lru_isolate, &dispose, nr_to_scan, nodes_to_walk, memcg);
 	shrink_dentry_list(&dispose);
 	return freed;
 }
diff --git a/fs/inode.c b/fs/inode.c
index 5bb1e21..61673be 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -746,13 +746,13 @@ static int inode_lru_isolate(struct list_head *item, spinlock_t *lru_lock,
  * then are freed outside inode_lock by dispose_list().
  */
 long prune_icache_sb(struct super_block *sb, long nr_to_scan,
-		     nodemask_t *nodes_to_walk)
+			nodemask_t *nodes_to_walk, struct mem_cgroup *memcg)
 {
 	LIST_HEAD(freeable);
 	long freed;
 
-	freed = list_lru_walk_nodemask(&sb->s_inode_lru, inode_lru_isolate,
-				       &freeable, nr_to_scan, nodes_to_walk);
+	freed = list_lru_walk_nodemask_memcg(&sb->s_inode_lru,
+		inode_lru_isolate, &freeable, nr_to_scan, nodes_to_walk, memcg);
 	dispose_list(&freeable);
 	return freed;
 }
diff --git a/fs/internal.h b/fs/internal.h
index 0f37896..5e2211f 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -16,6 +16,7 @@ struct file_system_type;
 struct linux_binprm;
 struct path;
 struct mount;
+struct mem_cgroup;
 
 /*
  * block_dev.c
@@ -111,7 +112,7 @@ extern int open_check_o_direct(struct file *f);
  */
 extern spinlock_t inode_sb_list_lock;
 extern long prune_icache_sb(struct super_block *sb, long nr_to_scan,
-			    nodemask_t *nodes_to_scan);
+		    nodemask_t *nodes_to_scan, struct mem_cgroup *memcg);
 extern void inode_add_lru(struct inode *inode);
 
 /*
@@ -128,4 +129,4 @@ extern int invalidate_inodes(struct super_block *, bool);
  */
 extern struct dentry *__d_alloc(struct super_block *, const struct qstr *);
 extern long prune_dcache_sb(struct super_block *sb, long nr_to_scan,
-			    nodemask_t *nodes_to_scan);
+		    nodemask_t *nodes_to_scan, struct mem_cgroup *memcg);
diff --git a/fs/super.c b/fs/super.c
index fe3aa4c..f687cc2 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -34,6 +34,7 @@
 #include <linux/cleancache.h>
 #include <linux/fsnotify.h>
 #include <linux/lockdep.h>
+#include <linux/memcontrol.h>
 #include "internal.h"
 
 
@@ -56,6 +57,7 @@ static char *sb_writers_name[SB_FREEZE_LEVELS] = {
 static long super_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
 {
 	struct super_block *sb;
+	struct mem_cgroup *memcg = sc->target_mem_cgroup;
 	long	fs_objects = 0;
 	long	total_objects;
 	long	freed = 0;
@@ -74,11 +76,13 @@ static long super_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
 	if (!grab_super_passive(sb))
 		return -1;
 
-	if (sb->s_op && sb->s_op->nr_cached_objects)
+	if (sb->s_op && sb->s_op->nr_cached_objects && !memcg)
 		fs_objects = sb->s_op->nr_cached_objects(sb, &sc->nodes_to_scan);
 
-	inodes = list_lru_count_nodemask(&sb->s_inode_lru, &sc->nodes_to_scan);
-	dentries = list_lru_count_nodemask(&sb->s_dentry_lru, &sc->nodes_to_scan);
+	inodes = list_lru_count_nodemask_memcg(&sb->s_inode_lru,
+					 &sc->nodes_to_scan, memcg);
+	dentries = list_lru_count_nodemask_memcg(&sb->s_dentry_lru,
+					   &sc->nodes_to_scan, memcg);
 	total_objects = dentries + inodes + fs_objects + 1;
 
 	/* proportion the scan between the caches */
@@ -89,8 +93,8 @@ static long super_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
 	 * prune the dcache first as the icache is pinned by it, then
 	 * prune the icache, followed by the filesystem specific caches
 	 */
-	freed = prune_dcache_sb(sb, dentries, &sc->nodes_to_scan);
-	freed += prune_icache_sb(sb, inodes, &sc->nodes_to_scan);
+	freed = prune_dcache_sb(sb, dentries, &sc->nodes_to_scan, memcg);
+	freed += prune_icache_sb(sb, inodes, &sc->nodes_to_scan, memcg);
 
 	if (fs_objects) {
 		fs_objects = (sc->nr_to_scan * fs_objects) / total_objects;
@@ -106,20 +110,26 @@ static long super_cache_count(struct shrinker *shrink, struct shrink_control *sc
 {
 	struct super_block *sb;
 	long	total_objects = 0;
+	struct mem_cgroup *memcg = sc->target_mem_cgroup;
 
 	sb = container_of(shrink, struct super_block, s_shrink);
 
 	if (!grab_super_passive(sb))
 		return 0;
 
-	if (sb->s_op && sb->s_op->nr_cached_objects)
+	/*
+	 * Ideally we would pass memcg to nr_cached_objects, and
+	 * let the underlying filesystem decide. Most likely the
+	 * path will be if (!memcg) return;, but even then.
+	 */
+	if (sb->s_op && sb->s_op->nr_cached_objects && !memcg)
 		total_objects = sb->s_op->nr_cached_objects(sb,
 						 &sc->nodes_to_scan);
 
-	total_objects += list_lru_count_nodemask(&sb->s_dentry_lru,
-						 &sc->nodes_to_scan);
-	total_objects += list_lru_count_nodemask(&sb->s_inode_lru,
-						 &sc->nodes_to_scan);
+	total_objects += list_lru_count_nodemask_memcg(&sb->s_dentry_lru,
+						 &sc->nodes_to_scan, memcg);
+	total_objects += list_lru_count_nodemask_memcg(&sb->s_inode_lru,
+						 &sc->nodes_to_scan, memcg);
 
 	total_objects = vfs_pressure_ratio(total_objects);
 	drop_super(sb);
@@ -198,8 +208,12 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 		INIT_HLIST_NODE(&s->s_instances);
 		INIT_HLIST_BL_HEAD(&s->s_anon);
 		INIT_LIST_HEAD(&s->s_inodes);
+
+		lru_memcg_enable(&s->s_dentry_lru);
 		list_lru_init(&s->s_dentry_lru);
+		lru_memcg_enable(&s->s_inode_lru);
 		list_lru_init(&s->s_inode_lru);
+
 		INIT_LIST_HEAD(&s->s_mounts);
 		init_rwsem(&s->s_umount);
 		lockdep_set_class(&s->s_umount, &type->s_umount_key);
@@ -235,6 +249,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 		s->s_shrink.scan_objects = super_cache_scan;
 		s->s_shrink.count_objects = super_cache_count;
 		s->s_shrink.batch = 1024;
+		s->s_shrink.memcg_shrinker = true;
 	}
 out:
 	return s;
@@ -317,6 +332,8 @@ void deactivate_locked_super(struct super_block *s)
 
 		/* caches are now gone, we can safely kill the shrinker now */
 		unregister_shrinker(&s->s_shrink);
+		list_lru_destroy(&s->s_dentry_lru);
+		list_lru_destroy(&s->s_inode_lru);
 		put_filesystem(fs);
 		put_super(s);
 	} else {
-- 
1.8.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 6/7] super: targeted memcg reclaim
@ 2013-02-08 13:07   ` Glauber Costa
  0 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-08 13:07 UTC (permalink / raw)
  To: linux-mm
  Cc: cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel, Glauber Costa,
	Dave Chinner, Mel Gorman, Rik van Riel, Hugh Dickins

We now have all our dentries and inodes placed in memcg-specific LRU
lists. All we have to do, is to restrict the reclaim to the said lists
in case of memcg pressure.

Marking the superblock shrinker and its LRUs as memcg-aware will
guarantee that the shrinkers will get invoked during targetted reclaim.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 fs/dcache.c   |  7 ++++---
 fs/inode.c    |  6 +++---
 fs/internal.h |  5 +++--
 fs/super.c    | 37 +++++++++++++++++++++++++++----------
 4 files changed, 37 insertions(+), 18 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 7f107fb..6f74887 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -907,14 +907,15 @@ static int dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock,
  * This function may fail to free any resources if all the dentries are in
  * use.
  */
+
 long prune_dcache_sb(struct super_block *sb, long nr_to_scan,
-		     nodemask_t *nodes_to_walk)
+		     nodemask_t *nodes_to_walk, struct mem_cgroup *memcg)
 {
 	LIST_HEAD(dispose);
 	long freed;
 
-	freed = list_lru_walk_nodemask(&sb->s_dentry_lru, dentry_lru_isolate,
-				       &dispose, nr_to_scan, nodes_to_walk);
+	freed = list_lru_walk_nodemask_memcg(&sb->s_dentry_lru,
+		dentry_lru_isolate, &dispose, nr_to_scan, nodes_to_walk, memcg);
 	shrink_dentry_list(&dispose);
 	return freed;
 }
diff --git a/fs/inode.c b/fs/inode.c
index 5bb1e21..61673be 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -746,13 +746,13 @@ static int inode_lru_isolate(struct list_head *item, spinlock_t *lru_lock,
  * then are freed outside inode_lock by dispose_list().
  */
 long prune_icache_sb(struct super_block *sb, long nr_to_scan,
-		     nodemask_t *nodes_to_walk)
+			nodemask_t *nodes_to_walk, struct mem_cgroup *memcg)
 {
 	LIST_HEAD(freeable);
 	long freed;
 
-	freed = list_lru_walk_nodemask(&sb->s_inode_lru, inode_lru_isolate,
-				       &freeable, nr_to_scan, nodes_to_walk);
+	freed = list_lru_walk_nodemask_memcg(&sb->s_inode_lru,
+		inode_lru_isolate, &freeable, nr_to_scan, nodes_to_walk, memcg);
 	dispose_list(&freeable);
 	return freed;
 }
diff --git a/fs/internal.h b/fs/internal.h
index 0f37896..5e2211f 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -16,6 +16,7 @@ struct file_system_type;
 struct linux_binprm;
 struct path;
 struct mount;
+struct mem_cgroup;
 
 /*
  * block_dev.c
@@ -111,7 +112,7 @@ extern int open_check_o_direct(struct file *f);
  */
 extern spinlock_t inode_sb_list_lock;
 extern long prune_icache_sb(struct super_block *sb, long nr_to_scan,
-			    nodemask_t *nodes_to_scan);
+		    nodemask_t *nodes_to_scan, struct mem_cgroup *memcg);
 extern void inode_add_lru(struct inode *inode);
 
 /*
@@ -128,4 +129,4 @@ extern int invalidate_inodes(struct super_block *, bool);
  */
 extern struct dentry *__d_alloc(struct super_block *, const struct qstr *);
 extern long prune_dcache_sb(struct super_block *sb, long nr_to_scan,
-			    nodemask_t *nodes_to_scan);
+		    nodemask_t *nodes_to_scan, struct mem_cgroup *memcg);
diff --git a/fs/super.c b/fs/super.c
index fe3aa4c..f687cc2 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -34,6 +34,7 @@
 #include <linux/cleancache.h>
 #include <linux/fsnotify.h>
 #include <linux/lockdep.h>
+#include <linux/memcontrol.h>
 #include "internal.h"
 
 
@@ -56,6 +57,7 @@ static char *sb_writers_name[SB_FREEZE_LEVELS] = {
 static long super_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
 {
 	struct super_block *sb;
+	struct mem_cgroup *memcg = sc->target_mem_cgroup;
 	long	fs_objects = 0;
 	long	total_objects;
 	long	freed = 0;
@@ -74,11 +76,13 @@ static long super_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
 	if (!grab_super_passive(sb))
 		return -1;
 
-	if (sb->s_op && sb->s_op->nr_cached_objects)
+	if (sb->s_op && sb->s_op->nr_cached_objects && !memcg)
 		fs_objects = sb->s_op->nr_cached_objects(sb, &sc->nodes_to_scan);
 
-	inodes = list_lru_count_nodemask(&sb->s_inode_lru, &sc->nodes_to_scan);
-	dentries = list_lru_count_nodemask(&sb->s_dentry_lru, &sc->nodes_to_scan);
+	inodes = list_lru_count_nodemask_memcg(&sb->s_inode_lru,
+					 &sc->nodes_to_scan, memcg);
+	dentries = list_lru_count_nodemask_memcg(&sb->s_dentry_lru,
+					   &sc->nodes_to_scan, memcg);
 	total_objects = dentries + inodes + fs_objects + 1;
 
 	/* proportion the scan between the caches */
@@ -89,8 +93,8 @@ static long super_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
 	 * prune the dcache first as the icache is pinned by it, then
 	 * prune the icache, followed by the filesystem specific caches
 	 */
-	freed = prune_dcache_sb(sb, dentries, &sc->nodes_to_scan);
-	freed += prune_icache_sb(sb, inodes, &sc->nodes_to_scan);
+	freed = prune_dcache_sb(sb, dentries, &sc->nodes_to_scan, memcg);
+	freed += prune_icache_sb(sb, inodes, &sc->nodes_to_scan, memcg);
 
 	if (fs_objects) {
 		fs_objects = (sc->nr_to_scan * fs_objects) / total_objects;
@@ -106,20 +110,26 @@ static long super_cache_count(struct shrinker *shrink, struct shrink_control *sc
 {
 	struct super_block *sb;
 	long	total_objects = 0;
+	struct mem_cgroup *memcg = sc->target_mem_cgroup;
 
 	sb = container_of(shrink, struct super_block, s_shrink);
 
 	if (!grab_super_passive(sb))
 		return 0;
 
-	if (sb->s_op && sb->s_op->nr_cached_objects)
+	/*
+	 * Ideally we would pass memcg to nr_cached_objects, and
+	 * let the underlying filesystem decide. Most likely the
+	 * path will be if (!memcg) return;, but even then.
+	 */
+	if (sb->s_op && sb->s_op->nr_cached_objects && !memcg)
 		total_objects = sb->s_op->nr_cached_objects(sb,
 						 &sc->nodes_to_scan);
 
-	total_objects += list_lru_count_nodemask(&sb->s_dentry_lru,
-						 &sc->nodes_to_scan);
-	total_objects += list_lru_count_nodemask(&sb->s_inode_lru,
-						 &sc->nodes_to_scan);
+	total_objects += list_lru_count_nodemask_memcg(&sb->s_dentry_lru,
+						 &sc->nodes_to_scan, memcg);
+	total_objects += list_lru_count_nodemask_memcg(&sb->s_inode_lru,
+						 &sc->nodes_to_scan, memcg);
 
 	total_objects = vfs_pressure_ratio(total_objects);
 	drop_super(sb);
@@ -198,8 +208,12 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 		INIT_HLIST_NODE(&s->s_instances);
 		INIT_HLIST_BL_HEAD(&s->s_anon);
 		INIT_LIST_HEAD(&s->s_inodes);
+
+		lru_memcg_enable(&s->s_dentry_lru);
 		list_lru_init(&s->s_dentry_lru);
+		lru_memcg_enable(&s->s_inode_lru);
 		list_lru_init(&s->s_inode_lru);
+
 		INIT_LIST_HEAD(&s->s_mounts);
 		init_rwsem(&s->s_umount);
 		lockdep_set_class(&s->s_umount, &type->s_umount_key);
@@ -235,6 +249,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 		s->s_shrink.scan_objects = super_cache_scan;
 		s->s_shrink.count_objects = super_cache_count;
 		s->s_shrink.batch = 1024;
+		s->s_shrink.memcg_shrinker = true;
 	}
 out:
 	return s;
@@ -317,6 +332,8 @@ void deactivate_locked_super(struct super_block *s)
 
 		/* caches are now gone, we can safely kill the shrinker now */
 		unregister_shrinker(&s->s_shrink);
+		list_lru_destroy(&s->s_dentry_lru);
+		list_lru_destroy(&s->s_inode_lru);
 		put_filesystem(fs);
 		put_super(s);
 	} else {
-- 
1.8.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 7/7] memcg: per-memcg kmem shrinking
  2013-02-08 13:07 ` Glauber Costa
  (?)
@ 2013-02-08 13:07   ` Glauber Costa
  -1 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-08 13:07 UTC (permalink / raw)
  To: linux-mm
  Cc: cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel, Glauber Costa,
	Dave Chinner, Mel Gorman, Rik van Riel, Hugh Dickins

If the kernel limit is smaller than the user limit, we will have
situations in which our allocations fail but freeing user pages will buy
us nothing.  In those, we would like to call a specialized memcg
reclaimer that only frees kernel memory and leave the user memory alone.
Those are also expected to fail when we account memcg->kmem, instead of
when we account memcg->res. Based on that, this patch implements a
memcg-specific reclaimer, that only shrinks kernel objects, withouth
touching user pages.

There might be situations in which there are plenty of objects to
shrink, but we can't do it because the __GFP_FS flag is not set.
Although they can happen with user pages, they are a lot more common
with fs-metadata: this is the case with almost all inode allocation.

Those allocations are, however, capable of waiting.  So we can just span
a worker, let it finish its job and proceed with the allocation. As slow
as it is, at this point we are already past any hopes anyway.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/swap.h |   2 +
 mm/memcontrol.c      | 102 +++++++++++++++++++++++++++++++++++++++++++++++++--
 mm/vmscan.c          |  37 ++++++++++++++++++-
 3 files changed, 137 insertions(+), 4 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 8c66486..ff74226 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -259,6 +259,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
 						  gfp_t gfp_mask, bool noswap);
+extern unsigned long try_to_free_mem_cgroup_kmem(struct mem_cgroup *mem,
+						 gfp_t gfp_mask);
 extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
 						gfp_t gfp_mask, bool noswap,
 						struct zone *zone,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bfb4b5b..7dc9ec1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -294,6 +294,8 @@ struct mem_cgroup {
 	atomic_t	numainfo_events;
 	atomic_t	numainfo_updating;
 #endif
+
+	struct work_struct kmemcg_shrink_work;
 	/*
 	 * Should the accounting and control be hierarchical, per subtree?
 	 */
@@ -376,6 +378,9 @@ struct mem_cgroup {
 #endif
 };
 
+
+static DEFINE_MUTEX(set_limit_mutex);
+
 #ifdef CONFIG_MEMCG_DEBUG_ASYNC_DESTROY
 static LIST_HEAD(dangling_memcgs);
 static DEFINE_MUTEX(dangling_memcgs_mutex);
@@ -430,6 +435,7 @@ enum {
 	KMEM_ACCOUNTED_ACTIVE = 0, /* accounted by this cgroup itself */
 	KMEM_ACCOUNTED_ACTIVATED, /* static key enabled. */
 	KMEM_ACCOUNTED_DEAD, /* dead memcg with pending kmem charges */
+	KMEM_MAY_SHRINK, /* kmem limit < mem limit, shrink kmem only */
 };
 
 /* We account when limit is on, but only after call sites are patched */
@@ -468,6 +474,36 @@ static bool memcg_kmem_test_and_clear_dead(struct mem_cgroup *memcg)
 	return test_and_clear_bit(KMEM_ACCOUNTED_DEAD,
 				  &memcg->kmem_account_flags);
 }
+
+/*
+ * If the kernel limit is smaller than the user limit, we will have situations
+ * in which our allocations fail but freeing user pages will buy us nothing.
+ * In those, we would like to call a specialized memcg reclaimer that only
+ * frees kernel memory and leave the user memory alone.
+ *
+ * This test exists so we can differentiate between those. Everytime one of the
+ * limits is updated, we need to run it. The set_limit_mutex must be held, so
+ * they don't change again.
+ */
+static void memcg_update_shrink_status(struct mem_cgroup *memcg)
+{
+	mutex_lock(&set_limit_mutex);
+	if (res_counter_read_u64(&memcg->kmem, RES_LIMIT) <
+		res_counter_read_u64(&memcg->res, RES_LIMIT))
+		set_bit(KMEM_MAY_SHRINK, &memcg->kmem_account_flags);
+	else
+		clear_bit(KMEM_MAY_SHRINK, &memcg->kmem_account_flags);
+	mutex_unlock(&set_limit_mutex);
+}
+
+static bool memcg_kmem_should_shrink(struct mem_cgroup *memcg)
+{
+	return test_bit(KMEM_MAY_SHRINK, &memcg->kmem_account_flags);
+}
+#else
+static void memcg_update_shrink_status(struct mem_cgroup *memcg)
+{
+}
 #endif
 
 /* Stuffs for move charges at task migration. */
@@ -2882,8 +2918,6 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
 	memcg_check_events(memcg, page);
 }
 
-static DEFINE_MUTEX(set_limit_mutex);
-
 #ifdef CONFIG_MEMCG_KMEM
 static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
 {
@@ -2925,6 +2959,7 @@ static int mem_cgroup_slabinfo_read(struct cgroup *cont, struct cftype *cft,
 }
 #endif
 
+static int memcg_try_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size);
 static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
 {
 	struct res_counter *fail_res;
@@ -2932,7 +2967,7 @@ static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
 	int ret = 0;
 	bool may_oom;
 
-	ret = res_counter_charge(&memcg->kmem, size, &fail_res);
+	ret = memcg_try_charge_kmem(memcg, gfp, size);
 	if (ret)
 		return ret;
 
@@ -2973,6 +3008,25 @@ static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
 	return ret;
 }
 
+/*
+ * There might be situations in which there are plenty of objects to shrink,
+ * but we can't do it because the __GFP_FS flag is not set.  This is the case
+ * with almost all inode allocation. They do are, however, capable of waiting.
+ * So we can just span a worker, let it finish its job and proceed with the
+ * allocation. As slow as it is, at this point we are already past any hopes
+ * anyway.
+ */
+static void kmemcg_shrink_work_fn(struct work_struct *w)
+{
+	struct mem_cgroup *memcg;
+
+	memcg = container_of(w, struct mem_cgroup, kmemcg_shrink_work);
+
+	if (!try_to_free_mem_cgroup_kmem(memcg, GFP_KERNEL))
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
+}
+
+
 static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size)
 {
 	res_counter_uncharge(&memcg->res, size);
@@ -3049,6 +3103,7 @@ int memcg_update_cache_sizes(struct mem_cgroup *memcg)
 	memcg_update_array_size(num + 1);
 
 	INIT_LIST_HEAD(&memcg->memcg_slab_caches);
+	INIT_WORK(&memcg->kmemcg_shrink_work, kmemcg_shrink_work_fn);
 	mutex_init(&memcg->slab_caches_mutex);
 
 	return 0;
@@ -3319,6 +3374,36 @@ static inline void memcg_resume_kmem_account(void)
 	current->memcg_kmem_skip_account--;
 }
 
+static int memcg_try_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
+{
+	int retries = MEM_CGROUP_RECLAIM_RETRIES;
+	struct res_counter *fail_res;
+	int ret;
+
+	do {
+		ret = res_counter_charge(&memcg->kmem, size, &fail_res);
+		if (!ret)
+			return ret;
+
+		if (!memcg_kmem_should_shrink(memcg) || !(gfp & __GFP_WAIT))
+			return ret;
+
+		if (!(gfp & __GFP_FS)) {
+			/*
+			 * we are already short on memory, every queue
+			 * allocation is likely to fail
+			 */
+			memcg_stop_kmem_account();
+			schedule_work(&memcg->kmemcg_shrink_work);
+			memcg_resume_kmem_account();
+		} else if (!try_to_free_mem_cgroup_kmem(memcg, gfp))
+			congestion_wait(BLK_RW_ASYNC, HZ/10);
+
+	} while (retries--);
+
+	return ret;
+}
+
 static struct mem_cgroup *mem_cgroup_from_kmem_page(struct page *page)
 {
 	struct page_cgroup *pc;
@@ -5399,6 +5484,9 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
 			ret = memcg_update_kmem_limit(cont, val);
 		else
 			return -EINVAL;
+
+		if (!ret)
+			memcg_update_shrink_status(memcg);
 		break;
 	case RES_SOFT_LIMIT:
 		ret = res_counter_memparse_write_strategy(buffer, &val);
@@ -5411,6 +5499,8 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
 		 */
 		if (type == _MEM)
 			ret = res_counter_set_soft_limit(&memcg->res, val);
+		else if (type == _KMEM)
+			ret = res_counter_set_soft_limit(&memcg->kmem, val);
 		else
 			ret = -EINVAL;
 		break;
@@ -6178,6 +6268,12 @@ static struct cftype mem_cgroup_files[] = {
 		.read = mem_cgroup_read,
 	},
 	{
+		.name = "kmem.soft_limit_in_bytes",
+		.private = MEMFILE_PRIVATE(_KMEM, RES_SOFT_LIMIT),
+		.write_string = mem_cgroup_write,
+		.read = mem_cgroup_read,
+	},
+	{
 		.name = "kmem.usage_in_bytes",
 		.private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
 		.read = mem_cgroup_read,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8af0e2b..e4de27a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2499,7 +2499,42 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 
 	return nr_reclaimed;
 }
-#endif
+
+#ifdef CONFIG_MEMCG_KMEM
+/*
+ * This function is called when we are under kmem-specific pressure.  It will
+ * only trigger in environments with kmem.limit_in_bytes < limit_in_bytes, IOW,
+ * with a lower kmem allowance than the memory allowance.
+ *
+ * In this situation, freeing user pages from the cgroup won't do us any good.
+ * What we really need is to call the memcg-aware shrinkers, in the hope of
+ * freeing pages holding kmem objects. It may also be that we won't be able to
+ * free any pages, but will get rid of old objects opening up space for new
+ * ones.
+ */
+unsigned long try_to_free_mem_cgroup_kmem(struct mem_cgroup *memcg,
+					  gfp_t gfp_mask)
+{
+	struct shrink_control shrink = {
+		.gfp_mask = gfp_mask,
+		.target_mem_cgroup = memcg,
+	};
+
+	if (!(gfp_mask & __GFP_WAIT))
+		return 0;
+
+	nodes_setall(shrink.nodes_to_scan);
+
+	/*
+	 * We haven't scanned any user LRU, so we basically come up with
+	 * crafted values of nr_scanned and LRU page (1 and 0 respectively).
+	 * This should be enough to tell shrink_slab that the freeing
+	 * responsibility is all on himself.
+	 */
+	return shrink_slab(&shrink, 1, 0);
+}
+#endif /* CONFIG_MEMCG_KMEM */
+#endif /* CONFIG_MEMCG */
 
 static void age_active_anon(struct zone *zone, struct scan_control *sc)
 {
-- 
1.8.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 7/7] memcg: per-memcg kmem shrinking
@ 2013-02-08 13:07   ` Glauber Costa
  0 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-08 13:07 UTC (permalink / raw)
  To: linux-mm
  Cc: cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel, Glauber Costa,
	Dave Chinner, Mel Gorman, Rik van Riel, Hugh Dickins

If the kernel limit is smaller than the user limit, we will have
situations in which our allocations fail but freeing user pages will buy
us nothing.  In those, we would like to call a specialized memcg
reclaimer that only frees kernel memory and leave the user memory alone.
Those are also expected to fail when we account memcg->kmem, instead of
when we account memcg->res. Based on that, this patch implements a
memcg-specific reclaimer, that only shrinks kernel objects, withouth
touching user pages.

There might be situations in which there are plenty of objects to
shrink, but we can't do it because the __GFP_FS flag is not set.
Although they can happen with user pages, they are a lot more common
with fs-metadata: this is the case with almost all inode allocation.

Those allocations are, however, capable of waiting.  So we can just span
a worker, let it finish its job and proceed with the allocation. As slow
as it is, at this point we are already past any hopes anyway.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/swap.h |   2 +
 mm/memcontrol.c      | 102 +++++++++++++++++++++++++++++++++++++++++++++++++--
 mm/vmscan.c          |  37 ++++++++++++++++++-
 3 files changed, 137 insertions(+), 4 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 8c66486..ff74226 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -259,6 +259,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
 						  gfp_t gfp_mask, bool noswap);
+extern unsigned long try_to_free_mem_cgroup_kmem(struct mem_cgroup *mem,
+						 gfp_t gfp_mask);
 extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
 						gfp_t gfp_mask, bool noswap,
 						struct zone *zone,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bfb4b5b..7dc9ec1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -294,6 +294,8 @@ struct mem_cgroup {
 	atomic_t	numainfo_events;
 	atomic_t	numainfo_updating;
 #endif
+
+	struct work_struct kmemcg_shrink_work;
 	/*
 	 * Should the accounting and control be hierarchical, per subtree?
 	 */
@@ -376,6 +378,9 @@ struct mem_cgroup {
 #endif
 };
 
+
+static DEFINE_MUTEX(set_limit_mutex);
+
 #ifdef CONFIG_MEMCG_DEBUG_ASYNC_DESTROY
 static LIST_HEAD(dangling_memcgs);
 static DEFINE_MUTEX(dangling_memcgs_mutex);
@@ -430,6 +435,7 @@ enum {
 	KMEM_ACCOUNTED_ACTIVE = 0, /* accounted by this cgroup itself */
 	KMEM_ACCOUNTED_ACTIVATED, /* static key enabled. */
 	KMEM_ACCOUNTED_DEAD, /* dead memcg with pending kmem charges */
+	KMEM_MAY_SHRINK, /* kmem limit < mem limit, shrink kmem only */
 };
 
 /* We account when limit is on, but only after call sites are patched */
@@ -468,6 +474,36 @@ static bool memcg_kmem_test_and_clear_dead(struct mem_cgroup *memcg)
 	return test_and_clear_bit(KMEM_ACCOUNTED_DEAD,
 				  &memcg->kmem_account_flags);
 }
+
+/*
+ * If the kernel limit is smaller than the user limit, we will have situations
+ * in which our allocations fail but freeing user pages will buy us nothing.
+ * In those, we would like to call a specialized memcg reclaimer that only
+ * frees kernel memory and leave the user memory alone.
+ *
+ * This test exists so we can differentiate between those. Everytime one of the
+ * limits is updated, we need to run it. The set_limit_mutex must be held, so
+ * they don't change again.
+ */
+static void memcg_update_shrink_status(struct mem_cgroup *memcg)
+{
+	mutex_lock(&set_limit_mutex);
+	if (res_counter_read_u64(&memcg->kmem, RES_LIMIT) <
+		res_counter_read_u64(&memcg->res, RES_LIMIT))
+		set_bit(KMEM_MAY_SHRINK, &memcg->kmem_account_flags);
+	else
+		clear_bit(KMEM_MAY_SHRINK, &memcg->kmem_account_flags);
+	mutex_unlock(&set_limit_mutex);
+}
+
+static bool memcg_kmem_should_shrink(struct mem_cgroup *memcg)
+{
+	return test_bit(KMEM_MAY_SHRINK, &memcg->kmem_account_flags);
+}
+#else
+static void memcg_update_shrink_status(struct mem_cgroup *memcg)
+{
+}
 #endif
 
 /* Stuffs for move charges at task migration. */
@@ -2882,8 +2918,6 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
 	memcg_check_events(memcg, page);
 }
 
-static DEFINE_MUTEX(set_limit_mutex);
-
 #ifdef CONFIG_MEMCG_KMEM
 static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
 {
@@ -2925,6 +2959,7 @@ static int mem_cgroup_slabinfo_read(struct cgroup *cont, struct cftype *cft,
 }
 #endif
 
+static int memcg_try_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size);
 static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
 {
 	struct res_counter *fail_res;
@@ -2932,7 +2967,7 @@ static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
 	int ret = 0;
 	bool may_oom;
 
-	ret = res_counter_charge(&memcg->kmem, size, &fail_res);
+	ret = memcg_try_charge_kmem(memcg, gfp, size);
 	if (ret)
 		return ret;
 
@@ -2973,6 +3008,25 @@ static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
 	return ret;
 }
 
+/*
+ * There might be situations in which there are plenty of objects to shrink,
+ * but we can't do it because the __GFP_FS flag is not set.  This is the case
+ * with almost all inode allocation. They do are, however, capable of waiting.
+ * So we can just span a worker, let it finish its job and proceed with the
+ * allocation. As slow as it is, at this point we are already past any hopes
+ * anyway.
+ */
+static void kmemcg_shrink_work_fn(struct work_struct *w)
+{
+	struct mem_cgroup *memcg;
+
+	memcg = container_of(w, struct mem_cgroup, kmemcg_shrink_work);
+
+	if (!try_to_free_mem_cgroup_kmem(memcg, GFP_KERNEL))
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
+}
+
+
 static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size)
 {
 	res_counter_uncharge(&memcg->res, size);
@@ -3049,6 +3103,7 @@ int memcg_update_cache_sizes(struct mem_cgroup *memcg)
 	memcg_update_array_size(num + 1);
 
 	INIT_LIST_HEAD(&memcg->memcg_slab_caches);
+	INIT_WORK(&memcg->kmemcg_shrink_work, kmemcg_shrink_work_fn);
 	mutex_init(&memcg->slab_caches_mutex);
 
 	return 0;
@@ -3319,6 +3374,36 @@ static inline void memcg_resume_kmem_account(void)
 	current->memcg_kmem_skip_account--;
 }
 
+static int memcg_try_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
+{
+	int retries = MEM_CGROUP_RECLAIM_RETRIES;
+	struct res_counter *fail_res;
+	int ret;
+
+	do {
+		ret = res_counter_charge(&memcg->kmem, size, &fail_res);
+		if (!ret)
+			return ret;
+
+		if (!memcg_kmem_should_shrink(memcg) || !(gfp & __GFP_WAIT))
+			return ret;
+
+		if (!(gfp & __GFP_FS)) {
+			/*
+			 * we are already short on memory, every queue
+			 * allocation is likely to fail
+			 */
+			memcg_stop_kmem_account();
+			schedule_work(&memcg->kmemcg_shrink_work);
+			memcg_resume_kmem_account();
+		} else if (!try_to_free_mem_cgroup_kmem(memcg, gfp))
+			congestion_wait(BLK_RW_ASYNC, HZ/10);
+
+	} while (retries--);
+
+	return ret;
+}
+
 static struct mem_cgroup *mem_cgroup_from_kmem_page(struct page *page)
 {
 	struct page_cgroup *pc;
@@ -5399,6 +5484,9 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
 			ret = memcg_update_kmem_limit(cont, val);
 		else
 			return -EINVAL;
+
+		if (!ret)
+			memcg_update_shrink_status(memcg);
 		break;
 	case RES_SOFT_LIMIT:
 		ret = res_counter_memparse_write_strategy(buffer, &val);
@@ -5411,6 +5499,8 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
 		 */
 		if (type == _MEM)
 			ret = res_counter_set_soft_limit(&memcg->res, val);
+		else if (type == _KMEM)
+			ret = res_counter_set_soft_limit(&memcg->kmem, val);
 		else
 			ret = -EINVAL;
 		break;
@@ -6178,6 +6268,12 @@ static struct cftype mem_cgroup_files[] = {
 		.read = mem_cgroup_read,
 	},
 	{
+		.name = "kmem.soft_limit_in_bytes",
+		.private = MEMFILE_PRIVATE(_KMEM, RES_SOFT_LIMIT),
+		.write_string = mem_cgroup_write,
+		.read = mem_cgroup_read,
+	},
+	{
 		.name = "kmem.usage_in_bytes",
 		.private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
 		.read = mem_cgroup_read,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8af0e2b..e4de27a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2499,7 +2499,42 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 
 	return nr_reclaimed;
 }
-#endif
+
+#ifdef CONFIG_MEMCG_KMEM
+/*
+ * This function is called when we are under kmem-specific pressure.  It will
+ * only trigger in environments with kmem.limit_in_bytes < limit_in_bytes, IOW,
+ * with a lower kmem allowance than the memory allowance.
+ *
+ * In this situation, freeing user pages from the cgroup won't do us any good.
+ * What we really need is to call the memcg-aware shrinkers, in the hope of
+ * freeing pages holding kmem objects. It may also be that we won't be able to
+ * free any pages, but will get rid of old objects opening up space for new
+ * ones.
+ */
+unsigned long try_to_free_mem_cgroup_kmem(struct mem_cgroup *memcg,
+					  gfp_t gfp_mask)
+{
+	struct shrink_control shrink = {
+		.gfp_mask = gfp_mask,
+		.target_mem_cgroup = memcg,
+	};
+
+	if (!(gfp_mask & __GFP_WAIT))
+		return 0;
+
+	nodes_setall(shrink.nodes_to_scan);
+
+	/*
+	 * We haven't scanned any user LRU, so we basically come up with
+	 * crafted values of nr_scanned and LRU page (1 and 0 respectively).
+	 * This should be enough to tell shrink_slab that the freeing
+	 * responsibility is all on himself.
+	 */
+	return shrink_slab(&shrink, 1, 0);
+}
+#endif /* CONFIG_MEMCG_KMEM */
+#endif /* CONFIG_MEMCG */
 
 static void age_active_anon(struct zone *zone, struct scan_control *sc)
 {
-- 
1.8.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* [PATCH 7/7] memcg: per-memcg kmem shrinking
@ 2013-02-08 13:07   ` Glauber Costa
  0 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-08 13:07 UTC (permalink / raw)
  To: linux-mm
  Cc: cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel, Glauber Costa,
	Dave Chinner, Mel Gorman, Rik van Riel, Hugh Dickins

If the kernel limit is smaller than the user limit, we will have
situations in which our allocations fail but freeing user pages will buy
us nothing.  In those, we would like to call a specialized memcg
reclaimer that only frees kernel memory and leave the user memory alone.
Those are also expected to fail when we account memcg->kmem, instead of
when we account memcg->res. Based on that, this patch implements a
memcg-specific reclaimer, that only shrinks kernel objects, withouth
touching user pages.

There might be situations in which there are plenty of objects to
shrink, but we can't do it because the __GFP_FS flag is not set.
Although they can happen with user pages, they are a lot more common
with fs-metadata: this is the case with almost all inode allocation.

Those allocations are, however, capable of waiting.  So we can just span
a worker, let it finish its job and proceed with the allocation. As slow
as it is, at this point we are already past any hopes anyway.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/swap.h |   2 +
 mm/memcontrol.c      | 102 +++++++++++++++++++++++++++++++++++++++++++++++++--
 mm/vmscan.c          |  37 ++++++++++++++++++-
 3 files changed, 137 insertions(+), 4 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 8c66486..ff74226 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -259,6 +259,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
 						  gfp_t gfp_mask, bool noswap);
+extern unsigned long try_to_free_mem_cgroup_kmem(struct mem_cgroup *mem,
+						 gfp_t gfp_mask);
 extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
 						gfp_t gfp_mask, bool noswap,
 						struct zone *zone,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bfb4b5b..7dc9ec1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -294,6 +294,8 @@ struct mem_cgroup {
 	atomic_t	numainfo_events;
 	atomic_t	numainfo_updating;
 #endif
+
+	struct work_struct kmemcg_shrink_work;
 	/*
 	 * Should the accounting and control be hierarchical, per subtree?
 	 */
@@ -376,6 +378,9 @@ struct mem_cgroup {
 #endif
 };
 
+
+static DEFINE_MUTEX(set_limit_mutex);
+
 #ifdef CONFIG_MEMCG_DEBUG_ASYNC_DESTROY
 static LIST_HEAD(dangling_memcgs);
 static DEFINE_MUTEX(dangling_memcgs_mutex);
@@ -430,6 +435,7 @@ enum {
 	KMEM_ACCOUNTED_ACTIVE = 0, /* accounted by this cgroup itself */
 	KMEM_ACCOUNTED_ACTIVATED, /* static key enabled. */
 	KMEM_ACCOUNTED_DEAD, /* dead memcg with pending kmem charges */
+	KMEM_MAY_SHRINK, /* kmem limit < mem limit, shrink kmem only */
 };
 
 /* We account when limit is on, but only after call sites are patched */
@@ -468,6 +474,36 @@ static bool memcg_kmem_test_and_clear_dead(struct mem_cgroup *memcg)
 	return test_and_clear_bit(KMEM_ACCOUNTED_DEAD,
 				  &memcg->kmem_account_flags);
 }
+
+/*
+ * If the kernel limit is smaller than the user limit, we will have situations
+ * in which our allocations fail but freeing user pages will buy us nothing.
+ * In those, we would like to call a specialized memcg reclaimer that only
+ * frees kernel memory and leave the user memory alone.
+ *
+ * This test exists so we can differentiate between those. Everytime one of the
+ * limits is updated, we need to run it. The set_limit_mutex must be held, so
+ * they don't change again.
+ */
+static void memcg_update_shrink_status(struct mem_cgroup *memcg)
+{
+	mutex_lock(&set_limit_mutex);
+	if (res_counter_read_u64(&memcg->kmem, RES_LIMIT) <
+		res_counter_read_u64(&memcg->res, RES_LIMIT))
+		set_bit(KMEM_MAY_SHRINK, &memcg->kmem_account_flags);
+	else
+		clear_bit(KMEM_MAY_SHRINK, &memcg->kmem_account_flags);
+	mutex_unlock(&set_limit_mutex);
+}
+
+static bool memcg_kmem_should_shrink(struct mem_cgroup *memcg)
+{
+	return test_bit(KMEM_MAY_SHRINK, &memcg->kmem_account_flags);
+}
+#else
+static void memcg_update_shrink_status(struct mem_cgroup *memcg)
+{
+}
 #endif
 
 /* Stuffs for move charges at task migration. */
@@ -2882,8 +2918,6 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
 	memcg_check_events(memcg, page);
 }
 
-static DEFINE_MUTEX(set_limit_mutex);
-
 #ifdef CONFIG_MEMCG_KMEM
 static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
 {
@@ -2925,6 +2959,7 @@ static int mem_cgroup_slabinfo_read(struct cgroup *cont, struct cftype *cft,
 }
 #endif
 
+static int memcg_try_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size);
 static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
 {
 	struct res_counter *fail_res;
@@ -2932,7 +2967,7 @@ static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
 	int ret = 0;
 	bool may_oom;
 
-	ret = res_counter_charge(&memcg->kmem, size, &fail_res);
+	ret = memcg_try_charge_kmem(memcg, gfp, size);
 	if (ret)
 		return ret;
 
@@ -2973,6 +3008,25 @@ static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
 	return ret;
 }
 
+/*
+ * There might be situations in which there are plenty of objects to shrink,
+ * but we can't do it because the __GFP_FS flag is not set.  This is the case
+ * with almost all inode allocation. They do are, however, capable of waiting.
+ * So we can just span a worker, let it finish its job and proceed with the
+ * allocation. As slow as it is, at this point we are already past any hopes
+ * anyway.
+ */
+static void kmemcg_shrink_work_fn(struct work_struct *w)
+{
+	struct mem_cgroup *memcg;
+
+	memcg = container_of(w, struct mem_cgroup, kmemcg_shrink_work);
+
+	if (!try_to_free_mem_cgroup_kmem(memcg, GFP_KERNEL))
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
+}
+
+
 static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size)
 {
 	res_counter_uncharge(&memcg->res, size);
@@ -3049,6 +3103,7 @@ int memcg_update_cache_sizes(struct mem_cgroup *memcg)
 	memcg_update_array_size(num + 1);
 
 	INIT_LIST_HEAD(&memcg->memcg_slab_caches);
+	INIT_WORK(&memcg->kmemcg_shrink_work, kmemcg_shrink_work_fn);
 	mutex_init(&memcg->slab_caches_mutex);
 
 	return 0;
@@ -3319,6 +3374,36 @@ static inline void memcg_resume_kmem_account(void)
 	current->memcg_kmem_skip_account--;
 }
 
+static int memcg_try_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
+{
+	int retries = MEM_CGROUP_RECLAIM_RETRIES;
+	struct res_counter *fail_res;
+	int ret;
+
+	do {
+		ret = res_counter_charge(&memcg->kmem, size, &fail_res);
+		if (!ret)
+			return ret;
+
+		if (!memcg_kmem_should_shrink(memcg) || !(gfp & __GFP_WAIT))
+			return ret;
+
+		if (!(gfp & __GFP_FS)) {
+			/*
+			 * we are already short on memory, every queue
+			 * allocation is likely to fail
+			 */
+			memcg_stop_kmem_account();
+			schedule_work(&memcg->kmemcg_shrink_work);
+			memcg_resume_kmem_account();
+		} else if (!try_to_free_mem_cgroup_kmem(memcg, gfp))
+			congestion_wait(BLK_RW_ASYNC, HZ/10);
+
+	} while (retries--);
+
+	return ret;
+}
+
 static struct mem_cgroup *mem_cgroup_from_kmem_page(struct page *page)
 {
 	struct page_cgroup *pc;
@@ -5399,6 +5484,9 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
 			ret = memcg_update_kmem_limit(cont, val);
 		else
 			return -EINVAL;
+
+		if (!ret)
+			memcg_update_shrink_status(memcg);
 		break;
 	case RES_SOFT_LIMIT:
 		ret = res_counter_memparse_write_strategy(buffer, &val);
@@ -5411,6 +5499,8 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
 		 */
 		if (type == _MEM)
 			ret = res_counter_set_soft_limit(&memcg->res, val);
+		else if (type == _KMEM)
+			ret = res_counter_set_soft_limit(&memcg->kmem, val);
 		else
 			ret = -EINVAL;
 		break;
@@ -6178,6 +6268,12 @@ static struct cftype mem_cgroup_files[] = {
 		.read = mem_cgroup_read,
 	},
 	{
+		.name = "kmem.soft_limit_in_bytes",
+		.private = MEMFILE_PRIVATE(_KMEM, RES_SOFT_LIMIT),
+		.write_string = mem_cgroup_write,
+		.read = mem_cgroup_read,
+	},
+	{
 		.name = "kmem.usage_in_bytes",
 		.private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
 		.read = mem_cgroup_read,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8af0e2b..e4de27a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2499,7 +2499,42 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 
 	return nr_reclaimed;
 }
-#endif
+
+#ifdef CONFIG_MEMCG_KMEM
+/*
+ * This function is called when we are under kmem-specific pressure.  It will
+ * only trigger in environments with kmem.limit_in_bytes < limit_in_bytes, IOW,
+ * with a lower kmem allowance than the memory allowance.
+ *
+ * In this situation, freeing user pages from the cgroup won't do us any good.
+ * What we really need is to call the memcg-aware shrinkers, in the hope of
+ * freeing pages holding kmem objects. It may also be that we won't be able to
+ * free any pages, but will get rid of old objects opening up space for new
+ * ones.
+ */
+unsigned long try_to_free_mem_cgroup_kmem(struct mem_cgroup *memcg,
+					  gfp_t gfp_mask)
+{
+	struct shrink_control shrink = {
+		.gfp_mask = gfp_mask,
+		.target_mem_cgroup = memcg,
+	};
+
+	if (!(gfp_mask & __GFP_WAIT))
+		return 0;
+
+	nodes_setall(shrink.nodes_to_scan);
+
+	/*
+	 * We haven't scanned any user LRU, so we basically come up with
+	 * crafted values of nr_scanned and LRU page (1 and 0 respectively).
+	 * This should be enough to tell shrink_slab that the freeing
+	 * responsibility is all on himself.
+	 */
+	return shrink_slab(&shrink, 1, 0);
+}
+#endif /* CONFIG_MEMCG_KMEM */
+#endif /* CONFIG_MEMCG */
 
 static void age_active_anon(struct zone *zone, struct scan_control *sc)
 {
-- 
1.8.1


^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/7] vmscan: also shrink slab in memcg pressure
  2013-02-08 13:07   ` Glauber Costa
  (?)
@ 2013-02-15  1:27     ` Greg Thelen
  -1 siblings, 0 replies; 60+ messages in thread
From: Greg Thelen @ 2013-02-15  1:27 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel, Dave Chinner,
	Mel Gorman, Rik van Riel, Hugh Dickins

On Fri, Feb 08 2013, Glauber Costa wrote:

> Without the surrounding infrastructure, this patch is a bit of a hammer:
> it will basically shrink objects from all memcgs under memcg pressure.
> At least, however, we will keep the scan limited to the shrinkers marked
> as per-memcg.
>
> Future patches will implement the in-shrinker logic to filter objects
> based on its memcg association.
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Cc: Dave Chinner <dchinner@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
>  include/linux/memcontrol.h | 16 ++++++++++++++++
>  include/linux/shrinker.h   |  4 ++++
>  mm/memcontrol.c            | 11 ++++++++++-
>  mm/vmscan.c                | 41 ++++++++++++++++++++++++++++++++++++++---
>  4 files changed, 68 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 0108a56..b7de557 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -200,6 +200,9 @@ void mem_cgroup_split_huge_fixup(struct page *head);
>  bool mem_cgroup_bad_page_check(struct page *page);
>  void mem_cgroup_print_bad_page(struct page *page);
>  #endif
> +
> +unsigned long
> +memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone);
>  #else /* CONFIG_MEMCG */
>  struct mem_cgroup;
>  
> @@ -384,6 +387,11 @@ static inline void mem_cgroup_replace_page_cache(struct page *oldpage,
>  				struct page *newpage)
>  {
>  }
> +
> +static inline unsigned long
> +memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone)
> +{

	return 0;

> +}
>  #endif /* CONFIG_MEMCG */
>  
>  #if !defined(CONFIG_MEMCG) || !defined(CONFIG_DEBUG_VM)
> @@ -436,6 +444,8 @@ static inline bool memcg_kmem_enabled(void)
>  	return static_key_false(&memcg_kmem_enabled_key);
>  }
>  
> +bool memcg_kmem_is_active(struct mem_cgroup *memcg);
> +
>  /*
>   * In general, we'll do everything in our power to not incur in any overhead
>   * for non-memcg users for the kmem functions. Not even a function call, if we
> @@ -569,6 +579,12 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
>  	return __memcg_kmem_get_cache(cachep, gfp);
>  }
>  #else
> +
> +static inline bool memcg_kmem_is_active(struct mem_cgroup *memcg)
> +{
> +	return false;
> +}
> +
>  #define for_each_memcg_cache_index(_idx)	\
>  	for (; NULL; )
>  
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index d4636a0..a767f2e 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -20,6 +20,9 @@ struct shrink_control {
>  
>  	/* shrink from these nodes */
>  	nodemask_t nodes_to_scan;
> +
> +	/* reclaim from this memcg only (if not NULL) */
> +	struct mem_cgroup *target_mem_cgroup;
>  };
>  
>  /*
> @@ -45,6 +48,7 @@ struct shrinker {
>  
>  	int seeks;	/* seeks to recreate an obj */
>  	long batch;	/* reclaim batch size, 0 = default */
> +	bool memcg_shrinker;
>  
>  	/* These are for internal use */
>  	struct list_head list;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 3817460..b1d4dfa 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -442,7 +442,7 @@ static inline void memcg_kmem_set_active(struct mem_cgroup *memcg)
>  	set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
>  }
>  
> -static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
> +bool memcg_kmem_is_active(struct mem_cgroup *memcg)
>  {
>  	return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
>  }
> @@ -991,6 +991,15 @@ mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg, int nid, int zid,
>  	return ret;
>  }
>  
> +unsigned long
> +memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone)
> +{
> +	int nid = zone_to_nid(zone);
> +	int zid = zone_idx(zone);
> +
> +	return mem_cgroup_zone_nr_lru_pages(memcg, nid, zid, LRU_ALL);

Without swap enabled it seems like LRU_ALL_FILE is more appropriate.
Maybe something like test_mem_cgroup_node_reclaimable().

> +}
> +
>  static unsigned long
>  mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
>  			int nid, unsigned int lru_mask)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 6d96280..8af0e2b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -138,11 +138,42 @@ static bool global_reclaim(struct scan_control *sc)
>  {
>  	return !sc->target_mem_cgroup;
>  }
> +
> +/*
> + * kmem reclaim should usually not be triggered when we are doing targetted
> + * reclaim. It is only valid when global reclaim is triggered, or when the
> + * underlying memcg has kmem objects.
> + */
> +static bool has_kmem_reclaim(struct scan_control *sc)
> +{
> +	return !sc->target_mem_cgroup ||
> +	(sc->target_mem_cgroup && memcg_kmem_is_active(sc->target_mem_cgroup));

Isn't this the same as:
	return !sc->target_mem_cgroup ||
		memcg_kmem_is_active(sc->target_mem_cgroup);

[...]

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/7] vmscan: also shrink slab in memcg pressure
@ 2013-02-15  1:27     ` Greg Thelen
  0 siblings, 0 replies; 60+ messages in thread
From: Greg Thelen @ 2013-02-15  1:27 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel, Dave Chinner,
	Mel Gorman, Rik van Riel, Hugh Dickins

On Fri, Feb 08 2013, Glauber Costa wrote:

> Without the surrounding infrastructure, this patch is a bit of a hammer:
> it will basically shrink objects from all memcgs under memcg pressure.
> At least, however, we will keep the scan limited to the shrinkers marked
> as per-memcg.
>
> Future patches will implement the in-shrinker logic to filter objects
> based on its memcg association.
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Cc: Dave Chinner <dchinner@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
>  include/linux/memcontrol.h | 16 ++++++++++++++++
>  include/linux/shrinker.h   |  4 ++++
>  mm/memcontrol.c            | 11 ++++++++++-
>  mm/vmscan.c                | 41 ++++++++++++++++++++++++++++++++++++++---
>  4 files changed, 68 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 0108a56..b7de557 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -200,6 +200,9 @@ void mem_cgroup_split_huge_fixup(struct page *head);
>  bool mem_cgroup_bad_page_check(struct page *page);
>  void mem_cgroup_print_bad_page(struct page *page);
>  #endif
> +
> +unsigned long
> +memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone);
>  #else /* CONFIG_MEMCG */
>  struct mem_cgroup;
>  
> @@ -384,6 +387,11 @@ static inline void mem_cgroup_replace_page_cache(struct page *oldpage,
>  				struct page *newpage)
>  {
>  }
> +
> +static inline unsigned long
> +memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone)
> +{

	return 0;

> +}
>  #endif /* CONFIG_MEMCG */
>  
>  #if !defined(CONFIG_MEMCG) || !defined(CONFIG_DEBUG_VM)
> @@ -436,6 +444,8 @@ static inline bool memcg_kmem_enabled(void)
>  	return static_key_false(&memcg_kmem_enabled_key);
>  }
>  
> +bool memcg_kmem_is_active(struct mem_cgroup *memcg);
> +
>  /*
>   * In general, we'll do everything in our power to not incur in any overhead
>   * for non-memcg users for the kmem functions. Not even a function call, if we
> @@ -569,6 +579,12 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
>  	return __memcg_kmem_get_cache(cachep, gfp);
>  }
>  #else
> +
> +static inline bool memcg_kmem_is_active(struct mem_cgroup *memcg)
> +{
> +	return false;
> +}
> +
>  #define for_each_memcg_cache_index(_idx)	\
>  	for (; NULL; )
>  
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index d4636a0..a767f2e 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -20,6 +20,9 @@ struct shrink_control {
>  
>  	/* shrink from these nodes */
>  	nodemask_t nodes_to_scan;
> +
> +	/* reclaim from this memcg only (if not NULL) */
> +	struct mem_cgroup *target_mem_cgroup;
>  };
>  
>  /*
> @@ -45,6 +48,7 @@ struct shrinker {
>  
>  	int seeks;	/* seeks to recreate an obj */
>  	long batch;	/* reclaim batch size, 0 = default */
> +	bool memcg_shrinker;
>  
>  	/* These are for internal use */
>  	struct list_head list;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 3817460..b1d4dfa 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -442,7 +442,7 @@ static inline void memcg_kmem_set_active(struct mem_cgroup *memcg)
>  	set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
>  }
>  
> -static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
> +bool memcg_kmem_is_active(struct mem_cgroup *memcg)
>  {
>  	return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
>  }
> @@ -991,6 +991,15 @@ mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg, int nid, int zid,
>  	return ret;
>  }
>  
> +unsigned long
> +memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone)
> +{
> +	int nid = zone_to_nid(zone);
> +	int zid = zone_idx(zone);
> +
> +	return mem_cgroup_zone_nr_lru_pages(memcg, nid, zid, LRU_ALL);

Without swap enabled it seems like LRU_ALL_FILE is more appropriate.
Maybe something like test_mem_cgroup_node_reclaimable().

> +}
> +
>  static unsigned long
>  mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
>  			int nid, unsigned int lru_mask)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 6d96280..8af0e2b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -138,11 +138,42 @@ static bool global_reclaim(struct scan_control *sc)
>  {
>  	return !sc->target_mem_cgroup;
>  }
> +
> +/*
> + * kmem reclaim should usually not be triggered when we are doing targetted
> + * reclaim. It is only valid when global reclaim is triggered, or when the
> + * underlying memcg has kmem objects.
> + */
> +static bool has_kmem_reclaim(struct scan_control *sc)
> +{
> +	return !sc->target_mem_cgroup ||
> +	(sc->target_mem_cgroup && memcg_kmem_is_active(sc->target_mem_cgroup));

Isn't this the same as:
	return !sc->target_mem_cgroup ||
		memcg_kmem_is_active(sc->target_mem_cgroup);

[...]

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/7] vmscan: also shrink slab in memcg pressure
@ 2013-02-15  1:27     ` Greg Thelen
  0 siblings, 0 replies; 60+ messages in thread
From: Greg Thelen @ 2013-02-15  1:27 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel, Dave Chinner,
	Mel Gorman, Rik van Riel, Hugh Dickins

On Fri, Feb 08 2013, Glauber Costa wrote:

> Without the surrounding infrastructure, this patch is a bit of a hammer:
> it will basically shrink objects from all memcgs under memcg pressure.
> At least, however, we will keep the scan limited to the shrinkers marked
> as per-memcg.
>
> Future patches will implement the in-shrinker logic to filter objects
> based on its memcg association.
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Cc: Dave Chinner <dchinner@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
>  include/linux/memcontrol.h | 16 ++++++++++++++++
>  include/linux/shrinker.h   |  4 ++++
>  mm/memcontrol.c            | 11 ++++++++++-
>  mm/vmscan.c                | 41 ++++++++++++++++++++++++++++++++++++++---
>  4 files changed, 68 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 0108a56..b7de557 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -200,6 +200,9 @@ void mem_cgroup_split_huge_fixup(struct page *head);
>  bool mem_cgroup_bad_page_check(struct page *page);
>  void mem_cgroup_print_bad_page(struct page *page);
>  #endif
> +
> +unsigned long
> +memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone);
>  #else /* CONFIG_MEMCG */
>  struct mem_cgroup;
>  
> @@ -384,6 +387,11 @@ static inline void mem_cgroup_replace_page_cache(struct page *oldpage,
>  				struct page *newpage)
>  {
>  }
> +
> +static inline unsigned long
> +memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone)
> +{

	return 0;

> +}
>  #endif /* CONFIG_MEMCG */
>  
>  #if !defined(CONFIG_MEMCG) || !defined(CONFIG_DEBUG_VM)
> @@ -436,6 +444,8 @@ static inline bool memcg_kmem_enabled(void)
>  	return static_key_false(&memcg_kmem_enabled_key);
>  }
>  
> +bool memcg_kmem_is_active(struct mem_cgroup *memcg);
> +
>  /*
>   * In general, we'll do everything in our power to not incur in any overhead
>   * for non-memcg users for the kmem functions. Not even a function call, if we
> @@ -569,6 +579,12 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
>  	return __memcg_kmem_get_cache(cachep, gfp);
>  }
>  #else
> +
> +static inline bool memcg_kmem_is_active(struct mem_cgroup *memcg)
> +{
> +	return false;
> +}
> +
>  #define for_each_memcg_cache_index(_idx)	\
>  	for (; NULL; )
>  
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index d4636a0..a767f2e 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -20,6 +20,9 @@ struct shrink_control {
>  
>  	/* shrink from these nodes */
>  	nodemask_t nodes_to_scan;
> +
> +	/* reclaim from this memcg only (if not NULL) */
> +	struct mem_cgroup *target_mem_cgroup;
>  };
>  
>  /*
> @@ -45,6 +48,7 @@ struct shrinker {
>  
>  	int seeks;	/* seeks to recreate an obj */
>  	long batch;	/* reclaim batch size, 0 = default */
> +	bool memcg_shrinker;
>  
>  	/* These are for internal use */
>  	struct list_head list;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 3817460..b1d4dfa 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -442,7 +442,7 @@ static inline void memcg_kmem_set_active(struct mem_cgroup *memcg)
>  	set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
>  }
>  
> -static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
> +bool memcg_kmem_is_active(struct mem_cgroup *memcg)
>  {
>  	return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
>  }
> @@ -991,6 +991,15 @@ mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg, int nid, int zid,
>  	return ret;
>  }
>  
> +unsigned long
> +memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone)
> +{
> +	int nid = zone_to_nid(zone);
> +	int zid = zone_idx(zone);
> +
> +	return mem_cgroup_zone_nr_lru_pages(memcg, nid, zid, LRU_ALL);

Without swap enabled it seems like LRU_ALL_FILE is more appropriate.
Maybe something like test_mem_cgroup_node_reclaimable().

> +}
> +
>  static unsigned long
>  mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
>  			int nid, unsigned int lru_mask)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 6d96280..8af0e2b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -138,11 +138,42 @@ static bool global_reclaim(struct scan_control *sc)
>  {
>  	return !sc->target_mem_cgroup;
>  }
> +
> +/*
> + * kmem reclaim should usually not be triggered when we are doing targetted
> + * reclaim. It is only valid when global reclaim is triggered, or when the
> + * underlying memcg has kmem objects.
> + */
> +static bool has_kmem_reclaim(struct scan_control *sc)
> +{
> +	return !sc->target_mem_cgroup ||
> +	(sc->target_mem_cgroup && memcg_kmem_is_active(sc->target_mem_cgroup));

Isn't this the same as:
	return !sc->target_mem_cgroup ||
		memcg_kmem_is_active(sc->target_mem_cgroup);

[...]

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 0/7] memcg targeted shrinking
  2013-02-08 13:07 ` Glauber Costa
  (?)
@ 2013-02-15  1:28     ` Greg Thelen
  -1 siblings, 0 replies; 60+ messages in thread
From: Greg Thelen @ 2013-02-15  1:28 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Dave Shrinnker,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

On Fri, Feb 08 2013, Glauber Costa wrote:

> This patchset implements targeted shrinking for memcg when kmem limits are
> present. So far, we've been accounting kernel objects but failing allocations
> when short of memory. This is because our only option would be to call the
> global shrinker, depleting objects from all caches and breaking isolation.
>
> This patchset builds upon the recent work from David Chinner
> (http://oss.sgi.com/archives/xfs/2012-11/msg00643.html) to implement NUMA
> aware per-node LRUs. I build heavily on its API, and its presence is implied.
>
> The main idea is to associate per-memcg lists with each of the LRUs. The main
> LRU still provides a single entry point and when adding or removing an element
> from the LRU, we use the page information to figure out which memcg it belongs
> to and relay it to the right list.
>
> This patchset is still not perfect, and some uses cases still need to be
> dealt with. But I wanted to get this out in the open sooner rather than
> later. In particular, I have the following (noncomprehensive) todo list:
>
> TODO:
> * shrink dead memcgs when global pressure kicks in.
> * balance global reclaim among memcgs.
> * improve testing and reliability (I am still seeing some stalls in some cases)

Do you have a git tree with these changes so I can see Dave's numa LRUs
plus these changes?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 0/7] memcg targeted shrinking
@ 2013-02-15  1:28     ` Greg Thelen
  0 siblings, 0 replies; 60+ messages in thread
From: Greg Thelen @ 2013-02-15  1:28 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel

On Fri, Feb 08 2013, Glauber Costa wrote:

> This patchset implements targeted shrinking for memcg when kmem limits are
> present. So far, we've been accounting kernel objects but failing allocations
> when short of memory. This is because our only option would be to call the
> global shrinker, depleting objects from all caches and breaking isolation.
>
> This patchset builds upon the recent work from David Chinner
> (http://oss.sgi.com/archives/xfs/2012-11/msg00643.html) to implement NUMA
> aware per-node LRUs. I build heavily on its API, and its presence is implied.
>
> The main idea is to associate per-memcg lists with each of the LRUs. The main
> LRU still provides a single entry point and when adding or removing an element
> from the LRU, we use the page information to figure out which memcg it belongs
> to and relay it to the right list.
>
> This patchset is still not perfect, and some uses cases still need to be
> dealt with. But I wanted to get this out in the open sooner rather than
> later. In particular, I have the following (noncomprehensive) todo list:
>
> TODO:
> * shrink dead memcgs when global pressure kicks in.
> * balance global reclaim among memcgs.
> * improve testing and reliability (I am still seeing some stalls in some cases)

Do you have a git tree with these changes so I can see Dave's numa LRUs
plus these changes?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 0/7] memcg targeted shrinking
@ 2013-02-15  1:28     ` Greg Thelen
  0 siblings, 0 replies; 60+ messages in thread
From: Greg Thelen @ 2013-02-15  1:28 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Dave Shrinnker,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

On Fri, Feb 08 2013, Glauber Costa wrote:

> This patchset implements targeted shrinking for memcg when kmem limits are
> present. So far, we've been accounting kernel objects but failing allocations
> when short of memory. This is because our only option would be to call the
> global shrinker, depleting objects from all caches and breaking isolation.
>
> This patchset builds upon the recent work from David Chinner
> (http://oss.sgi.com/archives/xfs/2012-11/msg00643.html) to implement NUMA
> aware per-node LRUs. I build heavily on its API, and its presence is implied.
>
> The main idea is to associate per-memcg lists with each of the LRUs. The main
> LRU still provides a single entry point and when adding or removing an element
> from the LRU, we use the page information to figure out which memcg it belongs
> to and relay it to the right list.
>
> This patchset is still not perfect, and some uses cases still need to be
> dealt with. But I wanted to get this out in the open sooner rather than
> later. In particular, I have the following (noncomprehensive) todo list:
>
> TODO:
> * shrink dead memcgs when global pressure kicks in.
> * balance global reclaim among memcgs.
> * improve testing and reliability (I am still seeing some stalls in some cases)

Do you have a git tree with these changes so I can see Dave's numa LRUs
plus these changes?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/7] memcg,list_lru: duplicate LRUs upon kmemcg creation
  2013-02-08 13:07   ` Glauber Costa
  (?)
@ 2013-02-15  1:31     ` Greg Thelen
  -1 siblings, 0 replies; 60+ messages in thread
From: Greg Thelen @ 2013-02-15  1:31 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel, Dave Chinner,
	Mel Gorman, Rik van Riel, Hugh Dickins

On Fri, Feb 08 2013, Glauber Costa wrote:

> When a new memcg is created, we need to open up room for its descriptors
> in all of the list_lrus that are marked per-memcg. The process is quite
> similar to the one we are using for the kmem caches: we initialize the
> new structures in an array indexed by kmemcg_id, and grow the array if
> needed. Key data like the size of the array will be shared between the
> kmem cache code and the list_lru code (they basically describe the same
> thing)
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Cc: Dave Chinner <dchinner@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
>  include/linux/list_lru.h   |  47 +++++++++++++++++
>  include/linux/memcontrol.h |   6 +++
>  lib/list_lru.c             | 115 +++++++++++++++++++++++++++++++++++++---
>  mm/memcontrol.c            | 128 ++++++++++++++++++++++++++++++++++++++++++---
>  mm/slab_common.c           |   1 -
>  5 files changed, 283 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
> index 02796da..370b989 100644
> --- a/include/linux/list_lru.h
> +++ b/include/linux/list_lru.h
> @@ -16,11 +16,58 @@ struct list_lru_node {
>  	long			nr_items;
>  } ____cacheline_aligned_in_smp;
>  
> +struct list_lru_array {
> +	struct list_lru_node node[1];
> +};
> +
>  struct list_lru {
> +	struct list_head	lrus;
>  	struct list_lru_node	node[MAX_NUMNODES];
>  	nodemask_t		active_nodes;
> +#ifdef CONFIG_MEMCG_KMEM
> +	struct list_lru_array	**memcg_lrus;

Probably need a comment regarding that 0x1 is a magic value and
describing what indexes this lazily constructed array.  Is the primary
index memcg_kmem_id and the secondary index a nid?

> +#endif
>  };
>  
> +struct mem_cgroup;
> +#ifdef CONFIG_MEMCG_KMEM
> +/*
> + * We will reuse the last bit of the pointer to tell the lru subsystem that
> + * this particular lru should be replicated when a memcg comes in.
> + */

>From this patch it seems like 0x1 is a magic value rather than bit 0
being special.  memcg_lrus is either 0x1 or a pointer to an array of
struct list_lru_array.  The array is indexed by memcg_kmem_id.

> +static inline void lru_memcg_enable(struct list_lru *lru)

This function is not called yet.  Hmm.

> +{
> +	lru->memcg_lrus = (void *)0x1ULL;
> +}
> +
> +/*
> + * This will return true if we have already allocated and assignment a memcg
> + * pointer set to the LRU. Therefore, we need to mask the first bit out
> + */
> +static inline bool lru_memcg_is_assigned(struct list_lru *lru)
> +{
> +	return (unsigned long)lru->memcg_lrus & ~0x1ULL;

Is this equivalent to?
	return lru->memcg_lrus != NULL && lru->memcg_lrus != 0x1

> +}
> +

[...]

/* comment the meaning of "num" */
> +int memcg_update_all_lrus(unsigned long num)
> +{
> +	int ret = 0;
> +	struct list_lru *lru;
> +

[...]

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/7] memcg,list_lru: duplicate LRUs upon kmemcg creation
@ 2013-02-15  1:31     ` Greg Thelen
  0 siblings, 0 replies; 60+ messages in thread
From: Greg Thelen @ 2013-02-15  1:31 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel, Dave Chinner,
	Mel Gorman, Rik van Riel, Hugh Dickins

On Fri, Feb 08 2013, Glauber Costa wrote:

> When a new memcg is created, we need to open up room for its descriptors
> in all of the list_lrus that are marked per-memcg. The process is quite
> similar to the one we are using for the kmem caches: we initialize the
> new structures in an array indexed by kmemcg_id, and grow the array if
> needed. Key data like the size of the array will be shared between the
> kmem cache code and the list_lru code (they basically describe the same
> thing)
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Cc: Dave Chinner <dchinner@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
>  include/linux/list_lru.h   |  47 +++++++++++++++++
>  include/linux/memcontrol.h |   6 +++
>  lib/list_lru.c             | 115 +++++++++++++++++++++++++++++++++++++---
>  mm/memcontrol.c            | 128 ++++++++++++++++++++++++++++++++++++++++++---
>  mm/slab_common.c           |   1 -
>  5 files changed, 283 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
> index 02796da..370b989 100644
> --- a/include/linux/list_lru.h
> +++ b/include/linux/list_lru.h
> @@ -16,11 +16,58 @@ struct list_lru_node {
>  	long			nr_items;
>  } ____cacheline_aligned_in_smp;
>  
> +struct list_lru_array {
> +	struct list_lru_node node[1];
> +};
> +
>  struct list_lru {
> +	struct list_head	lrus;
>  	struct list_lru_node	node[MAX_NUMNODES];
>  	nodemask_t		active_nodes;
> +#ifdef CONFIG_MEMCG_KMEM
> +	struct list_lru_array	**memcg_lrus;

Probably need a comment regarding that 0x1 is a magic value and
describing what indexes this lazily constructed array.  Is the primary
index memcg_kmem_id and the secondary index a nid?

> +#endif
>  };
>  
> +struct mem_cgroup;
> +#ifdef CONFIG_MEMCG_KMEM
> +/*
> + * We will reuse the last bit of the pointer to tell the lru subsystem that
> + * this particular lru should be replicated when a memcg comes in.
> + */

>From this patch it seems like 0x1 is a magic value rather than bit 0
being special.  memcg_lrus is either 0x1 or a pointer to an array of
struct list_lru_array.  The array is indexed by memcg_kmem_id.

> +static inline void lru_memcg_enable(struct list_lru *lru)

This function is not called yet.  Hmm.

> +{
> +	lru->memcg_lrus = (void *)0x1ULL;
> +}
> +
> +/*
> + * This will return true if we have already allocated and assignment a memcg
> + * pointer set to the LRU. Therefore, we need to mask the first bit out
> + */
> +static inline bool lru_memcg_is_assigned(struct list_lru *lru)
> +{
> +	return (unsigned long)lru->memcg_lrus & ~0x1ULL;

Is this equivalent to?
	return lru->memcg_lrus != NULL && lru->memcg_lrus != 0x1

> +}
> +

[...]

/* comment the meaning of "num" */
> +int memcg_update_all_lrus(unsigned long num)
> +{
> +	int ret = 0;
> +	struct list_lru *lru;
> +

[...]

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/7] memcg,list_lru: duplicate LRUs upon kmemcg creation
@ 2013-02-15  1:31     ` Greg Thelen
  0 siblings, 0 replies; 60+ messages in thread
From: Greg Thelen @ 2013-02-15  1:31 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel, Dave Chinner,
	Mel Gorman, Rik van Riel, Hugh Dickins

On Fri, Feb 08 2013, Glauber Costa wrote:

> When a new memcg is created, we need to open up room for its descriptors
> in all of the list_lrus that are marked per-memcg. The process is quite
> similar to the one we are using for the kmem caches: we initialize the
> new structures in an array indexed by kmemcg_id, and grow the array if
> needed. Key data like the size of the array will be shared between the
> kmem cache code and the list_lru code (they basically describe the same
> thing)
>
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Cc: Dave Chinner <dchinner@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
>  include/linux/list_lru.h   |  47 +++++++++++++++++
>  include/linux/memcontrol.h |   6 +++
>  lib/list_lru.c             | 115 +++++++++++++++++++++++++++++++++++++---
>  mm/memcontrol.c            | 128 ++++++++++++++++++++++++++++++++++++++++++---
>  mm/slab_common.c           |   1 -
>  5 files changed, 283 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
> index 02796da..370b989 100644
> --- a/include/linux/list_lru.h
> +++ b/include/linux/list_lru.h
> @@ -16,11 +16,58 @@ struct list_lru_node {
>  	long			nr_items;
>  } ____cacheline_aligned_in_smp;
>  
> +struct list_lru_array {
> +	struct list_lru_node node[1];
> +};
> +
>  struct list_lru {
> +	struct list_head	lrus;
>  	struct list_lru_node	node[MAX_NUMNODES];
>  	nodemask_t		active_nodes;
> +#ifdef CONFIG_MEMCG_KMEM
> +	struct list_lru_array	**memcg_lrus;

Probably need a comment regarding that 0x1 is a magic value and
describing what indexes this lazily constructed array.  Is the primary
index memcg_kmem_id and the secondary index a nid?

> +#endif
>  };
>  
> +struct mem_cgroup;
> +#ifdef CONFIG_MEMCG_KMEM
> +/*
> + * We will reuse the last bit of the pointer to tell the lru subsystem that
> + * this particular lru should be replicated when a memcg comes in.
> + */

From this patch it seems like 0x1 is a magic value rather than bit 0
being special.  memcg_lrus is either 0x1 or a pointer to an array of
struct list_lru_array.  The array is indexed by memcg_kmem_id.

> +static inline void lru_memcg_enable(struct list_lru *lru)

This function is not called yet.  Hmm.

> +{
> +	lru->memcg_lrus = (void *)0x1ULL;
> +}
> +
> +/*
> + * This will return true if we have already allocated and assignment a memcg
> + * pointer set to the LRU. Therefore, we need to mask the first bit out
> + */
> +static inline bool lru_memcg_is_assigned(struct list_lru *lru)
> +{
> +	return (unsigned long)lru->memcg_lrus & ~0x1ULL;

Is this equivalent to?
	return lru->memcg_lrus != NULL && lru->memcg_lrus != 0x1

> +}
> +

[...]

/* comment the meaning of "num" */
> +int memcg_update_all_lrus(unsigned long num)
> +{
> +	int ret = 0;
> +	struct list_lru *lru;
> +

[...]

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/7] lru: add an element to a memcg list
  2013-02-08 13:07   ` Glauber Costa
@ 2013-02-15  1:32     ` Greg Thelen
  -1 siblings, 0 replies; 60+ messages in thread
From: Greg Thelen @ 2013-02-15  1:32 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel, Dave Chinner,
	Mel Gorman, Rik van Riel, Hugh Dickins

On Fri, Feb 08 2013, Glauber Costa wrote:

> With the infrastructure we now have, we can add an element to a memcg
> LRU list instead of the global list. The memcg lists are still
> per-node.

[...]

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b9e1941..bfb4b5b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3319,6 +3319,36 @@ static inline void memcg_resume_kmem_account(void)
>  	current->memcg_kmem_skip_account--;
>  }
>  
> +static struct mem_cgroup *mem_cgroup_from_kmem_page(struct page *page)
> +{
> +	struct page_cgroup *pc;
> +	struct mem_cgroup *memcg = NULL;
> +
> +	pc = lookup_page_cgroup(page);
> +	if (!PageCgroupUsed(pc))
> +		return NULL;
> +
> +	lock_page_cgroup(pc);
> +	if (PageCgroupUsed(pc))
> +		memcg = pc->mem_cgroup;
> +	unlock_page_cgroup(pc);

Once we drop the lock, is there anything that needs protection
(e.g. PageCgroupUsed)?  If there's no problem, then what's the point of
taking the lock?

> +	return memcg;
> +}
> +

[...]

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/7] lru: add an element to a memcg list
@ 2013-02-15  1:32     ` Greg Thelen
  0 siblings, 0 replies; 60+ messages in thread
From: Greg Thelen @ 2013-02-15  1:32 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel, Dave Chinner,
	Mel Gorman, Rik van Riel, Hugh Dickins

On Fri, Feb 08 2013, Glauber Costa wrote:

> With the infrastructure we now have, we can add an element to a memcg
> LRU list instead of the global list. The memcg lists are still
> per-node.

[...]

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b9e1941..bfb4b5b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3319,6 +3319,36 @@ static inline void memcg_resume_kmem_account(void)
>  	current->memcg_kmem_skip_account--;
>  }
>  
> +static struct mem_cgroup *mem_cgroup_from_kmem_page(struct page *page)
> +{
> +	struct page_cgroup *pc;
> +	struct mem_cgroup *memcg = NULL;
> +
> +	pc = lookup_page_cgroup(page);
> +	if (!PageCgroupUsed(pc))
> +		return NULL;
> +
> +	lock_page_cgroup(pc);
> +	if (PageCgroupUsed(pc))
> +		memcg = pc->mem_cgroup;
> +	unlock_page_cgroup(pc);

Once we drop the lock, is there anything that needs protection
(e.g. PageCgroupUsed)?  If there's no problem, then what's the point of
taking the lock?

> +	return memcg;
> +}
> +

[...]

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/7] vmscan: also shrink slab in memcg pressure
  2013-02-08 13:07   ` Glauber Costa
@ 2013-02-15  8:37     ` Kamezawa Hiroyuki
  -1 siblings, 0 replies; 60+ messages in thread
From: Kamezawa Hiroyuki @ 2013-02-15  8:37 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	Dave Shrinnker, linux-fsdevel, Dave Chinner, Mel Gorman,
	Rik van Riel, Hugh Dickins

(2013/02/08 22:07), Glauber Costa wrote:
> Without the surrounding infrastructure, this patch is a bit of a hammer:
> it will basically shrink objects from all memcgs under memcg pressure.
> At least, however, we will keep the scan limited to the shrinkers marked
> as per-memcg.
> 
> Future patches will implement the in-shrinker logic to filter objects
> based on its memcg association.
> 
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Cc: Dave Chinner <dchinner@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
>   include/linux/memcontrol.h | 16 ++++++++++++++++
>   include/linux/shrinker.h   |  4 ++++
>   mm/memcontrol.c            | 11 ++++++++++-
>   mm/vmscan.c                | 41 ++++++++++++++++++++++++++++++++++++++---
>   4 files changed, 68 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 0108a56..b7de557 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -200,6 +200,9 @@ void mem_cgroup_split_huge_fixup(struct page *head);
>   bool mem_cgroup_bad_page_check(struct page *page);
>   void mem_cgroup_print_bad_page(struct page *page);
>   #endif
> +
> +unsigned long
> +memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone);
>   #else /* CONFIG_MEMCG */
>   struct mem_cgroup;
>   
> @@ -384,6 +387,11 @@ static inline void mem_cgroup_replace_page_cache(struct page *oldpage,
>   				struct page *newpage)
>   {
>   }
> +
> +static inline unsigned long
> +memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone)
> +{
> +}
>   #endif /* CONFIG_MEMCG */
>   
>   #if !defined(CONFIG_MEMCG) || !defined(CONFIG_DEBUG_VM)
> @@ -436,6 +444,8 @@ static inline bool memcg_kmem_enabled(void)
>   	return static_key_false(&memcg_kmem_enabled_key);
>   }
>   
> +bool memcg_kmem_is_active(struct mem_cgroup *memcg);
> +
>   /*
>    * In general, we'll do everything in our power to not incur in any overhead
>    * for non-memcg users for the kmem functions. Not even a function call, if we
> @@ -569,6 +579,12 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
>   	return __memcg_kmem_get_cache(cachep, gfp);
>   }
>   #else
> +
> +static inline bool memcg_kmem_is_active(struct mem_cgroup *memcg)
> +{
> +	return false;
> +}
> +
>   #define for_each_memcg_cache_index(_idx)	\
>   	for (; NULL; )
>   
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index d4636a0..a767f2e 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -20,6 +20,9 @@ struct shrink_control {
>   
>   	/* shrink from these nodes */
>   	nodemask_t nodes_to_scan;
> +
> +	/* reclaim from this memcg only (if not NULL) */
> +	struct mem_cgroup *target_mem_cgroup;
>   };
>   
>   /*
> @@ -45,6 +48,7 @@ struct shrinker {
>   
>   	int seeks;	/* seeks to recreate an obj */
>   	long batch;	/* reclaim batch size, 0 = default */
> +	bool memcg_shrinker;
>   

What is this boolean for ? When is this set ?

>   	/* These are for internal use */
>   	struct list_head list;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 3817460..b1d4dfa 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -442,7 +442,7 @@ static inline void memcg_kmem_set_active(struct mem_cgroup *memcg)
>   	set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
>   }
>   
> -static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
> +bool memcg_kmem_is_active(struct mem_cgroup *memcg)
>   {
>   	return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
>   }
> @@ -991,6 +991,15 @@ mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg, int nid, int zid,
>   	return ret;
>   }
>   
> +unsigned long
> +memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone)
> +{
> +	int nid = zone_to_nid(zone);
> +	int zid = zone_idx(zone);
> +
> +	return mem_cgroup_zone_nr_lru_pages(memcg, nid, zid, LRU_ALL);
> +}
> +
>   static unsigned long
>   mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
>   			int nid, unsigned int lru_mask)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 6d96280..8af0e2b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -138,11 +138,42 @@ static bool global_reclaim(struct scan_control *sc)
>   {
>   	return !sc->target_mem_cgroup;
>   }
> +
> +/*
> + * kmem reclaim should usually not be triggered when we are doing targetted
> + * reclaim. It is only valid when global reclaim is triggered, or when the
> + * underlying memcg has kmem objects.
> + */
> +static bool has_kmem_reclaim(struct scan_control *sc)
> +{
> +	return !sc->target_mem_cgroup ||
> +	(sc->target_mem_cgroup && memcg_kmem_is_active(sc->target_mem_cgroup));
> +}
> +
> +static unsigned long
> +zone_nr_reclaimable_pages(struct scan_control *sc, struct zone *zone)
> +{
> +	if (global_reclaim(sc))
> +		return zone_reclaimable_pages(zone);
> +	return memcg_zone_reclaimable_pages(sc->target_mem_cgroup, zone);
> +}
> +
>   #else
>   static bool global_reclaim(struct scan_control *sc)
>   {
>   	return true;
>   }
> +
> +static bool has_kmem_reclaim(struct scan_control *sc)
> +{
> +	return true;
> +}
> +
> +static unsigned long
> +zone_nr_reclaimable_pages(struct scan_control *sc, struct zone *zone)
> +{
> +	return zone_reclaimable_pages(zone);
> +}
>   #endif

Can't be in a devided patch ?

>   
>   static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru)
> @@ -221,6 +252,9 @@ unsigned long shrink_slab(struct shrink_control *sc,
>   		long batch_size = shrinker->batch ? shrinker->batch
>   						  : SHRINK_BATCH;
>   
> +		if (!shrinker->memcg_shrinker && sc->target_mem_cgroup)
> +			continue;
> +

What does this mean ?

>   		max_pass = shrinker->count_objects(shrinker, sc);
>   		WARN_ON(max_pass < 0);
>   		if (max_pass <= 0)
> @@ -2170,9 +2204,9 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>   
>   		/*
>   		 * Don't shrink slabs when reclaiming memory from
> -		 * over limit cgroups
> +		 * over limit cgroups, unless we know they have kmem objects
>   		 */
> -		if (global_reclaim(sc)) {
> +		if (has_kmem_reclaim(sc)) {
>   			unsigned long lru_pages = 0;
>   
>   			nodes_clear(shrink->nodes_to_scan);
> @@ -2181,7 +2215,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>   				if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>   					continue;
>   
> -				lru_pages += zone_reclaimable_pages(zone);
> +				lru_pages += zone_nr_reclaimable_pages(sc, zone);
>   				node_set(zone_to_nid(zone),
>   					 shrink->nodes_to_scan);
>   			}
> @@ -2443,6 +2477,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>   	};
>   	struct shrink_control shrink = {
>   		.gfp_mask = sc.gfp_mask,
> +		.target_mem_cgroup = memcg,
>   	};
>   
>   	/*
> 

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/7] vmscan: also shrink slab in memcg pressure
@ 2013-02-15  8:37     ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 60+ messages in thread
From: Kamezawa Hiroyuki @ 2013-02-15  8:37 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	Dave Shrinnker, linux-fsdevel, Dave Chinner, Mel Gorman,
	Rik van Riel, Hugh Dickins

(2013/02/08 22:07), Glauber Costa wrote:
> Without the surrounding infrastructure, this patch is a bit of a hammer:
> it will basically shrink objects from all memcgs under memcg pressure.
> At least, however, we will keep the scan limited to the shrinkers marked
> as per-memcg.
> 
> Future patches will implement the in-shrinker logic to filter objects
> based on its memcg association.
> 
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Cc: Dave Chinner <dchinner@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
>   include/linux/memcontrol.h | 16 ++++++++++++++++
>   include/linux/shrinker.h   |  4 ++++
>   mm/memcontrol.c            | 11 ++++++++++-
>   mm/vmscan.c                | 41 ++++++++++++++++++++++++++++++++++++++---
>   4 files changed, 68 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 0108a56..b7de557 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -200,6 +200,9 @@ void mem_cgroup_split_huge_fixup(struct page *head);
>   bool mem_cgroup_bad_page_check(struct page *page);
>   void mem_cgroup_print_bad_page(struct page *page);
>   #endif
> +
> +unsigned long
> +memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone);
>   #else /* CONFIG_MEMCG */
>   struct mem_cgroup;
>   
> @@ -384,6 +387,11 @@ static inline void mem_cgroup_replace_page_cache(struct page *oldpage,
>   				struct page *newpage)
>   {
>   }
> +
> +static inline unsigned long
> +memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone)
> +{
> +}
>   #endif /* CONFIG_MEMCG */
>   
>   #if !defined(CONFIG_MEMCG) || !defined(CONFIG_DEBUG_VM)
> @@ -436,6 +444,8 @@ static inline bool memcg_kmem_enabled(void)
>   	return static_key_false(&memcg_kmem_enabled_key);
>   }
>   
> +bool memcg_kmem_is_active(struct mem_cgroup *memcg);
> +
>   /*
>    * In general, we'll do everything in our power to not incur in any overhead
>    * for non-memcg users for the kmem functions. Not even a function call, if we
> @@ -569,6 +579,12 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
>   	return __memcg_kmem_get_cache(cachep, gfp);
>   }
>   #else
> +
> +static inline bool memcg_kmem_is_active(struct mem_cgroup *memcg)
> +{
> +	return false;
> +}
> +
>   #define for_each_memcg_cache_index(_idx)	\
>   	for (; NULL; )
>   
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index d4636a0..a767f2e 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -20,6 +20,9 @@ struct shrink_control {
>   
>   	/* shrink from these nodes */
>   	nodemask_t nodes_to_scan;
> +
> +	/* reclaim from this memcg only (if not NULL) */
> +	struct mem_cgroup *target_mem_cgroup;
>   };
>   
>   /*
> @@ -45,6 +48,7 @@ struct shrinker {
>   
>   	int seeks;	/* seeks to recreate an obj */
>   	long batch;	/* reclaim batch size, 0 = default */
> +	bool memcg_shrinker;
>   

What is this boolean for ? When is this set ?

>   	/* These are for internal use */
>   	struct list_head list;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 3817460..b1d4dfa 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -442,7 +442,7 @@ static inline void memcg_kmem_set_active(struct mem_cgroup *memcg)
>   	set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
>   }
>   
> -static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
> +bool memcg_kmem_is_active(struct mem_cgroup *memcg)
>   {
>   	return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
>   }
> @@ -991,6 +991,15 @@ mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg, int nid, int zid,
>   	return ret;
>   }
>   
> +unsigned long
> +memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone)
> +{
> +	int nid = zone_to_nid(zone);
> +	int zid = zone_idx(zone);
> +
> +	return mem_cgroup_zone_nr_lru_pages(memcg, nid, zid, LRU_ALL);
> +}
> +
>   static unsigned long
>   mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
>   			int nid, unsigned int lru_mask)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 6d96280..8af0e2b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -138,11 +138,42 @@ static bool global_reclaim(struct scan_control *sc)
>   {
>   	return !sc->target_mem_cgroup;
>   }
> +
> +/*
> + * kmem reclaim should usually not be triggered when we are doing targetted
> + * reclaim. It is only valid when global reclaim is triggered, or when the
> + * underlying memcg has kmem objects.
> + */
> +static bool has_kmem_reclaim(struct scan_control *sc)
> +{
> +	return !sc->target_mem_cgroup ||
> +	(sc->target_mem_cgroup && memcg_kmem_is_active(sc->target_mem_cgroup));
> +}
> +
> +static unsigned long
> +zone_nr_reclaimable_pages(struct scan_control *sc, struct zone *zone)
> +{
> +	if (global_reclaim(sc))
> +		return zone_reclaimable_pages(zone);
> +	return memcg_zone_reclaimable_pages(sc->target_mem_cgroup, zone);
> +}
> +
>   #else
>   static bool global_reclaim(struct scan_control *sc)
>   {
>   	return true;
>   }
> +
> +static bool has_kmem_reclaim(struct scan_control *sc)
> +{
> +	return true;
> +}
> +
> +static unsigned long
> +zone_nr_reclaimable_pages(struct scan_control *sc, struct zone *zone)
> +{
> +	return zone_reclaimable_pages(zone);
> +}
>   #endif

Can't be in a devided patch ?

>   
>   static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru)
> @@ -221,6 +252,9 @@ unsigned long shrink_slab(struct shrink_control *sc,
>   		long batch_size = shrinker->batch ? shrinker->batch
>   						  : SHRINK_BATCH;
>   
> +		if (!shrinker->memcg_shrinker && sc->target_mem_cgroup)
> +			continue;
> +

What does this mean ?

>   		max_pass = shrinker->count_objects(shrinker, sc);
>   		WARN_ON(max_pass < 0);
>   		if (max_pass <= 0)
> @@ -2170,9 +2204,9 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>   
>   		/*
>   		 * Don't shrink slabs when reclaiming memory from
> -		 * over limit cgroups
> +		 * over limit cgroups, unless we know they have kmem objects
>   		 */
> -		if (global_reclaim(sc)) {
> +		if (has_kmem_reclaim(sc)) {
>   			unsigned long lru_pages = 0;
>   
>   			nodes_clear(shrink->nodes_to_scan);
> @@ -2181,7 +2215,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>   				if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>   					continue;
>   
> -				lru_pages += zone_reclaimable_pages(zone);
> +				lru_pages += zone_nr_reclaimable_pages(sc, zone);
>   				node_set(zone_to_nid(zone),
>   					 shrink->nodes_to_scan);
>   			}
> @@ -2443,6 +2477,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>   	};
>   	struct shrink_control shrink = {
>   		.gfp_mask = sc.gfp_mask,
> +		.target_mem_cgroup = memcg,
>   	};
>   
>   	/*
> 

Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/7] memcg,list_lru: duplicate LRUs upon kmemcg creation
  2013-02-08 13:07   ` Glauber Costa
@ 2013-02-15  9:21       ` Kamezawa Hiroyuki
  -1 siblings, 0 replies; 60+ messages in thread
From: Kamezawa Hiroyuki @ 2013-02-15  9:21 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Andrew Morton, Michal Hocko, Johannes Weiner, Dave Shrinnker,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Dave Chinner, Mel Gorman,
	Rik van Riel, Hugh Dickins

(2013/02/08 22:07), Glauber Costa wrote:
> When a new memcg is created, we need to open up room for its descriptors
> in all of the list_lrus that are marked per-memcg. The process is quite
> similar to the one we are using for the kmem caches: we initialize the
> new structures in an array indexed by kmemcg_id, and grow the array if
> needed. Key data like the size of the array will be shared between the
> kmem cache code and the list_lru code (they basically describe the same
> thing)
> 
> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> Cc: Dave Chinner <dchinner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Cc: Mel Gorman <mgorman-l3A5Bk7waGM@public.gmane.org>
> Cc: Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> Cc: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> Cc: Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
> ---
>   include/linux/list_lru.h   |  47 +++++++++++++++++
>   include/linux/memcontrol.h |   6 +++
>   lib/list_lru.c             | 115 +++++++++++++++++++++++++++++++++++++---
>   mm/memcontrol.c            | 128 ++++++++++++++++++++++++++++++++++++++++++---
>   mm/slab_common.c           |   1 -
>   5 files changed, 283 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
> index 02796da..370b989 100644
> --- a/include/linux/list_lru.h
> +++ b/include/linux/list_lru.h
> @@ -16,11 +16,58 @@ struct list_lru_node {
>   	long			nr_items;
>   } ____cacheline_aligned_in_smp;
>   
> +struct list_lru_array {
> +	struct list_lru_node node[1];
> +};

size is up to nr_node_ids ?

> +
>   struct list_lru {
> +	struct list_head	lrus;
>   	struct list_lru_node	node[MAX_NUMNODES];
>   	nodemask_t		active_nodes;
> +#ifdef CONFIG_MEMCG_KMEM
> +	struct list_lru_array	**memcg_lrus;
> +#endif
>   };
size is up to memcg_limited_groups_array_size ?


>   
> +struct mem_cgroup;
> +#ifdef CONFIG_MEMCG_KMEM
> +/*
> + * We will reuse the last bit of the pointer to tell the lru subsystem that
> + * this particular lru should be replicated when a memcg comes in.
> + */
> +static inline void lru_memcg_enable(struct list_lru *lru)
> +{
> +	lru->memcg_lrus = (void *)0x1ULL;
> +}
> +

This "enable" is not used in this patch itself, right ?

> +/*
> + * This will return true if we have already allocated and assignment a memcg
> + * pointer set to the LRU. Therefore, we need to mask the first bit out
> + */
> +static inline bool lru_memcg_is_assigned(struct list_lru *lru)
> +{
> +	return (unsigned long)lru->memcg_lrus & ~0x1ULL;
> +}
> +
> +struct list_lru_array *lru_alloc_array(void);
> +int memcg_update_all_lrus(unsigned long num);
> +void list_lru_destroy(struct list_lru *lru);
> +void list_lru_destroy_memcg(struct mem_cgroup *memcg);
> +#else
> +static inline void lru_memcg_enable(struct list_lru *lru)
> +{
> +}
> +
> +static inline bool lru_memcg_is_assigned(struct list_lru *lru)
> +{
> +	return false;
> +}
> +
> +static inline void list_lru_destroy(struct list_lru *lru)
> +{
> +}
> +#endif
> +
>   int list_lru_init(struct list_lru *lru);
>   int list_lru_add(struct list_lru *lru, struct list_head *item);
>   int list_lru_del(struct list_lru *lru, struct list_head *item);
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index b7de557..f9558d0 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -23,6 +23,7 @@
>   #include <linux/vm_event_item.h>
>   #include <linux/hardirq.h>
>   #include <linux/jump_label.h>
> +#include <linux/list_lru.h>
>   
>   struct mem_cgroup;
>   struct page_cgroup;
> @@ -475,6 +476,11 @@ void memcg_update_array_size(int num_groups);
>   struct kmem_cache *
>   __memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);
>   
> +int memcg_new_lru(struct list_lru *lru);
> +
> +int memcg_kmem_update_lru_size(struct list_lru *lru, int num_groups,
> +			       bool new_lru);
> +
>   void mem_cgroup_destroy_cache(struct kmem_cache *cachep);
>   void kmem_cache_destroy_memcg_children(struct kmem_cache *s);
>   
> diff --git a/lib/list_lru.c b/lib/list_lru.c
> index 0f08ed6..3b0e89d 100644
> --- a/lib/list_lru.c
> +++ b/lib/list_lru.c
> @@ -8,6 +8,7 @@
>   #include <linux/module.h>
>   #include <linux/mm.h>
>   #include <linux/list_lru.h>
> +#include <linux/memcontrol.h>
>   
>   int
>   list_lru_add(
> @@ -184,18 +185,118 @@ list_lru_dispose_all(
>   	return total;
>   }
>   
> -int
> -list_lru_init(
> -	struct list_lru	*lru)
> +/*
> + * This protects the list of all LRU in the system. One only needs
> + * to take when registering an LRU, or when duplicating the list of lrus.
> + * Transversing an LRU can and should be done outside the lock
> + */
> +static DEFINE_MUTEX(all_lrus_mutex);
> +static LIST_HEAD(all_lrus);
> +
> +static void list_lru_init_one(struct list_lru_node *lru)
> +{
> +	spin_lock_init(&lru->lock);
> +	INIT_LIST_HEAD(&lru->list);
> +	lru->nr_items = 0;
> +}
> +
> +struct list_lru_array *lru_alloc_array(void)
> +{
> +	struct list_lru_array *lru_array;
> +	int i;
> +
> +	lru_array = kzalloc(nr_node_ids * sizeof(struct list_lru_node),
> +				GFP_KERNEL);
> +	if (!lru_array)
> +		return NULL;
> +
> +	for (i = 0; i < nr_node_ids ; i++)
> +		list_lru_init_one(&lru_array->node[i]);
> +
> +	return lru_array;
> +}
> +
> +int __list_lru_init(struct list_lru *lru)
>   {
>   	int i;
>   
>   	nodes_clear(lru->active_nodes);
> -	for (i = 0; i < MAX_NUMNODES; i++) {
> -		spin_lock_init(&lru->node[i].lock);
> -		INIT_LIST_HEAD(&lru->node[i].list);
> -		lru->node[i].nr_items = 0;
> +	for (i = 0; i < MAX_NUMNODES; i++)
> +		list_lru_init_one(&lru->node[i]);

Hmm. lru_list is up to MAX_NUMNODES, your new one is up to nr_node_ids...

> +
> +	return 0;
> +}
> +
> +#ifdef CONFIG_MEMCG_KMEM
> +static int memcg_init_lru(struct list_lru *lru)
> +{
> +	int ret;
> +
> +	if (!lru->memcg_lrus)
> +		return 0;
> +
> +	INIT_LIST_HEAD(&lru->lrus);
> +	mutex_lock(&all_lrus_mutex);
> +	list_add(&lru->lrus, &all_lrus);
> +	ret = memcg_new_lru(lru);
> +	mutex_unlock(&all_lrus_mutex);
> +	return ret;
> +}

 only writer takes this mutex ?

> +
> +int memcg_update_all_lrus(unsigned long num)
> +{
> +	int ret = 0;
> +	struct list_lru *lru;
> +
> +	mutex_lock(&all_lrus_mutex);
> +	list_for_each_entry(lru, &all_lrus, lrus) {
> +		if (!lru->memcg_lrus)
> +			continue;
> +
> +		ret = memcg_kmem_update_lru_size(lru, num, false);
> +		if (ret)
> +			goto out;
> +	}
> +out:
> +	mutex_unlock(&all_lrus_mutex);
> +	return ret;
> +}
> +
> +void list_lru_destroy(struct list_lru *lru)
> +{
> +	if (!lru->memcg_lrus)
> +		return;
> +
> +	mutex_lock(&all_lrus_mutex);
> +	list_del(&lru->lrus);
> +	mutex_unlock(&all_lrus_mutex);
> +}
> +
> +void list_lru_destroy_memcg(struct mem_cgroup *memcg)
> +{
> +	struct list_lru *lru;
> +	mutex_lock(&all_lrus_mutex);
> +	list_for_each_entry(lru, &all_lrus, lrus) {
> +		lru->memcg_lrus[memcg_cache_id(memcg)] = NULL;
> +		/* everybody must beaware that this memcg is no longer valid */

Hm, the object pointed by this array entry will be freed by some other func ?

> +		wmb();
>   	}
> +	mutex_unlock(&all_lrus_mutex);
> +}
> +#else
> +static int memcg_init_lru(struct list_lru *lru)
> +{
>   	return 0;
>   }
> +#endif
> +
> +int list_lru_init(struct list_lru *lru)
> +{
> +	int ret;
> +	ret = __list_lru_init(lru);
> +	if (ret)
> +		return ret;
> +
> +	return memcg_init_lru(lru);
> +}
>   EXPORT_SYMBOL_GPL(list_lru_init);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b1d4dfa..b9e1941 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3032,16 +3032,30 @@ int memcg_update_cache_sizes(struct mem_cgroup *memcg)
>   	memcg_kmem_set_activated(memcg);
>   
>   	ret = memcg_update_all_caches(num+1);
> -	if (ret) {
> -		ida_simple_remove(&kmem_limited_groups, num);
> -		memcg_kmem_clear_activated(memcg);
> -		return ret;
> -	}
> +	if (ret)
> +		goto out;
> +
> +	/*
> +	 * We should make sure that the array size is not updated until we are
> +	 * done; otherwise we have no easy way to know whether or not we should
> +	 * grow the array.
> +	 */
> +	ret = memcg_update_all_lrus(num + 1);
> +	if (ret)
> +		goto out;
>   
>   	memcg->kmemcg_id = num;
> +
> +	memcg_update_array_size(num + 1);
> +
>   	INIT_LIST_HEAD(&memcg->memcg_slab_caches);
>   	mutex_init(&memcg->slab_caches_mutex);
> +
>   	return 0;
> +out:
> +	ida_simple_remove(&kmem_limited_groups, num);
> +	memcg_kmem_clear_activated(memcg);
> +	return ret;
>   }
>   
>   static size_t memcg_caches_array_size(int num_groups)
> @@ -3121,6 +3135,106 @@ int memcg_update_cache_size(struct kmem_cache *s, int num_groups)
>   	return 0;
>   }
>   
> +/*
> + * memcg_kmem_update_lru_size - fill in kmemcg info into a list_lru
> + *
> + * @lru: the lru we are operating with
> + * @num_groups: how many kmem-limited cgroups we have
> + * @new_lru: true if this is a new_lru being created, false if this
> + * was triggered from the memcg side
> + *
> + * Returns 0 on success, and an error code otherwise.
> + *
> + * This function can be called either when a new kmem-limited memcg appears,
> + * or when a new list_lru is created. The work is roughly the same in two cases,
> + * but in the later we never have to expand the array size.
> + *
> + * This is always protected by the all_lrus_mutex from the list_lru side.
> + */
> +int memcg_kmem_update_lru_size(struct list_lru *lru, int num_groups,
> +			       bool new_lru)
> +{
> +	struct list_lru_array **new_lru_array;
> +	struct list_lru_array *lru_array;
> +

Both are named as array ...confusing ;)

> +	lru_array = lru_alloc_array();
> +	if (!lru_array)
> +		return -ENOMEM;
> +
> +	/* need some fucked up locking around the list acquisition */
> +	if ((num_groups > memcg_limited_groups_array_size) || new_lru) {
> +		int i;
> +		struct list_lru_array **old_array;
> +		size_t size = memcg_caches_array_size(num_groups);
> +
> +		new_lru_array = kzalloc(size * sizeof(void *), GFP_KERNEL);
> +		if (!new_lru_array) {
> +			kfree(lru_array);
> +			return -ENOMEM;
> +		}
> +
> +		for (i = 0; i < memcg_limited_groups_array_size; i++) {
> +			if (!lru_memcg_is_assigned(lru) || lru->memcg_lrus[i])
> +				continue;
> +			new_lru_array[i] =  lru->memcg_lrus[i];
> +		}
> +
> +		old_array = lru->memcg_lrus;
> +		lru->memcg_lrus = new_lru_array;
> +		/*
> +		 * We don't need a barrier here because we are just copying
> +		 * information over. Anybody operating in memcg_lrus will
> +		 * either follow the new array or the old one and they contain
> +		 * exactly the same information. The new space in the end is
> +		 * always empty anyway.
> +		 *
> +		 * We do have to make sure that no more users of the old
> +		 * memcg_lrus array exist before we free, and this is achieved
> +		 * by the synchronize_lru below.
> +		 */
> +		if (lru_memcg_is_assigned(lru)) {
> +			synchronize_rcu();
> +			kfree(old_array);
> +		}
> +
> +	}
> +
> +	if (lru_memcg_is_assigned(lru)) {
> +		lru->memcg_lrus[num_groups - 1] = lru_array;

Can't this pointer already set ?

> +		/*
> +		 * Here we do need the barrier, because of the state transition
> +		 * implied by the assignment of the array. All users should be
> +		 * able to see it
> +		 */
> +		wmb();
> +	}
> +
> +	return 0;
> +
> +}
> +
> +int memcg_new_lru(struct list_lru *lru)
> +{
> +	struct mem_cgroup *iter;
> +
> +	if (!memcg_kmem_enabled())
> +		return 0;
> +
> +	for_each_mem_cgroup(iter) {
> +		int ret;
> +		int memcg_id = memcg_cache_id(iter);
> +		if (memcg_id < 0)
> +			continue;
> +
> +		ret = memcg_kmem_update_lru_size(lru, memcg_id + 1, true);
> +		if (ret) {
> +			mem_cgroup_iter_break(root_mem_cgroup, iter);
> +			return ret;
> +		}
> +	}
> +	return 0;
> +}
> +
>   int memcg_register_cache(struct mem_cgroup *memcg, struct kmem_cache *s,
>   			 struct kmem_cache *root_cache)
>   {
> @@ -5914,8 +6028,10 @@ static void kmem_cgroup_destroy(struct mem_cgroup *memcg)
>   	 * possible that the charges went down to 0 between mark_dead and the
>   	 * res_counter read, so in that case, we don't need the put
>   	 */
> -	if (memcg_kmem_test_and_clear_dead(memcg))
> +	if (memcg_kmem_test_and_clear_dead(memcg)) {
> +		list_lru_destroy_memcg(memcg);
>   		mem_cgroup_put(memcg);
> +	}
>   }
>   #else
>   static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> indek
x 3f3cd97..2470d11 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -102,7 +102,6 @@ int memcg_update_all_caches(int num_memcgs)
>   			goto out;
>   	}
>   
> -	memcg_update_array_size(num_memcgs);
>   out:
>   	mutex_unlock(&slab_mutex);
>   	return ret;
> 


Thanks,
-Kame

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/7] memcg,list_lru: duplicate LRUs upon kmemcg creation
@ 2013-02-15  9:21       ` Kamezawa Hiroyuki
  0 siblings, 0 replies; 60+ messages in thread
From: Kamezawa Hiroyuki @ 2013-02-15  9:21 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	Dave Shrinnker, linux-fsdevel, Dave Chinner, Mel Gorman,
	Rik van Riel, Hugh Dickins

(2013/02/08 22:07), Glauber Costa wrote:
> When a new memcg is created, we need to open up room for its descriptors
> in all of the list_lrus that are marked per-memcg. The process is quite
> similar to the one we are using for the kmem caches: we initialize the
> new structures in an array indexed by kmemcg_id, and grow the array if
> needed. Key data like the size of the array will be shared between the
> kmem cache code and the list_lru code (they basically describe the same
> thing)
> 
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> Cc: Dave Chinner <dchinner@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
>   include/linux/list_lru.h   |  47 +++++++++++++++++
>   include/linux/memcontrol.h |   6 +++
>   lib/list_lru.c             | 115 +++++++++++++++++++++++++++++++++++++---
>   mm/memcontrol.c            | 128 ++++++++++++++++++++++++++++++++++++++++++---
>   mm/slab_common.c           |   1 -
>   5 files changed, 283 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
> index 02796da..370b989 100644
> --- a/include/linux/list_lru.h
> +++ b/include/linux/list_lru.h
> @@ -16,11 +16,58 @@ struct list_lru_node {
>   	long			nr_items;
>   } ____cacheline_aligned_in_smp;
>   
> +struct list_lru_array {
> +	struct list_lru_node node[1];
> +};

size is up to nr_node_ids ?

> +
>   struct list_lru {
> +	struct list_head	lrus;
>   	struct list_lru_node	node[MAX_NUMNODES];
>   	nodemask_t		active_nodes;
> +#ifdef CONFIG_MEMCG_KMEM
> +	struct list_lru_array	**memcg_lrus;
> +#endif
>   };
size is up to memcg_limited_groups_array_size ?


>   
> +struct mem_cgroup;
> +#ifdef CONFIG_MEMCG_KMEM
> +/*
> + * We will reuse the last bit of the pointer to tell the lru subsystem that
> + * this particular lru should be replicated when a memcg comes in.
> + */
> +static inline void lru_memcg_enable(struct list_lru *lru)
> +{
> +	lru->memcg_lrus = (void *)0x1ULL;
> +}
> +

This "enable" is not used in this patch itself, right ?

> +/*
> + * This will return true if we have already allocated and assignment a memcg
> + * pointer set to the LRU. Therefore, we need to mask the first bit out
> + */
> +static inline bool lru_memcg_is_assigned(struct list_lru *lru)
> +{
> +	return (unsigned long)lru->memcg_lrus & ~0x1ULL;
> +}
> +
> +struct list_lru_array *lru_alloc_array(void);
> +int memcg_update_all_lrus(unsigned long num);
> +void list_lru_destroy(struct list_lru *lru);
> +void list_lru_destroy_memcg(struct mem_cgroup *memcg);
> +#else
> +static inline void lru_memcg_enable(struct list_lru *lru)
> +{
> +}
> +
> +static inline bool lru_memcg_is_assigned(struct list_lru *lru)
> +{
> +	return false;
> +}
> +
> +static inline void list_lru_destroy(struct list_lru *lru)
> +{
> +}
> +#endif
> +
>   int list_lru_init(struct list_lru *lru);
>   int list_lru_add(struct list_lru *lru, struct list_head *item);
>   int list_lru_del(struct list_lru *lru, struct list_head *item);
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index b7de557..f9558d0 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -23,6 +23,7 @@
>   #include <linux/vm_event_item.h>
>   #include <linux/hardirq.h>
>   #include <linux/jump_label.h>
> +#include <linux/list_lru.h>
>   
>   struct mem_cgroup;
>   struct page_cgroup;
> @@ -475,6 +476,11 @@ void memcg_update_array_size(int num_groups);
>   struct kmem_cache *
>   __memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);
>   
> +int memcg_new_lru(struct list_lru *lru);
> +
> +int memcg_kmem_update_lru_size(struct list_lru *lru, int num_groups,
> +			       bool new_lru);
> +
>   void mem_cgroup_destroy_cache(struct kmem_cache *cachep);
>   void kmem_cache_destroy_memcg_children(struct kmem_cache *s);
>   
> diff --git a/lib/list_lru.c b/lib/list_lru.c
> index 0f08ed6..3b0e89d 100644
> --- a/lib/list_lru.c
> +++ b/lib/list_lru.c
> @@ -8,6 +8,7 @@
>   #include <linux/module.h>
>   #include <linux/mm.h>
>   #include <linux/list_lru.h>
> +#include <linux/memcontrol.h>
>   
>   int
>   list_lru_add(
> @@ -184,18 +185,118 @@ list_lru_dispose_all(
>   	return total;
>   }
>   
> -int
> -list_lru_init(
> -	struct list_lru	*lru)
> +/*
> + * This protects the list of all LRU in the system. One only needs
> + * to take when registering an LRU, or when duplicating the list of lrus.
> + * Transversing an LRU can and should be done outside the lock
> + */
> +static DEFINE_MUTEX(all_lrus_mutex);
> +static LIST_HEAD(all_lrus);
> +
> +static void list_lru_init_one(struct list_lru_node *lru)
> +{
> +	spin_lock_init(&lru->lock);
> +	INIT_LIST_HEAD(&lru->list);
> +	lru->nr_items = 0;
> +}
> +
> +struct list_lru_array *lru_alloc_array(void)
> +{
> +	struct list_lru_array *lru_array;
> +	int i;
> +
> +	lru_array = kzalloc(nr_node_ids * sizeof(struct list_lru_node),
> +				GFP_KERNEL);
> +	if (!lru_array)
> +		return NULL;
> +
> +	for (i = 0; i < nr_node_ids ; i++)
> +		list_lru_init_one(&lru_array->node[i]);
> +
> +	return lru_array;
> +}
> +
> +int __list_lru_init(struct list_lru *lru)
>   {
>   	int i;
>   
>   	nodes_clear(lru->active_nodes);
> -	for (i = 0; i < MAX_NUMNODES; i++) {
> -		spin_lock_init(&lru->node[i].lock);
> -		INIT_LIST_HEAD(&lru->node[i].list);
> -		lru->node[i].nr_items = 0;
> +	for (i = 0; i < MAX_NUMNODES; i++)
> +		list_lru_init_one(&lru->node[i]);

Hmm. lru_list is up to MAX_NUMNODES, your new one is up to nr_node_ids...

> +
> +	return 0;
> +}
> +
> +#ifdef CONFIG_MEMCG_KMEM
> +static int memcg_init_lru(struct list_lru *lru)
> +{
> +	int ret;
> +
> +	if (!lru->memcg_lrus)
> +		return 0;
> +
> +	INIT_LIST_HEAD(&lru->lrus);
> +	mutex_lock(&all_lrus_mutex);
> +	list_add(&lru->lrus, &all_lrus);
> +	ret = memcg_new_lru(lru);
> +	mutex_unlock(&all_lrus_mutex);
> +	return ret;
> +}

 only writer takes this mutex ?

> +
> +int memcg_update_all_lrus(unsigned long num)
> +{
> +	int ret = 0;
> +	struct list_lru *lru;
> +
> +	mutex_lock(&all_lrus_mutex);
> +	list_for_each_entry(lru, &all_lrus, lrus) {
> +		if (!lru->memcg_lrus)
> +			continue;
> +
> +		ret = memcg_kmem_update_lru_size(lru, num, false);
> +		if (ret)
> +			goto out;
> +	}
> +out:
> +	mutex_unlock(&all_lrus_mutex);
> +	return ret;
> +}
> +
> +void list_lru_destroy(struct list_lru *lru)
> +{
> +	if (!lru->memcg_lrus)
> +		return;
> +
> +	mutex_lock(&all_lrus_mutex);
> +	list_del(&lru->lrus);
> +	mutex_unlock(&all_lrus_mutex);
> +}
> +
> +void list_lru_destroy_memcg(struct mem_cgroup *memcg)
> +{
> +	struct list_lru *lru;
> +	mutex_lock(&all_lrus_mutex);
> +	list_for_each_entry(lru, &all_lrus, lrus) {
> +		lru->memcg_lrus[memcg_cache_id(memcg)] = NULL;
> +		/* everybody must beaware that this memcg is no longer valid */

Hm, the object pointed by this array entry will be freed by some other func ?

> +		wmb();
>   	}
> +	mutex_unlock(&all_lrus_mutex);
> +}
> +#else
> +static int memcg_init_lru(struct list_lru *lru)
> +{
>   	return 0;
>   }
> +#endif
> +
> +int list_lru_init(struct list_lru *lru)
> +{
> +	int ret;
> +	ret = __list_lru_init(lru);
> +	if (ret)
> +		return ret;
> +
> +	return memcg_init_lru(lru);
> +}
>   EXPORT_SYMBOL_GPL(list_lru_init);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b1d4dfa..b9e1941 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3032,16 +3032,30 @@ int memcg_update_cache_sizes(struct mem_cgroup *memcg)
>   	memcg_kmem_set_activated(memcg);
>   
>   	ret = memcg_update_all_caches(num+1);
> -	if (ret) {
> -		ida_simple_remove(&kmem_limited_groups, num);
> -		memcg_kmem_clear_activated(memcg);
> -		return ret;
> -	}
> +	if (ret)
> +		goto out;
> +
> +	/*
> +	 * We should make sure that the array size is not updated until we are
> +	 * done; otherwise we have no easy way to know whether or not we should
> +	 * grow the array.
> +	 */
> +	ret = memcg_update_all_lrus(num + 1);
> +	if (ret)
> +		goto out;
>   
>   	memcg->kmemcg_id = num;
> +
> +	memcg_update_array_size(num + 1);
> +
>   	INIT_LIST_HEAD(&memcg->memcg_slab_caches);
>   	mutex_init(&memcg->slab_caches_mutex);
> +
>   	return 0;
> +out:
> +	ida_simple_remove(&kmem_limited_groups, num);
> +	memcg_kmem_clear_activated(memcg);
> +	return ret;
>   }
>   
>   static size_t memcg_caches_array_size(int num_groups)
> @@ -3121,6 +3135,106 @@ int memcg_update_cache_size(struct kmem_cache *s, int num_groups)
>   	return 0;
>   }
>   
> +/*
> + * memcg_kmem_update_lru_size - fill in kmemcg info into a list_lru
> + *
> + * @lru: the lru we are operating with
> + * @num_groups: how many kmem-limited cgroups we have
> + * @new_lru: true if this is a new_lru being created, false if this
> + * was triggered from the memcg side
> + *
> + * Returns 0 on success, and an error code otherwise.
> + *
> + * This function can be called either when a new kmem-limited memcg appears,
> + * or when a new list_lru is created. The work is roughly the same in two cases,
> + * but in the later we never have to expand the array size.
> + *
> + * This is always protected by the all_lrus_mutex from the list_lru side.
> + */
> +int memcg_kmem_update_lru_size(struct list_lru *lru, int num_groups,
> +			       bool new_lru)
> +{
> +	struct list_lru_array **new_lru_array;
> +	struct list_lru_array *lru_array;
> +

Both are named as array ...confusing ;)

> +	lru_array = lru_alloc_array();
> +	if (!lru_array)
> +		return -ENOMEM;
> +
> +	/* need some fucked up locking around the list acquisition */
> +	if ((num_groups > memcg_limited_groups_array_size) || new_lru) {
> +		int i;
> +		struct list_lru_array **old_array;
> +		size_t size = memcg_caches_array_size(num_groups);
> +
> +		new_lru_array = kzalloc(size * sizeof(void *), GFP_KERNEL);
> +		if (!new_lru_array) {
> +			kfree(lru_array);
> +			return -ENOMEM;
> +		}
> +
> +		for (i = 0; i < memcg_limited_groups_array_size; i++) {
> +			if (!lru_memcg_is_assigned(lru) || lru->memcg_lrus[i])
> +				continue;
> +			new_lru_array[i] =  lru->memcg_lrus[i];
> +		}
> +
> +		old_array = lru->memcg_lrus;
> +		lru->memcg_lrus = new_lru_array;
> +		/*
> +		 * We don't need a barrier here because we are just copying
> +		 * information over. Anybody operating in memcg_lrus will
> +		 * either follow the new array or the old one and they contain
> +		 * exactly the same information. The new space in the end is
> +		 * always empty anyway.
> +		 *
> +		 * We do have to make sure that no more users of the old
> +		 * memcg_lrus array exist before we free, and this is achieved
> +		 * by the synchronize_lru below.
> +		 */
> +		if (lru_memcg_is_assigned(lru)) {
> +			synchronize_rcu();
> +			kfree(old_array);
> +		}
> +
> +	}
> +
> +	if (lru_memcg_is_assigned(lru)) {
> +		lru->memcg_lrus[num_groups - 1] = lru_array;

Can't this pointer already set ?

> +		/*
> +		 * Here we do need the barrier, because of the state transition
> +		 * implied by the assignment of the array. All users should be
> +		 * able to see it
> +		 */
> +		wmb();
> +	}
> +
> +	return 0;
> +
> +}
> +
> +int memcg_new_lru(struct list_lru *lru)
> +{
> +	struct mem_cgroup *iter;
> +
> +	if (!memcg_kmem_enabled())
> +		return 0;
> +
> +	for_each_mem_cgroup(iter) {
> +		int ret;
> +		int memcg_id = memcg_cache_id(iter);
> +		if (memcg_id < 0)
> +			continue;
> +
> +		ret = memcg_kmem_update_lru_size(lru, memcg_id + 1, true);
> +		if (ret) {
> +			mem_cgroup_iter_break(root_mem_cgroup, iter);
> +			return ret;
> +		}
> +	}
> +	return 0;
> +}
> +
>   int memcg_register_cache(struct mem_cgroup *memcg, struct kmem_cache *s,
>   			 struct kmem_cache *root_cache)
>   {
> @@ -5914,8 +6028,10 @@ static void kmem_cgroup_destroy(struct mem_cgroup *memcg)
>   	 * possible that the charges went down to 0 between mark_dead and the
>   	 * res_counter read, so in that case, we don't need the put
>   	 */
> -	if (memcg_kmem_test_and_clear_dead(memcg))
> +	if (memcg_kmem_test_and_clear_dead(memcg)) {
> +		list_lru_destroy_memcg(memcg);
>   		mem_cgroup_put(memcg);
> +	}
>   }
>   #else
>   static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> indek
x 3f3cd97..2470d11 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -102,7 +102,6 @@ int memcg_update_all_caches(int num_memcgs)
>   			goto out;
>   	}
>   
> -	memcg_update_array_size(num_memcgs);
>   out:
>   	mutex_unlock(&slab_mutex);
>   	return ret;
> 


Thanks,
-Kame

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/7] vmscan: also shrink slab in memcg pressure
  2013-02-15  8:37     ` Kamezawa Hiroyuki
  (?)
@ 2013-02-15 10:30         ` Glauber Costa
  -1 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-15 10:30 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Andrew Morton, Michal Hocko, Johannes Weiner, Dave Shrinnker,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Dave Chinner, Mel Gorman,
	Rik van Riel, Hugh Dickins

>> @@ -45,6 +48,7 @@ struct shrinker {
>>   
>>   	int seeks;	/* seeks to recreate an obj */
>>   	long batch;	/* reclaim batch size, 0 = default */
>> +	bool memcg_shrinker;
>>   
> 
> What is this boolean for ? When is this set ?
It is set when a subsystem declares that its shrinker is memcg capable.
Therefore, it won't be done until all infrastructure is in place. Take a
look at the super.c patches at the end of the series.


>>   static bool global_reclaim(struct scan_control *sc)
>>   {
>>   	return true;
>>   }
>> +
>> +static bool has_kmem_reclaim(struct scan_control *sc)
>> +{
>> +	return true;
>> +}
>> +
>> +static unsigned long
>> +zone_nr_reclaimable_pages(struct scan_control *sc, struct zone *zone)
>> +{
>> +	return zone_reclaimable_pages(zone);
>> +}
>>   #endif
> 
> Can't be in a devided patch ?
> 
if you prefer this way, sure, I can separate it.

>>   static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru)
>> @@ -221,6 +252,9 @@ unsigned long shrink_slab(struct shrink_control *sc,
>>   		long batch_size = shrinker->batch ? shrinker->batch
>>   						  : SHRINK_BATCH;
>>   
>> +		if (!shrinker->memcg_shrinker && sc->target_mem_cgroup)
>> +			continue;
>> +
> 
> What does this mean ?

It means that if target_mem_cgroup is set, we should skip all the
shrinkers that are not memcg capable. Maybe if I invert the order it
will be clearer?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/7] vmscan: also shrink slab in memcg pressure
@ 2013-02-15 10:30         ` Glauber Costa
  0 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-15 10:30 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-mm, cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	Dave Shrinnker, linux-fsdevel, Dave Chinner, Mel Gorman,
	Rik van Riel, Hugh Dickins

>> @@ -45,6 +48,7 @@ struct shrinker {
>>   
>>   	int seeks;	/* seeks to recreate an obj */
>>   	long batch;	/* reclaim batch size, 0 = default */
>> +	bool memcg_shrinker;
>>   
> 
> What is this boolean for ? When is this set ?
It is set when a subsystem declares that its shrinker is memcg capable.
Therefore, it won't be done until all infrastructure is in place. Take a
look at the super.c patches at the end of the series.


>>   static bool global_reclaim(struct scan_control *sc)
>>   {
>>   	return true;
>>   }
>> +
>> +static bool has_kmem_reclaim(struct scan_control *sc)
>> +{
>> +	return true;
>> +}
>> +
>> +static unsigned long
>> +zone_nr_reclaimable_pages(struct scan_control *sc, struct zone *zone)
>> +{
>> +	return zone_reclaimable_pages(zone);
>> +}
>>   #endif
> 
> Can't be in a devided patch ?
> 
if you prefer this way, sure, I can separate it.

>>   static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru)
>> @@ -221,6 +252,9 @@ unsigned long shrink_slab(struct shrink_control *sc,
>>   		long batch_size = shrinker->batch ? shrinker->batch
>>   						  : SHRINK_BATCH;
>>   
>> +		if (!shrinker->memcg_shrinker && sc->target_mem_cgroup)
>> +			continue;
>> +
> 
> What does this mean ?

It means that if target_mem_cgroup is set, we should skip all the
shrinkers that are not memcg capable. Maybe if I invert the order it
will be clearer?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/7] vmscan: also shrink slab in memcg pressure
@ 2013-02-15 10:30         ` Glauber Costa
  0 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-15 10:30 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Andrew Morton, Michal Hocko, Johannes Weiner, Dave Shrinnker,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Dave Chinner, Mel Gorman,
	Rik van Riel, Hugh Dickins

>> @@ -45,6 +48,7 @@ struct shrinker {
>>   
>>   	int seeks;	/* seeks to recreate an obj */
>>   	long batch;	/* reclaim batch size, 0 = default */
>> +	bool memcg_shrinker;
>>   
> 
> What is this boolean for ? When is this set ?
It is set when a subsystem declares that its shrinker is memcg capable.
Therefore, it won't be done until all infrastructure is in place. Take a
look at the super.c patches at the end of the series.


>>   static bool global_reclaim(struct scan_control *sc)
>>   {
>>   	return true;
>>   }
>> +
>> +static bool has_kmem_reclaim(struct scan_control *sc)
>> +{
>> +	return true;
>> +}
>> +
>> +static unsigned long
>> +zone_nr_reclaimable_pages(struct scan_control *sc, struct zone *zone)
>> +{
>> +	return zone_reclaimable_pages(zone);
>> +}
>>   #endif
> 
> Can't be in a devided patch ?
> 
if you prefer this way, sure, I can separate it.

>>   static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru)
>> @@ -221,6 +252,9 @@ unsigned long shrink_slab(struct shrink_control *sc,
>>   		long batch_size = shrinker->batch ? shrinker->batch
>>   						  : SHRINK_BATCH;
>>   
>> +		if (!shrinker->memcg_shrinker && sc->target_mem_cgroup)
>> +			continue;
>> +
> 
> What does this mean ?

It means that if target_mem_cgroup is set, we should skip all the
shrinkers that are not memcg capable. Maybe if I invert the order it
will be clearer?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/7] memcg,list_lru: duplicate LRUs upon kmemcg creation
  2013-02-15  9:21       ` Kamezawa Hiroyuki
  (?)
@ 2013-02-15 10:36         ` Glauber Costa
  -1 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-15 10:36 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-mm, cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	Dave Shrinnker, linux-fsdevel, Dave Chinner, Mel Gorman,
	Rik van Riel, Hugh Dickins

On 02/15/2013 01:21 PM, Kamezawa Hiroyuki wrote:
> (2013/02/08 22:07), Glauber Costa wrote:
>> When a new memcg is created, we need to open up room for its descriptors
>> in all of the list_lrus that are marked per-memcg. The process is quite
>> similar to the one we are using for the kmem caches: we initialize the
>> new structures in an array indexed by kmemcg_id, and grow the array if
>> needed. Key data like the size of the array will be shared between the
>> kmem cache code and the list_lru code (they basically describe the same
>> thing)
>>
>> Signed-off-by: Glauber Costa <glommer@parallels.com>
>> Cc: Dave Chinner <dchinner@redhat.com>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Rik van Riel <riel@redhat.com>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Michal Hocko <mhocko@suse.cz>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> ---
>>   include/linux/list_lru.h   |  47 +++++++++++++++++
>>   include/linux/memcontrol.h |   6 +++
>>   lib/list_lru.c             | 115 +++++++++++++++++++++++++++++++++++++---
>>   mm/memcontrol.c            | 128 ++++++++++++++++++++++++++++++++++++++++++---
>>   mm/slab_common.c           |   1 -
>>   5 files changed, 283 insertions(+), 14 deletions(-)
>>
>> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
>> index 02796da..370b989 100644
>> --- a/include/linux/list_lru.h
>> +++ b/include/linux/list_lru.h
>> @@ -16,11 +16,58 @@ struct list_lru_node {
>>   	long			nr_items;
>>   } ____cacheline_aligned_in_smp;
>>   
>> +struct list_lru_array {
>> +	struct list_lru_node node[1];
>> +};
> 
> size is up to nr_node_ids ?
> 

This is a dynamic quantity, so the correct way to do it is to size it to
1 (or 0 for that matter), have it be the last element of the struct, and
then allocate the right size at allocation time.

>> +
>>   struct list_lru {
>> +	struct list_head	lrus;
>>   	struct list_lru_node	node[MAX_NUMNODES];
>>   	nodemask_t		active_nodes;
>> +#ifdef CONFIG_MEMCG_KMEM
>> +	struct list_lru_array	**memcg_lrus;
>> +#endif
>>   };
> size is up to memcg_limited_groups_array_size ?
> 
ditto. This one not only is a dynamic quantity, but also changes as
new memcgs are created.

>> +/*
>> + * We will reuse the last bit of the pointer to tell the lru subsystem that
>> + * this particular lru should be replicated when a memcg comes in.
>> + */
>> +static inline void lru_memcg_enable(struct list_lru *lru)
>> +{
>> +	lru->memcg_lrus = (void *)0x1ULL;
>> +}
>> +
> 
> This "enable" is not used in this patch itself, right ?
> 
I am not sure. It is definitely used later on, I can check and move it
if necessary.

>> +int __list_lru_init(struct list_lru *lru)
>>   {
>>   	int i;
>>   
>>   	nodes_clear(lru->active_nodes);
>> -	for (i = 0; i < MAX_NUMNODES; i++) {
>> -		spin_lock_init(&lru->node[i].lock);
>> -		INIT_LIST_HEAD(&lru->node[i].list);
>> -		lru->node[i].nr_items = 0;
>> +	for (i = 0; i < MAX_NUMNODES; i++)
>> +		list_lru_init_one(&lru->node[i]);
> 
> Hmm. lru_list is up to MAX_NUMNODES, your new one is up to nr_node_ids...
>

well spotted.
Thanks.

>> +	INIT_LIST_HEAD(&lru->lrus);
>> +	mutex_lock(&all_lrus_mutex);
>> +	list_add(&lru->lrus, &all_lrus);
>> +	ret = memcg_new_lru(lru);
>> +	mutex_unlock(&all_lrus_mutex);
>> +	return ret;
>> +}
> 
>  only writer takes this mutex ?
> 
yes. IIRC, I documented that. But I might be wrong (will check)

>> +void list_lru_destroy_memcg(struct mem_cgroup *memcg)
>> +{
>> +	struct list_lru *lru;
>> +	mutex_lock(&all_lrus_mutex);
>> +	list_for_each_entry(lru, &all_lrus, lrus) {
>> +		lru->memcg_lrus[memcg_cache_id(memcg)] = NULL;
>> +		/* everybody must beaware that this memcg is no longer valid */
> 
> Hm, the object pointed by this array entry will be freed by some other func ?

They should be destroyed before we get here, but I am skimming through
the code now, and I see they are not. On a second thought, I think it
would be simpler and less error prone if I would just free them here...

>> +		new_lru_array = kzalloc(size * sizeof(void *), GFP_KERNEL);
>> +		if (!new_lru_array) {
>> +			kfree(lru_array);
>> +			return -ENOMEM;
>> +		}
>> +
>> +		for (i = 0; i < memcg_limited_groups_array_size; i++) {
>> +			if (!lru_memcg_is_assigned(lru) || lru->memcg_lrus[i])
>> +				continue;
>> +			new_lru_array[i] =  lru->memcg_lrus[i];
>> +		}
>> +
>> +		old_array = lru->memcg_lrus;
>> +		lru->memcg_lrus = new_lru_array;
>> +		/*
>> +		 * We don't need a barrier here because we are just copying
>> +		 * information over. Anybody operating in memcg_lrus will
>> +		 * either follow the new array or the old one and they contain
>> +		 * exactly the same information. The new space in the end is
>> +		 * always empty anyway.
>> +		 *
>> +		 * We do have to make sure that no more users of the old
>> +		 * memcg_lrus array exist before we free, and this is achieved
>> +		 * by the synchronize_lru below.
>> +		 */
>> +		if (lru_memcg_is_assigned(lru)) {
>> +			synchronize_rcu();
>> +			kfree(old_array);
>> +		}
>> +
>> +	}
>> +
>> +	if (lru_memcg_is_assigned(lru)) {
>> +		lru->memcg_lrus[num_groups - 1] = lru_array;
> 
> Can't this pointer already set ?
> 
If it is, it is a bug. I can set VM_BUG_ON here to catch those cases.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/7] memcg,list_lru: duplicate LRUs upon kmemcg creation
@ 2013-02-15 10:36         ` Glauber Costa
  0 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-15 10:36 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-mm, cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	Dave Shrinnker, linux-fsdevel, Dave Chinner, Mel Gorman,
	Rik van Riel, Hugh Dickins

On 02/15/2013 01:21 PM, Kamezawa Hiroyuki wrote:
> (2013/02/08 22:07), Glauber Costa wrote:
>> When a new memcg is created, we need to open up room for its descriptors
>> in all of the list_lrus that are marked per-memcg. The process is quite
>> similar to the one we are using for the kmem caches: we initialize the
>> new structures in an array indexed by kmemcg_id, and grow the array if
>> needed. Key data like the size of the array will be shared between the
>> kmem cache code and the list_lru code (they basically describe the same
>> thing)
>>
>> Signed-off-by: Glauber Costa <glommer@parallels.com>
>> Cc: Dave Chinner <dchinner@redhat.com>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Rik van Riel <riel@redhat.com>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Michal Hocko <mhocko@suse.cz>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> ---
>>   include/linux/list_lru.h   |  47 +++++++++++++++++
>>   include/linux/memcontrol.h |   6 +++
>>   lib/list_lru.c             | 115 +++++++++++++++++++++++++++++++++++++---
>>   mm/memcontrol.c            | 128 ++++++++++++++++++++++++++++++++++++++++++---
>>   mm/slab_common.c           |   1 -
>>   5 files changed, 283 insertions(+), 14 deletions(-)
>>
>> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
>> index 02796da..370b989 100644
>> --- a/include/linux/list_lru.h
>> +++ b/include/linux/list_lru.h
>> @@ -16,11 +16,58 @@ struct list_lru_node {
>>   	long			nr_items;
>>   } ____cacheline_aligned_in_smp;
>>   
>> +struct list_lru_array {
>> +	struct list_lru_node node[1];
>> +};
> 
> size is up to nr_node_ids ?
> 

This is a dynamic quantity, so the correct way to do it is to size it to
1 (or 0 for that matter), have it be the last element of the struct, and
then allocate the right size at allocation time.

>> +
>>   struct list_lru {
>> +	struct list_head	lrus;
>>   	struct list_lru_node	node[MAX_NUMNODES];
>>   	nodemask_t		active_nodes;
>> +#ifdef CONFIG_MEMCG_KMEM
>> +	struct list_lru_array	**memcg_lrus;
>> +#endif
>>   };
> size is up to memcg_limited_groups_array_size ?
> 
ditto. This one not only is a dynamic quantity, but also changes as
new memcgs are created.

>> +/*
>> + * We will reuse the last bit of the pointer to tell the lru subsystem that
>> + * this particular lru should be replicated when a memcg comes in.
>> + */
>> +static inline void lru_memcg_enable(struct list_lru *lru)
>> +{
>> +	lru->memcg_lrus = (void *)0x1ULL;
>> +}
>> +
> 
> This "enable" is not used in this patch itself, right ?
> 
I am not sure. It is definitely used later on, I can check and move it
if necessary.

>> +int __list_lru_init(struct list_lru *lru)
>>   {
>>   	int i;
>>   
>>   	nodes_clear(lru->active_nodes);
>> -	for (i = 0; i < MAX_NUMNODES; i++) {
>> -		spin_lock_init(&lru->node[i].lock);
>> -		INIT_LIST_HEAD(&lru->node[i].list);
>> -		lru->node[i].nr_items = 0;
>> +	for (i = 0; i < MAX_NUMNODES; i++)
>> +		list_lru_init_one(&lru->node[i]);
> 
> Hmm. lru_list is up to MAX_NUMNODES, your new one is up to nr_node_ids...
>

well spotted.
Thanks.

>> +	INIT_LIST_HEAD(&lru->lrus);
>> +	mutex_lock(&all_lrus_mutex);
>> +	list_add(&lru->lrus, &all_lrus);
>> +	ret = memcg_new_lru(lru);
>> +	mutex_unlock(&all_lrus_mutex);
>> +	return ret;
>> +}
> 
>  only writer takes this mutex ?
> 
yes. IIRC, I documented that. But I might be wrong (will check)

>> +void list_lru_destroy_memcg(struct mem_cgroup *memcg)
>> +{
>> +	struct list_lru *lru;
>> +	mutex_lock(&all_lrus_mutex);
>> +	list_for_each_entry(lru, &all_lrus, lrus) {
>> +		lru->memcg_lrus[memcg_cache_id(memcg)] = NULL;
>> +		/* everybody must beaware that this memcg is no longer valid */
> 
> Hm, the object pointed by this array entry will be freed by some other func ?

They should be destroyed before we get here, but I am skimming through
the code now, and I see they are not. On a second thought, I think it
would be simpler and less error prone if I would just free them here...

>> +		new_lru_array = kzalloc(size * sizeof(void *), GFP_KERNEL);
>> +		if (!new_lru_array) {
>> +			kfree(lru_array);
>> +			return -ENOMEM;
>> +		}
>> +
>> +		for (i = 0; i < memcg_limited_groups_array_size; i++) {
>> +			if (!lru_memcg_is_assigned(lru) || lru->memcg_lrus[i])
>> +				continue;
>> +			new_lru_array[i] =  lru->memcg_lrus[i];
>> +		}
>> +
>> +		old_array = lru->memcg_lrus;
>> +		lru->memcg_lrus = new_lru_array;
>> +		/*
>> +		 * We don't need a barrier here because we are just copying
>> +		 * information over. Anybody operating in memcg_lrus will
>> +		 * either follow the new array or the old one and they contain
>> +		 * exactly the same information. The new space in the end is
>> +		 * always empty anyway.
>> +		 *
>> +		 * We do have to make sure that no more users of the old
>> +		 * memcg_lrus array exist before we free, and this is achieved
>> +		 * by the synchronize_lru below.
>> +		 */
>> +		if (lru_memcg_is_assigned(lru)) {
>> +			synchronize_rcu();
>> +			kfree(old_array);
>> +		}
>> +
>> +	}
>> +
>> +	if (lru_memcg_is_assigned(lru)) {
>> +		lru->memcg_lrus[num_groups - 1] = lru_array;
> 
> Can't this pointer already set ?
> 
If it is, it is a bug. I can set VM_BUG_ON here to catch those cases.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/7] memcg,list_lru: duplicate LRUs upon kmemcg creation
@ 2013-02-15 10:36         ` Glauber Costa
  0 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-15 10:36 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-mm, cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	Dave Shrinnker, linux-fsdevel, Dave Chinner, Mel Gorman,
	Rik van Riel, Hugh Dickins

On 02/15/2013 01:21 PM, Kamezawa Hiroyuki wrote:
> (2013/02/08 22:07), Glauber Costa wrote:
>> When a new memcg is created, we need to open up room for its descriptors
>> in all of the list_lrus that are marked per-memcg. The process is quite
>> similar to the one we are using for the kmem caches: we initialize the
>> new structures in an array indexed by kmemcg_id, and grow the array if
>> needed. Key data like the size of the array will be shared between the
>> kmem cache code and the list_lru code (they basically describe the same
>> thing)
>>
>> Signed-off-by: Glauber Costa <glommer@parallels.com>
>> Cc: Dave Chinner <dchinner@redhat.com>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Rik van Riel <riel@redhat.com>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Michal Hocko <mhocko@suse.cz>
>> Cc: Hugh Dickins <hughd@google.com>
>> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> ---
>>   include/linux/list_lru.h   |  47 +++++++++++++++++
>>   include/linux/memcontrol.h |   6 +++
>>   lib/list_lru.c             | 115 +++++++++++++++++++++++++++++++++++++---
>>   mm/memcontrol.c            | 128 ++++++++++++++++++++++++++++++++++++++++++---
>>   mm/slab_common.c           |   1 -
>>   5 files changed, 283 insertions(+), 14 deletions(-)
>>
>> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
>> index 02796da..370b989 100644
>> --- a/include/linux/list_lru.h
>> +++ b/include/linux/list_lru.h
>> @@ -16,11 +16,58 @@ struct list_lru_node {
>>   	long			nr_items;
>>   } ____cacheline_aligned_in_smp;
>>   
>> +struct list_lru_array {
>> +	struct list_lru_node node[1];
>> +};
> 
> size is up to nr_node_ids ?
> 

This is a dynamic quantity, so the correct way to do it is to size it to
1 (or 0 for that matter), have it be the last element of the struct, and
then allocate the right size at allocation time.

>> +
>>   struct list_lru {
>> +	struct list_head	lrus;
>>   	struct list_lru_node	node[MAX_NUMNODES];
>>   	nodemask_t		active_nodes;
>> +#ifdef CONFIG_MEMCG_KMEM
>> +	struct list_lru_array	**memcg_lrus;
>> +#endif
>>   };
> size is up to memcg_limited_groups_array_size ?
> 
ditto. This one not only is a dynamic quantity, but also changes as
new memcgs are created.

>> +/*
>> + * We will reuse the last bit of the pointer to tell the lru subsystem that
>> + * this particular lru should be replicated when a memcg comes in.
>> + */
>> +static inline void lru_memcg_enable(struct list_lru *lru)
>> +{
>> +	lru->memcg_lrus = (void *)0x1ULL;
>> +}
>> +
> 
> This "enable" is not used in this patch itself, right ?
> 
I am not sure. It is definitely used later on, I can check and move it
if necessary.

>> +int __list_lru_init(struct list_lru *lru)
>>   {
>>   	int i;
>>   
>>   	nodes_clear(lru->active_nodes);
>> -	for (i = 0; i < MAX_NUMNODES; i++) {
>> -		spin_lock_init(&lru->node[i].lock);
>> -		INIT_LIST_HEAD(&lru->node[i].list);
>> -		lru->node[i].nr_items = 0;
>> +	for (i = 0; i < MAX_NUMNODES; i++)
>> +		list_lru_init_one(&lru->node[i]);
> 
> Hmm. lru_list is up to MAX_NUMNODES, your new one is up to nr_node_ids...
>

well spotted.
Thanks.

>> +	INIT_LIST_HEAD(&lru->lrus);
>> +	mutex_lock(&all_lrus_mutex);
>> +	list_add(&lru->lrus, &all_lrus);
>> +	ret = memcg_new_lru(lru);
>> +	mutex_unlock(&all_lrus_mutex);
>> +	return ret;
>> +}
> 
>  only writer takes this mutex ?
> 
yes. IIRC, I documented that. But I might be wrong (will check)

>> +void list_lru_destroy_memcg(struct mem_cgroup *memcg)
>> +{
>> +	struct list_lru *lru;
>> +	mutex_lock(&all_lrus_mutex);
>> +	list_for_each_entry(lru, &all_lrus, lrus) {
>> +		lru->memcg_lrus[memcg_cache_id(memcg)] = NULL;
>> +		/* everybody must beaware that this memcg is no longer valid */
> 
> Hm, the object pointed by this array entry will be freed by some other func ?

They should be destroyed before we get here, but I am skimming through
the code now, and I see they are not. On a second thought, I think it
would be simpler and less error prone if I would just free them here...

>> +		new_lru_array = kzalloc(size * sizeof(void *), GFP_KERNEL);
>> +		if (!new_lru_array) {
>> +			kfree(lru_array);
>> +			return -ENOMEM;
>> +		}
>> +
>> +		for (i = 0; i < memcg_limited_groups_array_size; i++) {
>> +			if (!lru_memcg_is_assigned(lru) || lru->memcg_lrus[i])
>> +				continue;
>> +			new_lru_array[i] =  lru->memcg_lrus[i];
>> +		}
>> +
>> +		old_array = lru->memcg_lrus;
>> +		lru->memcg_lrus = new_lru_array;
>> +		/*
>> +		 * We don't need a barrier here because we are just copying
>> +		 * information over. Anybody operating in memcg_lrus will
>> +		 * either follow the new array or the old one and they contain
>> +		 * exactly the same information. The new space in the end is
>> +		 * always empty anyway.
>> +		 *
>> +		 * We do have to make sure that no more users of the old
>> +		 * memcg_lrus array exist before we free, and this is achieved
>> +		 * by the synchronize_lru below.
>> +		 */
>> +		if (lru_memcg_is_assigned(lru)) {
>> +			synchronize_rcu();
>> +			kfree(old_array);
>> +		}
>> +
>> +	}
>> +
>> +	if (lru_memcg_is_assigned(lru)) {
>> +		lru->memcg_lrus[num_groups - 1] = lru_array;
> 
> Can't this pointer already set ?
> 
If it is, it is a bug. I can set VM_BUG_ON here to catch those cases.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 0/7] memcg targeted shrinking
  2013-02-15  1:28     ` Greg Thelen
  (?)
@ 2013-02-15 10:42         ` Glauber Costa
  -1 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-15 10:42 UTC (permalink / raw)
  To: Greg Thelen
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Dave Shrinnker,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

On 02/15/2013 05:28 AM, Greg Thelen wrote:
> On Fri, Feb 08 2013, Glauber Costa wrote:
> 
>> This patchset implements targeted shrinking for memcg when kmem limits are
>> present. So far, we've been accounting kernel objects but failing allocations
>> when short of memory. This is because our only option would be to call the
>> global shrinker, depleting objects from all caches and breaking isolation.
>>
>> This patchset builds upon the recent work from David Chinner
>> (http://oss.sgi.com/archives/xfs/2012-11/msg00643.html) to implement NUMA
>> aware per-node LRUs. I build heavily on its API, and its presence is implied.
>>
>> The main idea is to associate per-memcg lists with each of the LRUs. The main
>> LRU still provides a single entry point and when adding or removing an element
>> from the LRU, we use the page information to figure out which memcg it belongs
>> to and relay it to the right list.
>>
>> This patchset is still not perfect, and some uses cases still need to be
>> dealt with. But I wanted to get this out in the open sooner rather than
>> later. In particular, I have the following (noncomprehensive) todo list:
>>
>> TODO:
>> * shrink dead memcgs when global pressure kicks in.
>> * balance global reclaim among memcgs.
>> * improve testing and reliability (I am still seeing some stalls in some cases)
> 
> Do you have a git tree with these changes so I can see Dave's numa LRUs
> plus these changes?
> 
I've just uploaded the exact same thing I have sent here to:

  git://git.kernel.org/pub/scm/linux/kernel/git/glommer/memcg.git

The branch is kmemcg-lru-shrinker. Note that there is also another
branch kmemcg-shrinker that contains some other simple patches that were
not yet taken and are more stable. I eventually have to merge the two.

I also still need to incorporate the feedback from you and Kame into
that. I will be traveling until next Wednesday, so expect changes in
there around Thursday.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 0/7] memcg targeted shrinking
@ 2013-02-15 10:42         ` Glauber Costa
  0 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-15 10:42 UTC (permalink / raw)
  To: Greg Thelen
  Cc: linux-mm, cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel

On 02/15/2013 05:28 AM, Greg Thelen wrote:
> On Fri, Feb 08 2013, Glauber Costa wrote:
> 
>> This patchset implements targeted shrinking for memcg when kmem limits are
>> present. So far, we've been accounting kernel objects but failing allocations
>> when short of memory. This is because our only option would be to call the
>> global shrinker, depleting objects from all caches and breaking isolation.
>>
>> This patchset builds upon the recent work from David Chinner
>> (http://oss.sgi.com/archives/xfs/2012-11/msg00643.html) to implement NUMA
>> aware per-node LRUs. I build heavily on its API, and its presence is implied.
>>
>> The main idea is to associate per-memcg lists with each of the LRUs. The main
>> LRU still provides a single entry point and when adding or removing an element
>> from the LRU, we use the page information to figure out which memcg it belongs
>> to and relay it to the right list.
>>
>> This patchset is still not perfect, and some uses cases still need to be
>> dealt with. But I wanted to get this out in the open sooner rather than
>> later. In particular, I have the following (noncomprehensive) todo list:
>>
>> TODO:
>> * shrink dead memcgs when global pressure kicks in.
>> * balance global reclaim among memcgs.
>> * improve testing and reliability (I am still seeing some stalls in some cases)
> 
> Do you have a git tree with these changes so I can see Dave's numa LRUs
> plus these changes?
> 
I've just uploaded the exact same thing I have sent here to:

  git://git.kernel.org/pub/scm/linux/kernel/git/glommer/memcg.git

The branch is kmemcg-lru-shrinker. Note that there is also another
branch kmemcg-shrinker that contains some other simple patches that were
not yet taken and are more stable. I eventually have to merge the two.

I also still need to incorporate the feedback from you and Kame into
that. I will be traveling until next Wednesday, so expect changes in
there around Thursday.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 0/7] memcg targeted shrinking
@ 2013-02-15 10:42         ` Glauber Costa
  0 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-15 10:42 UTC (permalink / raw)
  To: Greg Thelen
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Dave Shrinnker,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

On 02/15/2013 05:28 AM, Greg Thelen wrote:
> On Fri, Feb 08 2013, Glauber Costa wrote:
> 
>> This patchset implements targeted shrinking for memcg when kmem limits are
>> present. So far, we've been accounting kernel objects but failing allocations
>> when short of memory. This is because our only option would be to call the
>> global shrinker, depleting objects from all caches and breaking isolation.
>>
>> This patchset builds upon the recent work from David Chinner
>> (http://oss.sgi.com/archives/xfs/2012-11/msg00643.html) to implement NUMA
>> aware per-node LRUs. I build heavily on its API, and its presence is implied.
>>
>> The main idea is to associate per-memcg lists with each of the LRUs. The main
>> LRU still provides a single entry point and when adding or removing an element
>> from the LRU, we use the page information to figure out which memcg it belongs
>> to and relay it to the right list.
>>
>> This patchset is still not perfect, and some uses cases still need to be
>> dealt with. But I wanted to get this out in the open sooner rather than
>> later. In particular, I have the following (noncomprehensive) todo list:
>>
>> TODO:
>> * shrink dead memcgs when global pressure kicks in.
>> * balance global reclaim among memcgs.
>> * improve testing and reliability (I am still seeing some stalls in some cases)
> 
> Do you have a git tree with these changes so I can see Dave's numa LRUs
> plus these changes?
> 
I've just uploaded the exact same thing I have sent here to:

  git://git.kernel.org/pub/scm/linux/kernel/git/glommer/memcg.git

The branch is kmemcg-lru-shrinker. Note that there is also another
branch kmemcg-shrinker that contains some other simple patches that were
not yet taken and are more stable. I eventually have to merge the two.

I also still need to incorporate the feedback from you and Kame into
that. I will be traveling until next Wednesday, so expect changes in
there around Thursday.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/7] vmscan: also shrink slab in memcg pressure
  2013-02-15  1:27     ` Greg Thelen
  (?)
@ 2013-02-15 10:46       ` Glauber Costa
  -1 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-15 10:46 UTC (permalink / raw)
  To: Greg Thelen
  Cc: linux-mm, cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel, Dave Chinner,
	Mel Gorman, Rik van Riel, Hugh Dickins

>>  
>> @@ -384,6 +387,11 @@ static inline void mem_cgroup_replace_page_cache(struct page *oldpage,
>>  				struct page *newpage)
>>  {
>>  }
>> +
>> +static inline unsigned long
>> +memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone)
>> +{
> 
> 	return 0;
> 
ok

>> +bool memcg_kmem_is_active(struct mem_cgroup *memcg)
>>  {
>>  	return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
>>  }
>> @@ -991,6 +991,15 @@ mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg, int nid, int zid,
>>  	return ret;
>>  }
>>  
>> +unsigned long
>> +memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone)
>> +{
>> +	int nid = zone_to_nid(zone);
>> +	int zid = zone_idx(zone);
>> +
>> +	return mem_cgroup_zone_nr_lru_pages(memcg, nid, zid, LRU_ALL);
> 
> Without swap enabled it seems like LRU_ALL_FILE is more appropriate.
> Maybe something like test_mem_cgroup_node_reclaimable().
> 

You are right, I will look into it.

>> +}
>> +
>>  static unsigned long
>>  mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
>>  			int nid, unsigned int lru_mask)
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 6d96280..8af0e2b 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -138,11 +138,42 @@ static bool global_reclaim(struct scan_control *sc)
>>  {
>>  	return !sc->target_mem_cgroup;
>>  }
>> +
>> +/*
>> + * kmem reclaim should usually not be triggered when we are doing targetted
>> + * reclaim. It is only valid when global reclaim is triggered, or when the
>> + * underlying memcg has kmem objects.
>> + */
>> +static bool has_kmem_reclaim(struct scan_control *sc)
>> +{
>> +	return !sc->target_mem_cgroup ||
>> +	(sc->target_mem_cgroup && memcg_kmem_is_active(sc->target_mem_cgroup));
> 
> Isn't this the same as:
> 	return !sc->target_mem_cgroup ||
> 		memcg_kmem_is_active(sc->target_mem_cgroup);
> 

Yes, it is.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/7] vmscan: also shrink slab in memcg pressure
@ 2013-02-15 10:46       ` Glauber Costa
  0 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-15 10:46 UTC (permalink / raw)
  To: Greg Thelen
  Cc: linux-mm, cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel, Dave Chinner,
	Mel Gorman, Rik van Riel, Hugh Dickins

>>  
>> @@ -384,6 +387,11 @@ static inline void mem_cgroup_replace_page_cache(struct page *oldpage,
>>  				struct page *newpage)
>>  {
>>  }
>> +
>> +static inline unsigned long
>> +memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone)
>> +{
> 
> 	return 0;
> 
ok

>> +bool memcg_kmem_is_active(struct mem_cgroup *memcg)
>>  {
>>  	return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
>>  }
>> @@ -991,6 +991,15 @@ mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg, int nid, int zid,
>>  	return ret;
>>  }
>>  
>> +unsigned long
>> +memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone)
>> +{
>> +	int nid = zone_to_nid(zone);
>> +	int zid = zone_idx(zone);
>> +
>> +	return mem_cgroup_zone_nr_lru_pages(memcg, nid, zid, LRU_ALL);
> 
> Without swap enabled it seems like LRU_ALL_FILE is more appropriate.
> Maybe something like test_mem_cgroup_node_reclaimable().
> 

You are right, I will look into it.

>> +}
>> +
>>  static unsigned long
>>  mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
>>  			int nid, unsigned int lru_mask)
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 6d96280..8af0e2b 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -138,11 +138,42 @@ static bool global_reclaim(struct scan_control *sc)
>>  {
>>  	return !sc->target_mem_cgroup;
>>  }
>> +
>> +/*
>> + * kmem reclaim should usually not be triggered when we are doing targetted
>> + * reclaim. It is only valid when global reclaim is triggered, or when the
>> + * underlying memcg has kmem objects.
>> + */
>> +static bool has_kmem_reclaim(struct scan_control *sc)
>> +{
>> +	return !sc->target_mem_cgroup ||
>> +	(sc->target_mem_cgroup && memcg_kmem_is_active(sc->target_mem_cgroup));
> 
> Isn't this the same as:
> 	return !sc->target_mem_cgroup ||
> 		memcg_kmem_is_active(sc->target_mem_cgroup);
> 

Yes, it is.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 1/7] vmscan: also shrink slab in memcg pressure
@ 2013-02-15 10:46       ` Glauber Costa
  0 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-15 10:46 UTC (permalink / raw)
  To: Greg Thelen
  Cc: linux-mm, cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel, Dave Chinner,
	Mel Gorman, Rik van Riel, Hugh Dickins

>>  
>> @@ -384,6 +387,11 @@ static inline void mem_cgroup_replace_page_cache(struct page *oldpage,
>>  				struct page *newpage)
>>  {
>>  }
>> +
>> +static inline unsigned long
>> +memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone)
>> +{
> 
> 	return 0;
> 
ok

>> +bool memcg_kmem_is_active(struct mem_cgroup *memcg)
>>  {
>>  	return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
>>  }
>> @@ -991,6 +991,15 @@ mem_cgroup_zone_nr_lru_pages(struct mem_cgroup *memcg, int nid, int zid,
>>  	return ret;
>>  }
>>  
>> +unsigned long
>> +memcg_zone_reclaimable_pages(struct mem_cgroup *memcg, struct zone *zone)
>> +{
>> +	int nid = zone_to_nid(zone);
>> +	int zid = zone_idx(zone);
>> +
>> +	return mem_cgroup_zone_nr_lru_pages(memcg, nid, zid, LRU_ALL);
> 
> Without swap enabled it seems like LRU_ALL_FILE is more appropriate.
> Maybe something like test_mem_cgroup_node_reclaimable().
> 

You are right, I will look into it.

>> +}
>> +
>>  static unsigned long
>>  mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
>>  			int nid, unsigned int lru_mask)
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 6d96280..8af0e2b 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -138,11 +138,42 @@ static bool global_reclaim(struct scan_control *sc)
>>  {
>>  	return !sc->target_mem_cgroup;
>>  }
>> +
>> +/*
>> + * kmem reclaim should usually not be triggered when we are doing targetted
>> + * reclaim. It is only valid when global reclaim is triggered, or when the
>> + * underlying memcg has kmem objects.
>> + */
>> +static bool has_kmem_reclaim(struct scan_control *sc)
>> +{
>> +	return !sc->target_mem_cgroup ||
>> +	(sc->target_mem_cgroup && memcg_kmem_is_active(sc->target_mem_cgroup));
> 
> Isn't this the same as:
> 	return !sc->target_mem_cgroup ||
> 		memcg_kmem_is_active(sc->target_mem_cgroup);
> 

Yes, it is.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/7] memcg,list_lru: duplicate LRUs upon kmemcg creation
  2013-02-15  1:31     ` Greg Thelen
  (?)
@ 2013-02-15 10:54         ` Glauber Costa
  -1 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-15 10:54 UTC (permalink / raw)
  To: Greg Thelen
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Dave Shrinnker,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Dave Chinner, Mel Gorman,
	Rik van Riel, Hugh Dickins


>>
>> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
>> index 02796da..370b989 100644
>> --- a/include/linux/list_lru.h
>> +++ b/include/linux/list_lru.h
>> @@ -16,11 +16,58 @@ struct list_lru_node {
>>  	long			nr_items;
>>  } ____cacheline_aligned_in_smp;
>>  
>> +struct list_lru_array {
>> +	struct list_lru_node node[1];
>> +};
>> +
>>  struct list_lru {
>> +	struct list_head	lrus;
>>  	struct list_lru_node	node[MAX_NUMNODES];
>>  	nodemask_t		active_nodes;
>> +#ifdef CONFIG_MEMCG_KMEM
>> +	struct list_lru_array	**memcg_lrus;
> 
> Probably need a comment regarding that 0x1 is a magic value and
> describing what indexes this lazily constructed array. 

Ok.

> Is the primary
> index memcg_kmem_id and the secondary index a nid?
> 

Precisely. The first level is an array of pointers to list_lru_array.
And each list_lru_array is an array of nids.

>> +struct mem_cgroup;
>> +#ifdef CONFIG_MEMCG_KMEM
>> +/*
>> + * We will reuse the last bit of the pointer to tell the lru subsystem that
>> + * this particular lru should be replicated when a memcg comes in.
>> + */
> 
> From this patch it seems like 0x1 is a magic value rather than bit 0
> being special.  memcg_lrus is either 0x1 or a pointer to an array of
> struct list_lru_array.  The array is indexed by memcg_kmem_id.
> 

Well, I thought in terms of "set the last bit". To be honest, when I
first designed this, I figured it could possibly be useful to keep the
bit set at all times, and that is why I used the LSB. Since I turned out
not using it, maybe we could actually resort to a fully fledged magical
to avoid the confusion?

>> +static inline void lru_memcg_enable(struct list_lru *lru)
>> +/*
>> + * This will return true if we have already allocated and assignment a memcg
>> + * pointer set to the LRU. Therefore, we need to mask the first bit out
>> + */
>> +static inline bool lru_memcg_is_assigned(struct list_lru *lru)
>> +{
>> +	return (unsigned long)lru->memcg_lrus & ~0x1ULL;
> 
> Is this equivalent to?
> 	return lru->memcg_lrus != NULL && lru->memcg_lrus != 0x1
> 
yes. What I've explained above should help clarifying why I wrote it
this way. But if we use an actual magical (0x1 is a bad magical, IMHO),
the intentions become a lot clearer.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/7] memcg,list_lru: duplicate LRUs upon kmemcg creation
@ 2013-02-15 10:54         ` Glauber Costa
  0 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-15 10:54 UTC (permalink / raw)
  To: Greg Thelen
  Cc: linux-mm, cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel, Dave Chinner,
	Mel Gorman, Rik van Riel, Hugh Dickins


>>
>> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
>> index 02796da..370b989 100644
>> --- a/include/linux/list_lru.h
>> +++ b/include/linux/list_lru.h
>> @@ -16,11 +16,58 @@ struct list_lru_node {
>>  	long			nr_items;
>>  } ____cacheline_aligned_in_smp;
>>  
>> +struct list_lru_array {
>> +	struct list_lru_node node[1];
>> +};
>> +
>>  struct list_lru {
>> +	struct list_head	lrus;
>>  	struct list_lru_node	node[MAX_NUMNODES];
>>  	nodemask_t		active_nodes;
>> +#ifdef CONFIG_MEMCG_KMEM
>> +	struct list_lru_array	**memcg_lrus;
> 
> Probably need a comment regarding that 0x1 is a magic value and
> describing what indexes this lazily constructed array. 

Ok.

> Is the primary
> index memcg_kmem_id and the secondary index a nid?
> 

Precisely. The first level is an array of pointers to list_lru_array.
And each list_lru_array is an array of nids.

>> +struct mem_cgroup;
>> +#ifdef CONFIG_MEMCG_KMEM
>> +/*
>> + * We will reuse the last bit of the pointer to tell the lru subsystem that
>> + * this particular lru should be replicated when a memcg comes in.
>> + */
> 
> From this patch it seems like 0x1 is a magic value rather than bit 0
> being special.  memcg_lrus is either 0x1 or a pointer to an array of
> struct list_lru_array.  The array is indexed by memcg_kmem_id.
> 

Well, I thought in terms of "set the last bit". To be honest, when I
first designed this, I figured it could possibly be useful to keep the
bit set at all times, and that is why I used the LSB. Since I turned out
not using it, maybe we could actually resort to a fully fledged magical
to avoid the confusion?

>> +static inline void lru_memcg_enable(struct list_lru *lru)
>> +/*
>> + * This will return true if we have already allocated and assignment a memcg
>> + * pointer set to the LRU. Therefore, we need to mask the first bit out
>> + */
>> +static inline bool lru_memcg_is_assigned(struct list_lru *lru)
>> +{
>> +	return (unsigned long)lru->memcg_lrus & ~0x1ULL;
> 
> Is this equivalent to?
> 	return lru->memcg_lrus != NULL && lru->memcg_lrus != 0x1
> 
yes. What I've explained above should help clarifying why I wrote it
this way. But if we use an actual magical (0x1 is a bad magical, IMHO),
the intentions become a lot clearer.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/7] memcg,list_lru: duplicate LRUs upon kmemcg creation
@ 2013-02-15 10:54         ` Glauber Costa
  0 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-15 10:54 UTC (permalink / raw)
  To: Greg Thelen
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Dave Shrinnker,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Dave Chinner, Mel Gorman,
	Rik van Riel, Hugh Dickins


>>
>> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
>> index 02796da..370b989 100644
>> --- a/include/linux/list_lru.h
>> +++ b/include/linux/list_lru.h
>> @@ -16,11 +16,58 @@ struct list_lru_node {
>>  	long			nr_items;
>>  } ____cacheline_aligned_in_smp;
>>  
>> +struct list_lru_array {
>> +	struct list_lru_node node[1];
>> +};
>> +
>>  struct list_lru {
>> +	struct list_head	lrus;
>>  	struct list_lru_node	node[MAX_NUMNODES];
>>  	nodemask_t		active_nodes;
>> +#ifdef CONFIG_MEMCG_KMEM
>> +	struct list_lru_array	**memcg_lrus;
> 
> Probably need a comment regarding that 0x1 is a magic value and
> describing what indexes this lazily constructed array. 

Ok.

> Is the primary
> index memcg_kmem_id and the secondary index a nid?
> 

Precisely. The first level is an array of pointers to list_lru_array.
And each list_lru_array is an array of nids.

>> +struct mem_cgroup;
>> +#ifdef CONFIG_MEMCG_KMEM
>> +/*
>> + * We will reuse the last bit of the pointer to tell the lru subsystem that
>> + * this particular lru should be replicated when a memcg comes in.
>> + */
> 
> From this patch it seems like 0x1 is a magic value rather than bit 0
> being special.  memcg_lrus is either 0x1 or a pointer to an array of
> struct list_lru_array.  The array is indexed by memcg_kmem_id.
> 

Well, I thought in terms of "set the last bit". To be honest, when I
first designed this, I figured it could possibly be useful to keep the
bit set at all times, and that is why I used the LSB. Since I turned out
not using it, maybe we could actually resort to a fully fledged magical
to avoid the confusion?

>> +static inline void lru_memcg_enable(struct list_lru *lru)
>> +/*
>> + * This will return true if we have already allocated and assignment a memcg
>> + * pointer set to the LRU. Therefore, we need to mask the first bit out
>> + */
>> +static inline bool lru_memcg_is_assigned(struct list_lru *lru)
>> +{
>> +	return (unsigned long)lru->memcg_lrus & ~0x1ULL;
> 
> Is this equivalent to?
> 	return lru->memcg_lrus != NULL && lru->memcg_lrus != 0x1
> 
yes. What I've explained above should help clarifying why I wrote it
this way. But if we use an actual magical (0x1 is a bad magical, IMHO),
the intentions become a lot clearer.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/7] lru: add an element to a memcg list
  2013-02-15  1:32     ` Greg Thelen
  (?)
@ 2013-02-15 10:57         ` Glauber Costa
  -1 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-15 10:57 UTC (permalink / raw)
  To: Greg Thelen
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Dave Shrinnker,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Dave Chinner, Mel Gorman,
	Rik van Riel, Hugh Dickins

On 02/15/2013 05:32 AM, Greg Thelen wrote:
> On Fri, Feb 08 2013, Glauber Costa wrote:
> 
>> With the infrastructure we now have, we can add an element to a memcg
>> LRU list instead of the global list. The memcg lists are still
>> per-node.
> 
> [...]
> 
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index b9e1941..bfb4b5b 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -3319,6 +3319,36 @@ static inline void memcg_resume_kmem_account(void)
>>  	current->memcg_kmem_skip_account--;
>>  }
>>  
>> +static struct mem_cgroup *mem_cgroup_from_kmem_page(struct page *page)
>> +{
>> +	struct page_cgroup *pc;
>> +	struct mem_cgroup *memcg = NULL;
>> +
>> +	pc = lookup_page_cgroup(page);
>> +	if (!PageCgroupUsed(pc))
>> +		return NULL;
>> +
>> +	lock_page_cgroup(pc);
>> +	if (PageCgroupUsed(pc))
>> +		memcg = pc->mem_cgroup;
>> +	unlock_page_cgroup(pc);
> 
> Once we drop the lock, is there anything that needs protection
> (e.g. PageCgroupUsed)?  If there's no problem, then what's the point of
> taking the lock?
> 

This is the same pattern already used in the rest of memcg, and I just
transposing it here. From my understanding, we need to make sure that if
the Used bit is not set, we don't rely on the memcg information. So we
take the lock to guarantee that the big is not cleared in the meantime.
But after that, we should be fine.

Kame, you have any input?

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/7] lru: add an element to a memcg list
@ 2013-02-15 10:57         ` Glauber Costa
  0 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-15 10:57 UTC (permalink / raw)
  To: Greg Thelen
  Cc: linux-mm, cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel, Dave Chinner,
	Mel Gorman, Rik van Riel, Hugh Dickins

On 02/15/2013 05:32 AM, Greg Thelen wrote:
> On Fri, Feb 08 2013, Glauber Costa wrote:
> 
>> With the infrastructure we now have, we can add an element to a memcg
>> LRU list instead of the global list. The memcg lists are still
>> per-node.
> 
> [...]
> 
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index b9e1941..bfb4b5b 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -3319,6 +3319,36 @@ static inline void memcg_resume_kmem_account(void)
>>  	current->memcg_kmem_skip_account--;
>>  }
>>  
>> +static struct mem_cgroup *mem_cgroup_from_kmem_page(struct page *page)
>> +{
>> +	struct page_cgroup *pc;
>> +	struct mem_cgroup *memcg = NULL;
>> +
>> +	pc = lookup_page_cgroup(page);
>> +	if (!PageCgroupUsed(pc))
>> +		return NULL;
>> +
>> +	lock_page_cgroup(pc);
>> +	if (PageCgroupUsed(pc))
>> +		memcg = pc->mem_cgroup;
>> +	unlock_page_cgroup(pc);
> 
> Once we drop the lock, is there anything that needs protection
> (e.g. PageCgroupUsed)?  If there's no problem, then what's the point of
> taking the lock?
> 

This is the same pattern already used in the rest of memcg, and I just
transposing it here. From my understanding, we need to make sure that if
the Used bit is not set, we don't rely on the memcg information. So we
take the lock to guarantee that the big is not cleared in the meantime.
But after that, we should be fine.

Kame, you have any input?



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 3/7] lru: add an element to a memcg list
@ 2013-02-15 10:57         ` Glauber Costa
  0 siblings, 0 replies; 60+ messages in thread
From: Glauber Costa @ 2013-02-15 10:57 UTC (permalink / raw)
  To: Greg Thelen
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, Dave Shrinnker,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Dave Chinner, Mel Gorman,
	Rik van Riel, Hugh Dickins

On 02/15/2013 05:32 AM, Greg Thelen wrote:
> On Fri, Feb 08 2013, Glauber Costa wrote:
> 
>> With the infrastructure we now have, we can add an element to a memcg
>> LRU list instead of the global list. The memcg lists are still
>> per-node.
> 
> [...]
> 
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index b9e1941..bfb4b5b 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -3319,6 +3319,36 @@ static inline void memcg_resume_kmem_account(void)
>>  	current->memcg_kmem_skip_account--;
>>  }
>>  
>> +static struct mem_cgroup *mem_cgroup_from_kmem_page(struct page *page)
>> +{
>> +	struct page_cgroup *pc;
>> +	struct mem_cgroup *memcg = NULL;
>> +
>> +	pc = lookup_page_cgroup(page);
>> +	if (!PageCgroupUsed(pc))
>> +		return NULL;
>> +
>> +	lock_page_cgroup(pc);
>> +	if (PageCgroupUsed(pc))
>> +		memcg = pc->mem_cgroup;
>> +	unlock_page_cgroup(pc);
> 
> Once we drop the lock, is there anything that needs protection
> (e.g. PageCgroupUsed)?  If there's no problem, then what's the point of
> taking the lock?
> 

This is the same pattern already used in the rest of memcg, and I just
transposing it here. From my understanding, we need to make sure that if
the Used bit is not set, we don't rely on the memcg information. So we
take the lock to guarantee that the big is not cleared in the meantime.
But after that, we should be fine.

Kame, you have any input?



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/7] memcg,list_lru: duplicate LRUs upon kmemcg creation
  2013-02-15 10:54         ` Glauber Costa
  (?)
@ 2013-02-20  7:46           ` Greg Thelen
  -1 siblings, 0 replies; 60+ messages in thread
From: Greg Thelen @ 2013-02-20  7:46 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel, Dave Chinner,
	Mel Gorman, Rik van Riel, Hugh Dickins

On Fri, Feb 15 2013, Glauber Costa wrote:

>>> +struct mem_cgroup;
>>> +#ifdef CONFIG_MEMCG_KMEM
>>> +/*
>>> + * We will reuse the last bit of the pointer to tell the lru subsystem that
>>> + * this particular lru should be replicated when a memcg comes in.
>>> + */
>> 
>> From this patch it seems like 0x1 is a magic value rather than bit 0
>> being special.  memcg_lrus is either 0x1 or a pointer to an array of
>> struct list_lru_array.  The array is indexed by memcg_kmem_id.
>> 
>
> Well, I thought in terms of "set the last bit". To be honest, when I
> first designed this, I figured it could possibly be useful to keep the
> bit set at all times, and that is why I used the LSB. Since I turned out
> not using it, maybe we could actually resort to a fully fledged magical
> to avoid the confusion?

To avoid confusion, I'd prefer a magic value.  This allows callers to
not worrying about having to strip off the low order bit, if it's later
always set for some reason.  But I'm not even sure we need a magic value
or magic bit (see below).

>>> +static inline void lru_memcg_enable(struct list_lru *lru)
>>> +/*
>>> + * This will return true if we have already allocated and assignment a memcg
>>> + * pointer set to the LRU. Therefore, we need to mask the first bit out
>>> + */
>>> +static inline bool lru_memcg_is_assigned(struct list_lru *lru)
>>> +{
>>> +	return (unsigned long)lru->memcg_lrus & ~0x1ULL;
>> 
>> Is this equivalent to?
>> 	return lru->memcg_lrus != NULL && lru->memcg_lrus != 0x1
>> 
> yes. What I've explained above should help clarifying why I wrote it
> this way. But if we use an actual magical (0x1 is a bad magical, IMHO),
> the intentions become a lot clearer.

Does the following work and yield simpler code?
1. add a 'bool memcg_enabled' parameter to list_lru_init()
2. rename all_lrus to all_memcg_lrus
3. only add lru to all_memcg_lrus if memcg_enabled is set
4. delete lru_memcg_enable()
5. redefine lru_memcg_is_assigned() to just test (lru->memcg_lrus == NULL)

Then we don't need a magic valid (or LSB) to identify memcg enabled
lrus.  Any lru in the all_memcg_lrus list is memcg enabled.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/7] memcg,list_lru: duplicate LRUs upon kmemcg creation
@ 2013-02-20  7:46           ` Greg Thelen
  0 siblings, 0 replies; 60+ messages in thread
From: Greg Thelen @ 2013-02-20  7:46 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel, Dave Chinner,
	Mel Gorman, Rik van Riel, Hugh Dickins

On Fri, Feb 15 2013, Glauber Costa wrote:

>>> +struct mem_cgroup;
>>> +#ifdef CONFIG_MEMCG_KMEM
>>> +/*
>>> + * We will reuse the last bit of the pointer to tell the lru subsystem that
>>> + * this particular lru should be replicated when a memcg comes in.
>>> + */
>> 
>> From this patch it seems like 0x1 is a magic value rather than bit 0
>> being special.  memcg_lrus is either 0x1 or a pointer to an array of
>> struct list_lru_array.  The array is indexed by memcg_kmem_id.
>> 
>
> Well, I thought in terms of "set the last bit". To be honest, when I
> first designed this, I figured it could possibly be useful to keep the
> bit set at all times, and that is why I used the LSB. Since I turned out
> not using it, maybe we could actually resort to a fully fledged magical
> to avoid the confusion?

To avoid confusion, I'd prefer a magic value.  This allows callers to
not worrying about having to strip off the low order bit, if it's later
always set for some reason.  But I'm not even sure we need a magic value
or magic bit (see below).

>>> +static inline void lru_memcg_enable(struct list_lru *lru)
>>> +/*
>>> + * This will return true if we have already allocated and assignment a memcg
>>> + * pointer set to the LRU. Therefore, we need to mask the first bit out
>>> + */
>>> +static inline bool lru_memcg_is_assigned(struct list_lru *lru)
>>> +{
>>> +	return (unsigned long)lru->memcg_lrus & ~0x1ULL;
>> 
>> Is this equivalent to?
>> 	return lru->memcg_lrus != NULL && lru->memcg_lrus != 0x1
>> 
> yes. What I've explained above should help clarifying why I wrote it
> this way. But if we use an actual magical (0x1 is a bad magical, IMHO),
> the intentions become a lot clearer.

Does the following work and yield simpler code?
1. add a 'bool memcg_enabled' parameter to list_lru_init()
2. rename all_lrus to all_memcg_lrus
3. only add lru to all_memcg_lrus if memcg_enabled is set
4. delete lru_memcg_enable()
5. redefine lru_memcg_is_assigned() to just test (lru->memcg_lrus == NULL)

Then we don't need a magic valid (or LSB) to identify memcg enabled
lrus.  Any lru in the all_memcg_lrus list is memcg enabled.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [PATCH 2/7] memcg,list_lru: duplicate LRUs upon kmemcg creation
@ 2013-02-20  7:46           ` Greg Thelen
  0 siblings, 0 replies; 60+ messages in thread
From: Greg Thelen @ 2013-02-20  7:46 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, cgroups, Andrew Morton, Michal Hocko, Johannes Weiner,
	kamezawa.hiroyu, Dave Shrinnker, linux-fsdevel, Dave Chinner,
	Mel Gorman, Rik van Riel, Hugh Dickins

On Fri, Feb 15 2013, Glauber Costa wrote:

>>> +struct mem_cgroup;
>>> +#ifdef CONFIG_MEMCG_KMEM
>>> +/*
>>> + * We will reuse the last bit of the pointer to tell the lru subsystem that
>>> + * this particular lru should be replicated when a memcg comes in.
>>> + */
>> 
>> From this patch it seems like 0x1 is a magic value rather than bit 0
>> being special.  memcg_lrus is either 0x1 or a pointer to an array of
>> struct list_lru_array.  The array is indexed by memcg_kmem_id.
>> 
>
> Well, I thought in terms of "set the last bit". To be honest, when I
> first designed this, I figured it could possibly be useful to keep the
> bit set at all times, and that is why I used the LSB. Since I turned out
> not using it, maybe we could actually resort to a fully fledged magical
> to avoid the confusion?

To avoid confusion, I'd prefer a magic value.  This allows callers to
not worrying about having to strip off the low order bit, if it's later
always set for some reason.  But I'm not even sure we need a magic value
or magic bit (see below).

>>> +static inline void lru_memcg_enable(struct list_lru *lru)
>>> +/*
>>> + * This will return true if we have already allocated and assignment a memcg
>>> + * pointer set to the LRU. Therefore, we need to mask the first bit out
>>> + */
>>> +static inline bool lru_memcg_is_assigned(struct list_lru *lru)
>>> +{
>>> +	return (unsigned long)lru->memcg_lrus & ~0x1ULL;
>> 
>> Is this equivalent to?
>> 	return lru->memcg_lrus != NULL && lru->memcg_lrus != 0x1
>> 
> yes. What I've explained above should help clarifying why I wrote it
> this way. But if we use an actual magical (0x1 is a bad magical, IMHO),
> the intentions become a lot clearer.

Does the following work and yield simpler code?
1. add a 'bool memcg_enabled' parameter to list_lru_init()
2. rename all_lrus to all_memcg_lrus
3. only add lru to all_memcg_lrus if memcg_enabled is set
4. delete lru_memcg_enable()
5. redefine lru_memcg_is_assigned() to just test (lru->memcg_lrus == NULL)

Then we don't need a magic valid (or LSB) to identify memcg enabled
lrus.  Any lru in the all_memcg_lrus list is memcg enabled.

^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2013-02-20  7:46 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-02-08 13:07 [PATCH 0/7] memcg targeted shrinking Glauber Costa
2013-02-08 13:07 ` Glauber Costa
2013-02-08 13:07 ` Glauber Costa
2013-02-08 13:07 ` [PATCH 1/7] vmscan: also shrink slab in memcg pressure Glauber Costa
2013-02-08 13:07   ` Glauber Costa
2013-02-08 13:07   ` Glauber Costa
2013-02-15  1:27   ` Greg Thelen
2013-02-15  1:27     ` Greg Thelen
2013-02-15  1:27     ` Greg Thelen
2013-02-15 10:46     ` Glauber Costa
2013-02-15 10:46       ` Glauber Costa
2013-02-15 10:46       ` Glauber Costa
2013-02-15  8:37   ` Kamezawa Hiroyuki
2013-02-15  8:37     ` Kamezawa Hiroyuki
     [not found]     ` <511DF3CB.7020206-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
2013-02-15 10:30       ` Glauber Costa
2013-02-15 10:30         ` Glauber Costa
2013-02-15 10:30         ` Glauber Costa
2013-02-08 13:07 ` [PATCH 2/7] memcg,list_lru: duplicate LRUs upon kmemcg creation Glauber Costa
2013-02-08 13:07   ` Glauber Costa
2013-02-08 13:07   ` Glauber Costa
2013-02-15  1:31   ` Greg Thelen
2013-02-15  1:31     ` Greg Thelen
2013-02-15  1:31     ` Greg Thelen
     [not found]     ` <xr934nhenz18.fsf-aSPv4SP+Du0KgorLzL7FmE7CuiCeIGUxQQ4Iyu8u01E@public.gmane.org>
2013-02-15 10:54       ` Glauber Costa
2013-02-15 10:54         ` Glauber Costa
2013-02-15 10:54         ` Glauber Costa
2013-02-20  7:46         ` Greg Thelen
2013-02-20  7:46           ` Greg Thelen
2013-02-20  7:46           ` Greg Thelen
     [not found]   ` <1360328857-28070-3-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2013-02-15  9:21     ` Kamezawa Hiroyuki
2013-02-15  9:21       ` Kamezawa Hiroyuki
2013-02-15 10:36       ` Glauber Costa
2013-02-15 10:36         ` Glauber Costa
2013-02-15 10:36         ` Glauber Costa
2013-02-08 13:07 ` [PATCH 3/7] lru: add an element to a memcg list Glauber Costa
2013-02-08 13:07   ` Glauber Costa
2013-02-08 13:07   ` Glauber Costa
2013-02-15  1:32   ` Greg Thelen
2013-02-15  1:32     ` Greg Thelen
     [not found]     ` <xr93txpemkeo.fsf-aSPv4SP+Du0KgorLzL7FmE7CuiCeIGUxQQ4Iyu8u01E@public.gmane.org>
2013-02-15 10:57       ` Glauber Costa
2013-02-15 10:57         ` Glauber Costa
2013-02-15 10:57         ` Glauber Costa
2013-02-08 13:07 ` [PATCH 4/7] list_lru: also include memcg lists in counts and scans Glauber Costa
2013-02-08 13:07   ` Glauber Costa
2013-02-08 13:07   ` Glauber Costa
2013-02-08 13:07 ` [PATCH 5/7] list_lru: per-memcg walks Glauber Costa
2013-02-08 13:07   ` Glauber Costa
2013-02-08 13:07   ` Glauber Costa
2013-02-08 13:07 ` [PATCH 6/7] super: targeted memcg reclaim Glauber Costa
2013-02-08 13:07   ` Glauber Costa
2013-02-08 13:07   ` Glauber Costa
2013-02-08 13:07 ` [PATCH 7/7] memcg: per-memcg kmem shrinking Glauber Costa
2013-02-08 13:07   ` Glauber Costa
2013-02-08 13:07   ` Glauber Costa
     [not found] ` <1360328857-28070-1-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
2013-02-15  1:28   ` [PATCH 0/7] memcg targeted shrinking Greg Thelen
2013-02-15  1:28     ` Greg Thelen
2013-02-15  1:28     ` Greg Thelen
     [not found]     ` <xr93ip5unz52.fsf-aSPv4SP+Du0KgorLzL7FmE7CuiCeIGUxQQ4Iyu8u01E@public.gmane.org>
2013-02-15 10:42       ` Glauber Costa
2013-02-15 10:42         ` Glauber Costa
2013-02-15 10:42         ` Glauber Costa

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.