linux-mm.kvack.org archive mirror
* [RFC 0/7 -v2] memcg cleanups + get rid of mm_struct::owner
From: Michal Hocko @ 2015-05-29 11:57 UTC
  To: linux-mm
  Cc: Johannes Weiner, Andrew Morton, Tejun Heo, Oleg Nesterov,
	Vladimir Davydov, Greg Thelen, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

Hi,
this is the second version of the RFC. The previous version was posted
here: http://marc.info/?l=linux-mm&m=143264102317318&w=2. The number of
patches has grown based on the previous feedback.

Johannes has suggested exporting struct mem_cgroup because it will make
some other things easier. So even though this is not directly related
to the patchset, I have posted it together because there are some
dependencies. The first two patches do this move.

The third patch simply cleans up some extern declarations in memcontrol.h
because we are not consistent in that regard.

The fourth patch is the same one from Tejun; it cleans up
mem_cgroup_can_attach a bit and the follow-up patches depend on it.

The fifth patch is preparatory and touches sock_update_memcg, which
doesn't check for cg_proto == NULL. It doesn't have to do so currently
because mem_cgroup_from_task always returns non-NULL, but the code is
awkward and this will no longer be true with the follow-up patch.

All these can go in without the rest IMO.

Patch 6 is the core one and it gets rid of mm_struct::owner
in favor of mm_struct::memcg. The rationale is described in the patch
so I will not repeat it here. The changes since the last post
are:
	- added Suggested-by for Oleg as it was him who kicked me into
	  doing it
	- exec_mmap association was missing [Oleg]
	- mem_cgroup_from_task cannot return NULL anymore [Oleg]
	- mm_match_cgroup doesn't have to play games and it can use
	  rcu_dereference after struct mem_cgroup is exported [Johannes]
	- drop mm_move_memcg as it has only one user in memcontrol.c
	  so it can be opencoded [Johannes]
	- functions to associate and drop mm->memcg association have
	  to be static inline for !CONFIG_MEMCG [Johannes]
	- drop synchronize_rcu nonsense during memcg move [Johannes]
	- drop "memcg: Use mc.moving_task as the indication for charge
	  moving" patch because we can get the target task from css
	  easily [Johannes]
	- renamed css_set_memcg to css_inherit_memcg because the new
	  name better suits its usage
	- dropped css_get(&from->css) during move because it is pinned
	  already by the mm. We just have to be careful not to drop
	  it before we are really done during migration.

I have compile tested this with my usual battery of configs plus
randconfigs and nothing blew up.

I have also runtime tested that charges end up in the proper memcg,
including migration between two memcgs as fast as possible. Except for
pre-existing races, no regressions seemed to be introduced.

I was worried about the user-visible change this patch introduces but
Johannes seems to be OK with that. The changelog states that the
potential use case is not really worth all the trouble it would impose
on the memcg behavior. I am still posting this as an RFC to give others
more time to comment.

The last patch gets rid of mem_cgroup_from_task because this is a really
weird interface (see more in the changelog). After mm->owner is gone
we have only two callers remaining and both of them can be changed
not to use it. So better to get rid of it before we gain new callers.

Shortlog says:
Michal Hocko (6):
      memcg: export struct mem_cgroup
      memcg: get rid of extern for functions in memcontrol.h
      memcg, mm: move mem_cgroup_select_victim_node into vmscan
      memcg, tcp_kmem: check for cg_proto in sock_update_memcg
      memcg: get rid of mm_struct::owner
      memcg: get rid of mem_cgroup_from_task

Tejun Heo (1):
      memcg: restructure mem_cgroup_can_attach()

And the diffstat looks promising as well.
 fs/exec.c                  |   2 +-
 include/linux/memcontrol.h | 437 +++++++++++++++++++++++++++++++++----
 include/linux/mm_types.h   |  12 +-
 include/linux/swap.h       |  10 +-
 include/net/sock.h         |  28 ---
 kernel/exit.c              |  89 --------
 kernel/fork.c              |  10 +-
 mm/debug.c                 |   4 +-
 mm/memcontrol.c            | 527 ++++++---------------------------------------
 mm/memory-failure.c        |   2 +-
 mm/slab_common.c           |   2 +-
 mm/vmscan.c                |  96 +++++++++
 12 files changed, 577 insertions(+), 642 deletions(-)

Thanks


* [RFC -v2 1/7] memcg: export struct mem_cgroup
From: Michal Hocko @ 2015-05-29 11:57 UTC
  To: linux-mm
  Cc: Johannes Weiner, Andrew Morton, Tejun Heo, Oleg Nesterov,
	Vladimir Davydov, Greg Thelen, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

The mem_cgroup structure is currently defined in mm/memcontrol.c, which
means that code outside of this file has to use the external API even
for trivial accesses.

This patch exports struct mem_cgroup with its dependencies and makes
some of the exported functions inline. This even helps to reduce the
code size a bit (make defconfig + CONFIG_MEMCG=y):

text		data	bss	dec		 hex	filename
12177806        1791000 1085440 15054246         e5b5a6 vmlinux.before
12177534        1791000 1085440 15053974         e5b496 vmlinux.after

This is not much (272B) but better than nothing. We also save a function
call in some hot paths like the callers of mem_cgroup_count_vm_event,
which is used for accounting.

The patch doesn't introduce any functional changes.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/memcontrol.h | 366 ++++++++++++++++++++++++++++++++++++++++-----
 include/linux/swap.h       |  10 +-
 include/net/sock.h         |  28 ----
 mm/memcontrol.c            | 305 -------------------------------------
 mm/memory-failure.c        |   2 +-
 mm/slab_common.c           |   2 +-
 6 files changed, 343 insertions(+), 370 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6c8918114804..3d9d7ff0d93e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -23,8 +23,10 @@
 #include <linux/vm_event_item.h>
 #include <linux/hardirq.h>
 #include <linux/jump_label.h>
+#include <linux/page_counter.h>
+#include <linux/vmpressure.h>
+#include <linux/mmzone.h>
 
-struct mem_cgroup;
 struct page;
 struct mm_struct;
 struct kmem_cache;
@@ -66,10 +68,220 @@ enum mem_cgroup_events_index {
 	MEMCG_NR_EVENTS,
 };
 
+/*
+ * Per memcg event counter is incremented at every pagein/pageout. With THP,
+ * it will be incremented by the number of pages. This counter is used
+ * to trigger some periodic events. This is straightforward and better
+ * than using jiffies etc. to handle periodic memcg events.
+ */
+enum mem_cgroup_events_target {
+	MEM_CGROUP_TARGET_THRESH,
+	MEM_CGROUP_TARGET_SOFTLIMIT,
+	MEM_CGROUP_TARGET_NUMAINFO,
+	MEM_CGROUP_NTARGETS,
+};
+
+struct mem_cgroup_stat_cpu {
+	long count[MEM_CGROUP_STAT_NSTATS];
+	unsigned long events[MEMCG_NR_EVENTS];
+	unsigned long nr_page_events;
+	unsigned long targets[MEM_CGROUP_NTARGETS];
+};
+
+struct reclaim_iter {
+	struct mem_cgroup *position;
+	/* scan generation, increased every round-trip */
+	unsigned int generation;
+};
+
+/*
+ * per-zone information in memory controller.
+ */
+struct mem_cgroup_per_zone {
+	struct lruvec		lruvec;
+	unsigned long		lru_size[NR_LRU_LISTS];
+
+	struct reclaim_iter	iter[DEF_PRIORITY + 1];
+
+	struct rb_node		tree_node;	/* RB tree node */
+	unsigned long		usage_in_excess;/* Set to the value by which */
+						/* the soft limit is exceeded*/
+	bool			on_tree;
+	struct mem_cgroup	*memcg;		/* Back pointer, we cannot */
+						/* use container_of	   */
+};
+
+struct mem_cgroup_per_node {
+	struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES];
+};
+
+struct mem_cgroup_threshold {
+	struct eventfd_ctx *eventfd;
+	unsigned long threshold;
+};
+
+/* For threshold */
+struct mem_cgroup_threshold_ary {
+	/* An array index points to threshold just below or equal to usage. */
+	int current_threshold;
+	/* Size of entries[] */
+	unsigned int size;
+	/* Array of thresholds */
+	struct mem_cgroup_threshold entries[0];
+};
+
+struct mem_cgroup_thresholds {
+	/* Primary thresholds array */
+	struct mem_cgroup_threshold_ary *primary;
+	/*
+	 * Spare threshold array.
+	 * This is needed to make mem_cgroup_unregister_event() "never fail".
+	 * It must be able to store at least primary->size - 1 entries.
+	 */
+	struct mem_cgroup_threshold_ary *spare;
+};
+
+/*
+ * Bits in struct cg_proto.flags
+ */
+enum cg_proto_flags {
+	/* Currently active and new sockets should be assigned to cgroups */
+	MEMCG_SOCK_ACTIVE,
+	/* It was ever activated; we must disarm static keys on destruction */
+	MEMCG_SOCK_ACTIVATED,
+};
+
+struct cg_proto {
+	struct page_counter	memory_allocated;	/* Current allocated memory. */
+	struct percpu_counter	sockets_allocated;	/* Current number of sockets. */
+	int			memory_pressure;
+	long			sysctl_mem[3];
+	unsigned long		flags;
+	/*
+	 * The memcg field is used to find which memcg we belong to directly.
+	 * Each memcg struct can hold more than one cg_proto, so container_of
+	 * won't really cut it.
+	 *
+	 * The elegant solution would be having an inverse function to
+	 * proto_cgroup in struct proto, but that means polluting the structure
+	 * for everybody, instead of just for memcg users.
+	 */
+	struct mem_cgroup	*memcg;
+};
+
 #ifdef CONFIG_MEMCG
-void mem_cgroup_events(struct mem_cgroup *memcg,
+/*
+ * The memory controller data structure. The memory controller controls both
+ * page cache and RSS per cgroup. We would eventually like to provide
+ * statistics based on the statistics developed by Rik Van Riel for clock-pro,
+ * to help the administrator determine what knobs to tune.
+ */
+struct mem_cgroup {
+	struct cgroup_subsys_state css;
+
+	/* Accounted resources */
+	struct page_counter memory;
+	struct page_counter memsw;
+	struct page_counter kmem;
+
+	/* Normal memory consumption range */
+	unsigned long low;
+	unsigned long high;
+
+	unsigned long soft_limit;
+
+	/* vmpressure notifications */
+	struct vmpressure vmpressure;
+
+	/* css_online() has been completed */
+	int initialized;
+
+	/*
+	 * Should the accounting and control be hierarchical, per subtree?
+	 */
+	bool use_hierarchy;
+
+	bool		oom_lock;
+	atomic_t	under_oom;
+	atomic_t	oom_wakeups;
+
+	int	swappiness;
+	/* OOM-Killer disable */
+	int		oom_kill_disable;
+
+	/* protect arrays of thresholds */
+	struct mutex thresholds_lock;
+
+	/* thresholds for memory usage. RCU-protected */
+	struct mem_cgroup_thresholds thresholds;
+
+	/* thresholds for mem+swap usage. RCU-protected */
+	struct mem_cgroup_thresholds memsw_thresholds;
+
+	/* For oom notifier event fd */
+	struct list_head oom_notify;
+
+	/*
+	 * Should we move charges of a task when a task is moved into this
+	 * mem_cgroup ? And what type of charges should we move ?
+	 */
+	unsigned long move_charge_at_immigrate;
+	/*
+	 * set > 0 if pages under this cgroup are moving to other cgroup.
+	 */
+	atomic_t		moving_account;
+	/* taken only while moving_account > 0 */
+	spinlock_t		move_lock;
+	struct task_struct	*move_lock_task;
+	unsigned long		move_lock_flags;
+	/*
+	 * percpu counter.
+	 */
+	struct mem_cgroup_stat_cpu __percpu *stat;
+	/*
+	 * used when a cpu is offlined or other synchronizations
+	 * See mem_cgroup_read_stat().
+	 */
+	struct mem_cgroup_stat_cpu nocpu_base;
+	spinlock_t pcp_counter_lock;
+
+#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
+	struct cg_proto tcp_mem;
+#endif
+#if defined(CONFIG_MEMCG_KMEM)
+        /* Index in the kmem_cache->memcg_params.memcg_caches array */
+	int kmemcg_id;
+	bool kmem_acct_activated;
+	bool kmem_acct_active;
+#endif
+
+	int last_scanned_node;
+#if MAX_NUMNODES > 1
+	nodemask_t	scan_nodes;
+	atomic_t	numainfo_events;
+	atomic_t	numainfo_updating;
+#endif
+
+	/* List of events which userspace want to receive */
+	struct list_head event_list;
+	spinlock_t event_list_lock;
+
+	struct mem_cgroup_per_node *nodeinfo[0];
+	/* WARNING: nodeinfo must be the last member here */
+};
+
+/**
+ * mem_cgroup_events - count memory events against a cgroup
+ * @memcg: the memory cgroup
+ * @idx: the event index
+ * @nr: the number of events to account for
+ */
+static inline void mem_cgroup_events(struct mem_cgroup *memcg,
 		       enum mem_cgroup_events_index idx,
-		       unsigned int nr);
+		       unsigned int nr)
+{
+	this_cpu_add(memcg->stat->events[idx], nr);
+}
 
 bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg);
 
@@ -87,15 +299,31 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
 struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
 struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
 
-bool mem_cgroup_is_descendant(struct mem_cgroup *memcg,
-			      struct mem_cgroup *root);
 bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
 
 extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
 extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 
 extern struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);
-extern struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css);
+static inline
+struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css)
+{
+	return css ? container_of(css, struct mem_cgroup, css) : NULL;
+}
+
+struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
+				   struct mem_cgroup *,
+				   struct mem_cgroup_reclaim_cookie *);
+void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
+
+static inline bool mem_cgroup_is_descendant(struct mem_cgroup *memcg,
+			      struct mem_cgroup *root)
+{
+	if (root == memcg)
+		return true;
+	if (!root->use_hierarchy)
+		return false;
+	return cgroup_is_descendant(memcg->css.cgroup, root->css.cgroup);
+}
 
 static inline bool mm_match_cgroup(struct mm_struct *mm,
 				   struct mem_cgroup *memcg)
@@ -111,21 +339,63 @@ static inline bool mm_match_cgroup(struct mm_struct *mm,
 	return match;
 }
 
-extern struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *memcg);
-
-struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
-				   struct mem_cgroup *,
-				   struct mem_cgroup_reclaim_cookie *);
-void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
+static inline bool mem_cgroup_disabled(void)
+{
+	if (memory_cgrp_subsys.disabled)
+		return true;
+	return false;
+}
 
 /*
  * For memory reclaim.
  */
-int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec);
-bool mem_cgroup_lruvec_online(struct lruvec *lruvec);
 int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
-unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list);
-void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int);
+
+void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
+		int nr_pages);
+
+static inline bool mem_cgroup_lruvec_online(struct lruvec *lruvec)
+{
+	struct mem_cgroup_per_zone *mz;
+	struct mem_cgroup *memcg;
+
+	if (mem_cgroup_disabled())
+		return true;
+
+	mz = container_of(lruvec, struct mem_cgroup_per_zone, lruvec);
+	memcg = mz->memcg;
+
+	return !!(memcg->css.flags & CSS_ONLINE);
+}
+
+static inline
+unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list lru)
+{
+	struct mem_cgroup_per_zone *mz;
+
+	mz = container_of(lruvec, struct mem_cgroup_per_zone, lruvec);
+	return mz->lru_size[lru];
+}
+
+static inline int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
+{
+	unsigned long inactive_ratio;
+	unsigned long inactive;
+	unsigned long active;
+	unsigned long gb;
+
+	inactive = mem_cgroup_get_lru_size(lruvec, LRU_INACTIVE_ANON);
+	active = mem_cgroup_get_lru_size(lruvec, LRU_ACTIVE_ANON);
+
+	gb = (inactive + active) >> (30 - PAGE_SHIFT);
+	if (gb)
+		inactive_ratio = int_sqrt(10 * gb);
+	else
+		inactive_ratio = 1;
+
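+	/*
+	 * Example: with 4GB of anon pages gb = 4, so inactive_ratio =
+	 * int_sqrt(40) = 6 and inactive anon is considered low while it
+	 * makes up less than ~1/7 of the anon LRU pages.
+	 */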
+	return inactive * inactive_ratio < active;
+}
+
 extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 					struct task_struct *p);
 
@@ -152,18 +422,26 @@ bool mem_cgroup_oom_synchronize(bool wait);
 extern int do_swap_account;
 #endif
 
-static inline bool mem_cgroup_disabled(void)
-{
-	if (memory_cgrp_subsys.disabled)
-		return true;
-	return false;
-}
-
 struct mem_cgroup *mem_cgroup_begin_page_stat(struct page *page);
-void mem_cgroup_update_page_stat(struct mem_cgroup *memcg,
-				 enum mem_cgroup_stat_index idx, int val);
 void mem_cgroup_end_page_stat(struct mem_cgroup *memcg);
 
+/**
+ * mem_cgroup_update_page_stat - update page state statistics
+ * @memcg: memcg to account against
+ * @idx: page state item to account
+ * @val: number of pages (positive or negative)
+ *
+ * See mem_cgroup_begin_page_stat() for locking requirements.
+ */
+static inline void mem_cgroup_update_page_stat(struct mem_cgroup *memcg,
+				 enum mem_cgroup_stat_index idx, int val)
+{
+	VM_BUG_ON(!rcu_read_lock_held());
+
+	if (memcg)
+		this_cpu_add(memcg->stat->count[idx], val);
+}
+
 static inline void mem_cgroup_inc_page_stat(struct mem_cgroup *memcg,
 					    enum mem_cgroup_stat_index idx)
 {
@@ -180,13 +458,31 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask,
 						unsigned long *total_scanned);
 
-void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx);
 static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
 					     enum vm_event_item idx)
 {
+	struct mem_cgroup *memcg;
+
 	if (mem_cgroup_disabled())
 		return;
-	__mem_cgroup_count_vm_event(mm, idx);
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
+	if (unlikely(!memcg))
+		goto out;
+
+	switch (idx) {
+	case PGFAULT:
+		this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PGFAULT]);
+		break;
+	case PGMAJFAULT:
+		this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PGMAJFAULT]);
+		break;
+	default:
+		BUG();
+	}
+out:
+	rcu_read_unlock();
 }
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 void mem_cgroup_split_huge_fixup(struct page *head);
@@ -269,12 +565,6 @@ static inline bool task_in_mem_cgroup(struct task_struct *task,
 	return true;
 }
 
-static inline struct cgroup_subsys_state
-		*mem_cgroup_css(struct mem_cgroup *memcg)
-{
-	return NULL;
-}
-
 static inline struct mem_cgroup *
 mem_cgroup_iter(struct mem_cgroup *root,
 		struct mem_cgroup *prev,
@@ -434,7 +724,15 @@ void __memcg_kmem_commit_charge(struct page *page,
 				       struct mem_cgroup *memcg, int order);
 void __memcg_kmem_uncharge_pages(struct page *page, int order);
 
-int memcg_cache_id(struct mem_cgroup *memcg);
+/*
+ * helper for accessing a memcg's index. It will be used as an index in the
+ * child cache array in kmem_cache, and also to derive its name. This function
+ * will return -1 when this is not a kmem-limited memcg.
+ */
+static inline int memcg_cache_id(struct mem_cgroup *memcg)
+{
+	return memcg ? memcg->kmemcg_id : -1;
+}
 
 struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep);
 void __memcg_kmem_put_cache(struct kmem_cache *cachep);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0428e4c84e1d..4c3f56a831ac 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -352,7 +352,15 @@ extern void check_move_unevictable_pages(struct page **, int nr_pages);
 extern int kswapd_run(int nid);
 extern void kswapd_stop(int nid);
 #ifdef CONFIG_MEMCG
-extern int mem_cgroup_swappiness(struct mem_cgroup *mem);
+static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
+{
+	/* root ? */
+	if (mem_cgroup_disabled() || !memcg->css.parent)
+		return vm_swappiness;
+
+	return memcg->swappiness;
+}
+
 #else
 static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
 {
diff --git a/include/net/sock.h b/include/net/sock.h
index e4079c28e6b8..75fc35b1b774 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1041,34 +1041,6 @@ struct proto {
 #endif
 };
 
-/*
- * Bits in struct cg_proto.flags
- */
-enum cg_proto_flags {
-	/* Currently active and new sockets should be assigned to cgroups */
-	MEMCG_SOCK_ACTIVE,
-	/* It was ever activated; we must disarm static keys on destruction */
-	MEMCG_SOCK_ACTIVATED,
-};
-
-struct cg_proto {
-	struct page_counter	memory_allocated;	/* Current allocated memory. */
-	struct percpu_counter	sockets_allocated;	/* Current number of sockets. */
-	int			memory_pressure;
-	long			sysctl_mem[3];
-	unsigned long		flags;
-	/*
-	 * memcg field is used to find which memcg we belong directly
-	 * Each memcg struct can hold more than one cg_proto, so container_of
-	 * won't really cut.
-	 *
-	 * The elegant solution would be having an inverse function to
-	 * proto_cgroup in struct proto, but that means polluting the structure
-	 * for everybody, instead of just for memcg users.
-	 */
-	struct mem_cgroup	*memcg;
-};
-
 int proto_register(struct proto *prot, int alloc_slab);
 void proto_unregister(struct proto *prot);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5fd273d22714..198ce919911d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -109,56 +109,10 @@ static const char * const mem_cgroup_lru_names[] = {
 	"unevictable",
 };
 
-/*
- * Per memcg event counter is incremented at every pagein/pageout. With THP,
- * it will be incremated by the number of pages. This counter is used for
- * for trigger some periodic events. This is straightforward and better
- * than using jiffies etc. to handle periodic memcg event.
- */
-enum mem_cgroup_events_target {
-	MEM_CGROUP_TARGET_THRESH,
-	MEM_CGROUP_TARGET_SOFTLIMIT,
-	MEM_CGROUP_TARGET_NUMAINFO,
-	MEM_CGROUP_NTARGETS,
-};
 #define THRESHOLDS_EVENTS_TARGET 128
 #define SOFTLIMIT_EVENTS_TARGET 1024
 #define NUMAINFO_EVENTS_TARGET	1024
 
-struct mem_cgroup_stat_cpu {
-	long count[MEM_CGROUP_STAT_NSTATS];
-	unsigned long events[MEMCG_NR_EVENTS];
-	unsigned long nr_page_events;
-	unsigned long targets[MEM_CGROUP_NTARGETS];
-};
-
-struct reclaim_iter {
-	struct mem_cgroup *position;
-	/* scan generation, increased every round-trip */
-	unsigned int generation;
-};
-
-/*
- * per-zone information in memory controller.
- */
-struct mem_cgroup_per_zone {
-	struct lruvec		lruvec;
-	unsigned long		lru_size[NR_LRU_LISTS];
-
-	struct reclaim_iter	iter[DEF_PRIORITY + 1];
-
-	struct rb_node		tree_node;	/* RB tree node */
-	unsigned long		usage_in_excess;/* Set to the value by which */
-						/* the soft limit is exceeded*/
-	bool			on_tree;
-	struct mem_cgroup	*memcg;		/* Back pointer, we cannot */
-						/* use container_of	   */
-};
-
-struct mem_cgroup_per_node {
-	struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES];
-};
-
 /*
  * Cgroups above their limits are maintained in a RB-Tree, independent of
  * their hierarchy representation
@@ -179,32 +133,6 @@ struct mem_cgroup_tree {
 
 static struct mem_cgroup_tree soft_limit_tree __read_mostly;
 
-struct mem_cgroup_threshold {
-	struct eventfd_ctx *eventfd;
-	unsigned long threshold;
-};
-
-/* For threshold */
-struct mem_cgroup_threshold_ary {
-	/* An array index points to threshold just below or equal to usage. */
-	int current_threshold;
-	/* Size of entries[] */
-	unsigned int size;
-	/* Array of thresholds */
-	struct mem_cgroup_threshold entries[0];
-};
-
-struct mem_cgroup_thresholds {
-	/* Primary thresholds array */
-	struct mem_cgroup_threshold_ary *primary;
-	/*
-	 * Spare threshold array.
-	 * This is needed to make mem_cgroup_unregister_event() "never fail".
-	 * It must be able to store at least primary->size - 1 entries.
-	 */
-	struct mem_cgroup_threshold_ary *spare;
-};
-
 /* for OOM */
 struct mem_cgroup_eventfd_list {
 	struct list_head list;
@@ -254,106 +182,6 @@ struct mem_cgroup_event {
 static void mem_cgroup_threshold(struct mem_cgroup *memcg);
 static void mem_cgroup_oom_notify(struct mem_cgroup *memcg);
 
-/*
- * The memory controller data structure. The memory controller controls both
- * page cache and RSS per cgroup. We would eventually like to provide
- * statistics based on the statistics developed by Rik Van Riel for clock-pro,
- * to help the administrator determine what knobs to tune.
- */
-struct mem_cgroup {
-	struct cgroup_subsys_state css;
-
-	/* Accounted resources */
-	struct page_counter memory;
-	struct page_counter memsw;
-	struct page_counter kmem;
-
-	/* Normal memory consumption range */
-	unsigned long low;
-	unsigned long high;
-
-	unsigned long soft_limit;
-
-	/* vmpressure notifications */
-	struct vmpressure vmpressure;
-
-	/* css_online() has been completed */
-	int initialized;
-
-	/*
-	 * Should the accounting and control be hierarchical, per subtree?
-	 */
-	bool use_hierarchy;
-
-	bool		oom_lock;
-	atomic_t	under_oom;
-	atomic_t	oom_wakeups;
-
-	int	swappiness;
-	/* OOM-Killer disable */
-	int		oom_kill_disable;
-
-	/* protect arrays of thresholds */
-	struct mutex thresholds_lock;
-
-	/* thresholds for memory usage. RCU-protected */
-	struct mem_cgroup_thresholds thresholds;
-
-	/* thresholds for mem+swap usage. RCU-protected */
-	struct mem_cgroup_thresholds memsw_thresholds;
-
-	/* For oom notifier event fd */
-	struct list_head oom_notify;
-
-	/*
-	 * Should we move charges of a task when a task is moved into this
-	 * mem_cgroup ? And what type of charges should we move ?
-	 */
-	unsigned long move_charge_at_immigrate;
-	/*
-	 * set > 0 if pages under this cgroup are moving to other cgroup.
-	 */
-	atomic_t		moving_account;
-	/* taken only while moving_account > 0 */
-	spinlock_t		move_lock;
-	struct task_struct	*move_lock_task;
-	unsigned long		move_lock_flags;
-	/*
-	 * percpu counter.
-	 */
-	struct mem_cgroup_stat_cpu __percpu *stat;
-	/*
-	 * used when a cpu is offlined or other synchronizations
-	 * See mem_cgroup_read_stat().
-	 */
-	struct mem_cgroup_stat_cpu nocpu_base;
-	spinlock_t pcp_counter_lock;
-
-#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
-	struct cg_proto tcp_mem;
-#endif
-#if defined(CONFIG_MEMCG_KMEM)
-        /* Index in the kmem_cache->memcg_params.memcg_caches array */
-	int kmemcg_id;
-	bool kmem_acct_activated;
-	bool kmem_acct_active;
-#endif
-
-	int last_scanned_node;
-#if MAX_NUMNODES > 1
-	nodemask_t	scan_nodes;
-	atomic_t	numainfo_events;
-	atomic_t	numainfo_updating;
-#endif
-
-	/* List of events which userspace want to receive */
-	struct list_head event_list;
-	spinlock_t event_list_lock;
-
-	struct mem_cgroup_per_node *nodeinfo[0];
-	/* WARNING: nodeinfo must be the last member here */
-};
-
 #ifdef CONFIG_MEMCG_KMEM
 bool memcg_kmem_is_active(struct mem_cgroup *memcg)
 {
@@ -421,11 +249,6 @@ enum res_type {
  */
 static DEFINE_MUTEX(memcg_create_mutex);
 
-struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s)
-{
-	return s ? container_of(s, struct mem_cgroup, css) : NULL;
-}
-
 /* Some nice accessors for the vmpressure. */
 struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg)
 {
@@ -591,11 +414,6 @@ mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct zone *zone)
 	return &memcg->nodeinfo[nid]->zoneinfo[zid];
 }
 
-struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *memcg)
-{
-	return &memcg->css;
-}
-
 static struct mem_cgroup_per_zone *
 mem_cgroup_page_zoneinfo(struct mem_cgroup *memcg, struct page *page)
 {
@@ -855,14 +673,6 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
 	__this_cpu_add(memcg->stat->nr_page_events, nr_pages);
 }
 
-unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list lru)
-{
-	struct mem_cgroup_per_zone *mz;
-
-	mz = container_of(lruvec, struct mem_cgroup_per_zone, lruvec);
-	return mz->lru_size[lru];
-}
-
 static unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
 						  int nid,
 						  unsigned int lru_mask)
@@ -1152,30 +962,6 @@ void mem_cgroup_iter_break(struct mem_cgroup *root,
 	     iter != NULL;				\
 	     iter = mem_cgroup_iter(NULL, iter, NULL))
 
-void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
-{
-	struct mem_cgroup *memcg;
-
-	rcu_read_lock();
-	memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
-	if (unlikely(!memcg))
-		goto out;
-
-	switch (idx) {
-	case PGFAULT:
-		this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PGFAULT]);
-		break;
-	case PGMAJFAULT:
-		this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PGMAJFAULT]);
-		break;
-	default:
-		BUG();
-	}
-out:
-	rcu_read_unlock();
-}
-EXPORT_SYMBOL(__mem_cgroup_count_vm_event);
-
 /**
  * mem_cgroup_zone_lruvec - get the lru list vector for a zone and memcg
  * @zone: zone of the wanted lruvec
@@ -1274,15 +1060,6 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
 	VM_BUG_ON((long)(*lru_size) < 0);
 }
 
-bool mem_cgroup_is_descendant(struct mem_cgroup *memcg, struct mem_cgroup *root)
-{
-	if (root == memcg)
-		return true;
-	if (!root->use_hierarchy)
-		return false;
-	return cgroup_is_descendant(memcg->css.cgroup, root->css.cgroup);
-}
-
 bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg)
 {
 	struct mem_cgroup *task_memcg;
@@ -1309,39 +1086,6 @@ bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg)
 	return ret;
 }
 
-int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
-{
-	unsigned long inactive_ratio;
-	unsigned long inactive;
-	unsigned long active;
-	unsigned long gb;
-
-	inactive = mem_cgroup_get_lru_size(lruvec, LRU_INACTIVE_ANON);
-	active = mem_cgroup_get_lru_size(lruvec, LRU_ACTIVE_ANON);
-
-	gb = (inactive + active) >> (30 - PAGE_SHIFT);
-	if (gb)
-		inactive_ratio = int_sqrt(10 * gb);
-	else
-		inactive_ratio = 1;
-
-	return inactive * inactive_ratio < active;
-}
-
-bool mem_cgroup_lruvec_online(struct lruvec *lruvec)
-{
-	struct mem_cgroup_per_zone *mz;
-	struct mem_cgroup *memcg;
-
-	if (mem_cgroup_disabled())
-		return true;
-
-	mz = container_of(lruvec, struct mem_cgroup_per_zone, lruvec);
-	memcg = mz->memcg;
-
-	return !!(memcg->css.flags & CSS_ONLINE);
-}
-
 #define mem_cgroup_from_counter(counter, member)	\
 	container_of(counter, struct mem_cgroup, member)
 
@@ -1373,15 +1117,6 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
 	return margin;
 }
 
-int mem_cgroup_swappiness(struct mem_cgroup *memcg)
-{
-	/* root ? */
-	if (mem_cgroup_disabled() || !memcg->css.parent)
-		return vm_swappiness;
-
-	return memcg->swappiness;
-}
-
 /*
  * A routine for checking "mem" is under move_account() or not.
  *
@@ -2034,23 +1769,6 @@ void mem_cgroup_end_page_stat(struct mem_cgroup *memcg)
 	rcu_read_unlock();
 }
 
-/**
- * mem_cgroup_update_page_stat - update page state statistics
- * @memcg: memcg to account against
- * @idx: page state item to account
- * @val: number of pages (positive or negative)
- *
- * See mem_cgroup_begin_page_stat() for locking requirements.
- */
-void mem_cgroup_update_page_stat(struct mem_cgroup *memcg,
-				 enum mem_cgroup_stat_index idx, int val)
-{
-	VM_BUG_ON(!rcu_read_lock_held());
-
-	if (memcg)
-		this_cpu_add(memcg->stat->count[idx], val);
-}
-
 /*
  * size of first charge trial. "32" comes from vmscan.c's magic value.
  * TODO: maybe necessary to use big numbers in big irons.
@@ -2502,16 +2220,6 @@ void memcg_uncharge_kmem(struct mem_cgroup *memcg, unsigned long nr_pages)
 	css_put_many(&memcg->css, nr_pages);
 }
 
-/*
- * helper for acessing a memcg's index. It will be used as an index in the
- * child cache array in kmem_cache, and also to derive its name. This function
- * will return -1 when this is not a kmem-limited memcg.
- */
-int memcg_cache_id(struct mem_cgroup *memcg)
-{
-	return memcg ? memcg->kmemcg_id : -1;
-}
-
 static int memcg_alloc_cache_id(void)
 {
 	int id, size;
@@ -5393,19 +5101,6 @@ struct cgroup_subsys memory_cgrp_subsys = {
 };
 
 /**
- * mem_cgroup_events - count memory events against a cgroup
- * @memcg: the memory cgroup
- * @idx: the event index
- * @nr: the number of events to account for
- */
-void mem_cgroup_events(struct mem_cgroup *memcg,
-		       enum mem_cgroup_events_index idx,
-		       unsigned int nr)
-{
-	this_cpu_add(memcg->stat->events[idx], nr);
-}
-
-/**
  * mem_cgroup_low - check if memory consumption is below the normal range
  * @root: the highest ancestor to consider
  * @memcg: the memory cgroup to check
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 1cf7f2988422..ef33ccf37224 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -146,7 +146,7 @@ static int hwpoison_filter_task(struct page *p)
 	if (!mem)
 		return -EINVAL;
 
-	css = mem_cgroup_css(mem);
+	css = &mem->css;
 	ino = cgroup_ino(css->cgroup);
 	css_put(css);
 
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 8873985f905e..ec3f68b14b87 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -501,7 +501,7 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
 			     struct kmem_cache *root_cache)
 {
 	static char memcg_name_buf[NAME_MAX + 1]; /* protected by slab_mutex */
-	struct cgroup_subsys_state *css = mem_cgroup_css(memcg);
+	struct cgroup_subsys_state *css = &memcg->css;
 	struct memcg_cache_array *arr;
 	struct kmem_cache *s = NULL;
 	char *cache_name;
-- 
2.1.4


* [RFC -v2 2/7] memcg: get rid of extern for functions in memcontrol.h
From: Michal Hocko @ 2015-05-29 11:57 UTC
  To: linux-mm
  Cc: Johannes Weiner, Andrew Morton, Tejun Heo, Oleg Nesterov,
	Vladimir Davydov, Greg Thelen, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

Most of the exported functions in this header are not marked extern, so
change the rest to follow the same style.
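
For reference, the two styles in question (a generic illustration;
extern is redundant on function declarations, so the shorter form
wins):

	extern void memcg_get_cache_ids(void);	/* old style */
	void memcg_get_cache_ids(void);		/* new style */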

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/memcontrol.h | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3d9d7ff0d93e..c3993b1728f7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -301,10 +301,10 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
 
 bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
 
-extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
-extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
+struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
+struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 
-extern struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);
+struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);
 static inline
 struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css)
 {
 	return css ? container_of(css, struct mem_cgroup, css) : NULL;
@@ -396,8 +396,8 @@ static inline int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
 	return inactive * inactive_ratio < active;
 }
 
-extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
-					struct task_struct *p);
+void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
+				struct task_struct *p);
 
 static inline void mem_cgroup_oom_enable(void)
 {
@@ -689,8 +689,8 @@ static inline void sock_release_memcg(struct sock *sk)
 extern struct static_key memcg_kmem_enabled_key;
 
 extern int memcg_nr_cache_ids;
-extern void memcg_get_cache_ids(void);
-extern void memcg_put_cache_ids(void);
+void memcg_get_cache_ids(void);
+void memcg_put_cache_ids(void);
 
 /*
  * Helper macro to loop through all memcg-specific caches. Callers must still
-- 
2.1.4


* [RFC -v2 3/7] memcg, mm: move mem_cgroup_select_victim_node into vmscan
From: Michal Hocko @ 2015-05-29 11:57 UTC
  To: linux-mm
  Cc: Johannes Weiner, Andrew Morton, Tejun Heo, Oleg Nesterov,
	Vladimir Davydov, Greg Thelen, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

We currently have only one caller of mem_cgroup_select_victim_node. It
sits in mm/vmscan.c and is already wrapped in a CONFIG_MEMCG ifdef. Now
that struct mem_cgroup is visible outside of mm/memcontrol.c we can
move the function and its dependencies there. This even shrinks the
code size by a few bytes:

   text    data     bss     dec     hex filename
 478509   65806   26384  570699   8b54b mm/built-in.o.before
 478445   65806   26384  570635   8b50b mm/built-in.o.after

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 include/linux/memcontrol.h |  6 ++-
 mm/memcontrol.c            | 99 +---------------------------------------------
 mm/vmscan.c                | 96 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 101 insertions(+), 100 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c3993b1728f7..820067ae462e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -349,11 +349,13 @@ static inline bool mem_cgroup_disabled(void)
 /*
  * For memory reclaim.
  */
-int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
-
 void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
 		int nr_pages);
 
+unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
+						  int nid,
+						  unsigned int lru_mask);
+
 static inline bool mem_cgroup_lruvec_online(struct lruvec *lruvec)
 {
 	struct mem_cgroup_per_zone *mz;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 198ce919911d..54600a6d23bc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -673,7 +673,7 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
 	__this_cpu_add(memcg->stat->nr_page_events, nr_pages);
 }
 
-static unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
+unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
 						  int nid,
 						  unsigned int lru_mask)
 {
@@ -1331,103 +1331,6 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	mutex_unlock(&oom_lock);
 }
 
-#if MAX_NUMNODES > 1
-
-/**
- * test_mem_cgroup_node_reclaimable
- * @memcg: the target memcg
- * @nid: the node ID to be checked.
- * @noswap : specify true here if the user wants flle only information.
- *
- * This function returns whether the specified memcg contains any
- * reclaimable pages on a node. Returns true if there are any reclaimable
- * pages in the node.
- */
-static bool test_mem_cgroup_node_reclaimable(struct mem_cgroup *memcg,
-		int nid, bool noswap)
-{
-	if (mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL_FILE))
-		return true;
-	if (noswap || !total_swap_pages)
-		return false;
-	if (mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL_ANON))
-		return true;
-	return false;
-
-}
-
-/*
- * Always updating the nodemask is not very good - even if we have an empty
- * list or the wrong list here, we can start from some node and traverse all
- * nodes based on the zonelist. So update the list loosely once per 10 secs.
- *
- */
-static void mem_cgroup_may_update_nodemask(struct mem_cgroup *memcg)
-{
-	int nid;
-	/*
-	 * numainfo_events > 0 means there was at least NUMAINFO_EVENTS_TARGET
-	 * pagein/pageout changes since the last update.
-	 */
-	if (!atomic_read(&memcg->numainfo_events))
-		return;
-	if (atomic_inc_return(&memcg->numainfo_updating) > 1)
-		return;
-
-	/* make a nodemask where this memcg uses memory from */
-	memcg->scan_nodes = node_states[N_MEMORY];
-
-	for_each_node_mask(nid, node_states[N_MEMORY]) {
-
-		if (!test_mem_cgroup_node_reclaimable(memcg, nid, false))
-			node_clear(nid, memcg->scan_nodes);
-	}
-
-	atomic_set(&memcg->numainfo_events, 0);
-	atomic_set(&memcg->numainfo_updating, 0);
-}
-
-/*
- * Selecting a node where we start reclaim from. Because what we need is just
- * reducing usage counter, start from anywhere is O,K. Considering
- * memory reclaim from current node, there are pros. and cons.
- *
- * Freeing memory from current node means freeing memory from a node which
- * we'll use or we've used. So, it may make LRU bad. And if several threads
- * hit limits, it will see a contention on a node. But freeing from remote
- * node means more costs for memory reclaim because of memory latency.
- *
- * Now, we use round-robin. Better algorithm is welcomed.
- */
-int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
-{
-	int node;
-
-	mem_cgroup_may_update_nodemask(memcg);
-	node = memcg->last_scanned_node;
-
-	node = next_node(node, memcg->scan_nodes);
-	if (node == MAX_NUMNODES)
-		node = first_node(memcg->scan_nodes);
-	/*
-	 * We call this when we hit limit, not when pages are added to LRU.
-	 * No LRU may hold pages because all pages are UNEVICTABLE or
-	 * memcg is too small and all pages are not on LRU. In that case,
-	 * we use curret node.
-	 */
-	if (unlikely(node == MAX_NUMNODES))
-		node = numa_node_id();
-
-	memcg->last_scanned_node = node;
-	return node;
-}
-#else
-int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
-{
-	return 0;
-}
-#endif
-
 static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
 				   struct zone *zone,
 				   gfp_t gfp_mask,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 37e90db1520b..3b6e61ef5183 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2886,6 +2886,102 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
 	return sc.nr_reclaimed;
 }
 
+#if MAX_NUMNODES > 1
+
+/**
+ * test_mem_cgroup_node_reclaimable
+ * @memcg: the target memcg
+ * @nid: the node ID to be checked.
+ * @noswap: specify true here if the user wants file-only information.
+ *
+ * This function returns whether the specified memcg contains any
+ * reclaimable pages on a node. Returns true if there are any reclaimable
+ * pages in the node.
+ */
+static bool test_mem_cgroup_node_reclaimable(struct mem_cgroup *memcg,
+		int nid, bool noswap)
+{
+	if (mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL_FILE))
+		return true;
+	if (noswap || !total_swap_pages)
+		return false;
+	if (mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL_ANON))
+		return true;
+	return false;
+}
+
+/*
+ * Always updating the nodemask is not very good - even if we have an empty
+ * list or the wrong list here, we can start from some node and traverse all
+ * nodes based on the zonelist. So update the list loosely once per 10 secs.
+ *
+ */
+static void mem_cgroup_may_update_nodemask(struct mem_cgroup *memcg)
+{
+	int nid;
+	/*
+	 * numainfo_events > 0 means there was at least NUMAINFO_EVENTS_TARGET
+	 * pagein/pageout changes since the last update.
+	 */
+	if (!atomic_read(&memcg->numainfo_events))
+		return;
+	if (atomic_inc_return(&memcg->numainfo_updating) > 1)
+		return;
+
+	/* make a nodemask where this memcg uses memory from */
+	memcg->scan_nodes = node_states[N_MEMORY];
+
+	for_each_node_mask(nid, node_states[N_MEMORY]) {
+
+		if (!test_mem_cgroup_node_reclaimable(memcg, nid, false))
+			node_clear(nid, memcg->scan_nodes);
+	}
+
+	atomic_set(&memcg->numainfo_events, 0);
+	atomic_set(&memcg->numainfo_updating, 0);
+}
+
+/*
+ * Select a node to start reclaim from. Because all we need is to reduce
+ * the usage counter, starting from anywhere is OK. Considering memory
+ * reclaim from the current node, there are pros and cons.
+ *
+ * Freeing memory from the current node means freeing memory from a node
+ * which we'll use or have used, so it may make the LRU bad. And if
+ * several threads hit their limits, they will see contention on a node.
+ * But freeing from a remote node costs more for memory reclaim because
+ * of memory latency.
+ *
+ * For now we use round-robin. A better algorithm is welcome.
+ */
+static int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
+{
+	int node;
+
+	mem_cgroup_may_update_nodemask(memcg);
+	node = memcg->last_scanned_node;
+
+	node = next_node(node, memcg->scan_nodes);
+	if (node == MAX_NUMNODES)
+		node = first_node(memcg->scan_nodes);
+	/*
+	 * We call this when we hit limit, not when pages are added to LRU.
+	 * No LRU may hold pages because all pages are UNEVICTABLE or
+	 * memcg is too small and all pages are not on LRU. In that case,
+	 * we use the current node.
+	 */
+	if (unlikely(node == MAX_NUMNODES))
+		node = numa_node_id();
+
+	memcg->last_scanned_node = node;
+	return node;
+}
+#else
+static int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
+{
+	return 0;
+}
+#endif
+
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 					   unsigned long nr_pages,
 					   gfp_t gfp_mask,
-- 
2.1.4


* [RFC -v2 4/7] memcg: restructure mem_cgroup_can_attach()
From: Michal Hocko @ 2015-05-29 11:57 UTC
  To: linux-mm
  Cc: Johannes Weiner, Andrew Morton, Tejun Heo, Oleg Nesterov,
	Vladimir Davydov, Greg Thelen, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

From: Tejun Heo <tj@kernel.org>

Restructure it to lower the nesting level and to help the planned
threadgroup leader iteration changes.

This is pure reorganization.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c | 61 ++++++++++++++++++++++++++++++---------------------------
 1 file changed, 32 insertions(+), 29 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 54600a6d23bc..5f0adf6e2332 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4612,10 +4612,12 @@ static void mem_cgroup_clear_mc(void)
 static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
 				 struct cgroup_taskset *tset)
 {
-	struct task_struct *p = cgroup_taskset_first(tset);
-	int ret = 0;
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+	struct mem_cgroup *from;
+	struct task_struct *p;
+	struct mm_struct *mm;
 	unsigned long move_flags;
+	int ret = 0;
 
 	/*
 	 * We are now commited to this value whatever it is. Changes in this
@@ -4623,36 +4625,37 @@ static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
 	 * So we need to save it, and keep it going.
 	 */
 	move_flags = READ_ONCE(memcg->move_charge_at_immigrate);
-	if (move_flags) {
-		struct mm_struct *mm;
-		struct mem_cgroup *from = mem_cgroup_from_task(p);
+	if (!move_flags)
+		return 0;
 
-		VM_BUG_ON(from == memcg);
+	p = cgroup_taskset_first(tset);
+	from = mem_cgroup_from_task(p);
 
-		mm = get_task_mm(p);
-		if (!mm)
-			return 0;
-		/* We move charges only when we move a owner of the mm */
-		if (mm->owner == p) {
-			VM_BUG_ON(mc.from);
-			VM_BUG_ON(mc.to);
-			VM_BUG_ON(mc.precharge);
-			VM_BUG_ON(mc.moved_charge);
-			VM_BUG_ON(mc.moved_swap);
-
-			spin_lock(&mc.lock);
-			mc.from = from;
-			mc.to = memcg;
-			mc.flags = move_flags;
-			spin_unlock(&mc.lock);
-			/* We set mc.moving_task later */
-
-			ret = mem_cgroup_precharge_mc(mm);
-			if (ret)
-				mem_cgroup_clear_mc();
-		}
-		mmput(mm);
+	VM_BUG_ON(from == memcg);
+
+	mm = get_task_mm(p);
+	if (!mm)
+		return 0;
+	/* We move charges only when we move an owner of the mm */
+	if (mm->owner == p) {
+		VM_BUG_ON(mc.from);
+		VM_BUG_ON(mc.to);
+		VM_BUG_ON(mc.precharge);
+		VM_BUG_ON(mc.moved_charge);
+		VM_BUG_ON(mc.moved_swap);
+
+		spin_lock(&mc.lock);
+		mc.from = from;
+		mc.to = memcg;
+		mc.flags = move_flags;
+		spin_unlock(&mc.lock);
+		/* We set mc.moving_task later */
+
+		ret = mem_cgroup_precharge_mc(mm);
+		if (ret)
+			mem_cgroup_clear_mc();
 	}
+	mmput(mm);
 	return ret;
 }
 
-- 
2.1.4


* [RFC -v2 5/7] memcg, tcp_kmem: check for cg_proto in sock_update_memcg
From: Michal Hocko @ 2015-05-29 11:57 UTC
  To: linux-mm
  Cc: Johannes Weiner, Andrew Morton, Tejun Heo, Oleg Nesterov,
	Vladimir Davydov, Greg Thelen, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

sk_prot->proto_cgroup is allowed to return NULL but sock_update_memcg
doesn't check for NULL. The function instead relies on the
mem_cgroup_is_root check; we shouldn't get NULL otherwise, because
mem_cgroup_from_task always returns non-NULL.

All other callers check for NULL, and we can safely replace the
mem_cgroup_is_root() check with a cg_proto != NULL check, which is more
straightforward (proto_cgroup already returns NULL for the root memcg).
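
For context, the TCP callback looks roughly like this (a sketch from
memory of the mm/memcontrol.c implementation of that era, not part of
this patch); the root memcg is already filtered out here:

	struct cg_proto *tcp_proto_cgroup(struct mem_cgroup *memcg)
	{
		if (!memcg || mem_cgroup_is_root(memcg))
			return NULL;

		return &memcg->tcp_mem;
	}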

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5f0adf6e2332..1f10f90da4ef 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -320,8 +320,7 @@ void sock_update_memcg(struct sock *sk)
 		rcu_read_lock();
 		memcg = mem_cgroup_from_task(current);
 		cg_proto = sk->sk_prot->proto_cgroup(memcg);
-		if (!mem_cgroup_is_root(memcg) &&
-		    memcg_proto_active(cg_proto) &&
+		if (cg_proto && memcg_proto_active(cg_proto) &&
 		    css_tryget_online(&memcg->css)) {
 			sk->sk_cgrp = cg_proto;
 		}
-- 
2.1.4


* [RFC -v2 6/7] memcg: get rid of mm_struct::owner
From: Michal Hocko @ 2015-05-29 11:57 UTC
  To: linux-mm
  Cc: Johannes Weiner, Andrew Morton, Tejun Heo, Oleg Nesterov,
	Vladimir Davydov, Greg Thelen, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

mm_struct::owner keeps track of the task which is in charge of the
specific mm. This is usually the thread group leader of the process but
there are exotic cases where this doesn't hold.

The most prominent one is when separate tasks (not in the same thread
group) share the address space (by using clone with CLONE_VM without
CLONE_THREAD). The first task will be the owner until it exits.
mm_update_next_owner will then try to find a new owner - a task which
points to the same mm_struct. There is no guarantee a new owner will
be a thread group leader though because the leader for that thread
group might have exited. Even though such a leader will still be around
waiting for the remaining threads from its group, its mm will be NULL
so it cannot be chosen.
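
A self-contained userspace sketch of that exotic case (an illustration
only, not part of the patch):

	#define _GNU_SOURCE
	#include <sched.h>
	#include <signal.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <sys/types.h>
	#include <sys/wait.h>

	static int worker(void *arg)
	{
		/* shares the parent's mm but is not in its thread group */
		return 0;
	}

	int main(void)
	{
		const size_t stack_size = 64 * 1024;
		char *stack = malloc(stack_size);
		pid_t pid;

		if (!stack)
			return 1;
		/* the stack grows down on most architectures: pass the top */
		pid = clone(worker, stack + stack_size,
			    CLONE_VM | SIGCHLD, NULL);
		if (pid < 0) {
			perror("clone");
			return 1;
		}
		/* the first task stays mm->owner until it exits */
		waitpid(pid, NULL, 0);
		free(stack);
		return 0;
	}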

The cgroup migration code, however, assumes only group leaders when
migrating via cgroup.procs (which will be the only mode in the unified
hierarchy API), while mem_cgroup_can_attach considers only those tasks
which are owners of the mm. So we might end up with tasks which cannot
be migrated. mm_update_next_owner could be tweaked to try harder and
use a group leader whenever possible, but this will never be 100%
reliable because all the leaders might be dead. Getting rid of
mm->owner seems like a better and less hacky option.

The whole concept of the mm owner is a bit artificial and too tricky to
get right. All the memcg code needs is to find the struct mem_cgroup
for a given mm_struct, and there are only two events when the
association is either built or changed:
	- a new mm is created - dup_mm resp. exec_mmap - when the memcg
	  is inherited from the oldmm
	- the task associated with the mm is moved to another memcg
So it is much easier to bind the mm_struct to the mem_cgroup directly
rather than indirectly via a task. This is exactly what this patch does.

mm_inherit_memcg and mm_drop_memcg are exported for the core kernel
to bind the old memcg during dup_mm (fork) resp. exec_mmap (exec) and
to release that memcg in mmput after the last reference is dropped and
no task sees the mm anymore. We have to be careful to take a reference
to the memcg->css so that it doesn't vanish from under our feet.

The only remaining part is to catch task migration and change the
association. This is done in mem_cgroup_move_task before charges get
moved, because mem_cgroup_can_attach is too early: other controllers
might still fail and we would have to handle the rollback.

mm->memcg conforms to standard mem_cgroup locking rules. It has to be
used inside rcu_read_{un}lock() and a reference has to be taken before the
unlock if the memcg is supposed to be used outside.
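
A minimal sketch of that pattern (the helper name mm_get_memcg is made
up for illustration; this patch doesn't add it):

	static struct mem_cgroup *mm_get_memcg(struct mm_struct *mm)
	{
		struct mem_cgroup *memcg;

		rcu_read_lock();
		memcg = rcu_dereference(mm->memcg);
		/* pin the css before leaving the RCU section */
		if (memcg && !css_tryget_online(&memcg->css))
			memcg = NULL;
		rcu_read_unlock();

		return memcg;	/* pair a non-NULL result with css_put() */
	}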

Finally, mem_cgroup_can_attach will allow task migration only for
thread group leaders to conform with cgroup core requirements.

Please note that this patch introduces a USER VISIBLE CHANGE OF BEHAVIOR.
Without mm->owner _all_ tasks (group leaders to be precise) associated
with the mm_struct would initiate memcg migration while previously
only the owner of the mm_struct could do that. The original behavior
was awkward, though, because the user task didn't have any means to
find out the current owner (esp. after mm_update_next_owner), so the
migration behavior was not well defined in general.
The new cgroup API (unified hierarchy) will discontinue the tasks
cgroup file, which means that migrating individual threads will no
longer be possible. In such a case having CLONE_VM without CLONE_THREAD
could emulate the thread behavior, but this patch prevents isolating
the memcg controller from the others. Nevertheless I am not convinced
such a use case would really deserve complications on the memcg code
side.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 fs/exec.c                  |   2 +-
 include/linux/memcontrol.h |  57 +++++++++++++++++++++++--
 include/linux/mm_types.h   |  12 +-----
 kernel/exit.c              |  89 ---------------------------------------
 kernel/fork.c              |  10 +----
 mm/debug.c                 |   4 +-
 mm/memcontrol.c            | 101 ++++++++++++++++++++++++++++-----------------
 7 files changed, 122 insertions(+), 153 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 02bfd980a40c..e30c55231879 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -867,7 +867,7 @@ static int exec_mmap(struct mm_struct *mm)
 		up_read(&old_mm->mmap_sem);
 		BUG_ON(active_mm != old_mm);
 		setmax_mm_hiwater_rss(&tsk->signal->maxrss, old_mm);
-		mm_update_next_owner(old_mm);
+		mm_inherit_memcg(mm, old_mm);
 		mmput(old_mm);
 		return 0;
 	}
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 820067ae462e..1020bfd01966 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -271,6 +271,52 @@ struct mem_cgroup {
 };
 
 /**
+ * __mm_set_memcg - Set mm_struct::memcg to a given memcg.
+ * @mm: mm struct
+ * @memcg: mem_cgroup to be used
+ *
+ * Note that this function doesn't clean up the previous mm->memcg.
+ * This should be done by caller when necessary (e.g. when moving
+ * mm from one memcg to another).
+ */
+static inline
+void __mm_set_memcg(struct mm_struct *mm, struct mem_cgroup *memcg)
+{
+	if (memcg)
+		css_get(&memcg->css);
+	rcu_assign_pointer(mm->memcg, memcg);
+}
+
+/**
+ * mm_inherit_memcg - Initialize mm_struct::memcg from an existing mm_struct
+ * @newmm: new mm struct
+ * @oldmm: old mm struct to inherit from
+ *
+ * Should be called for each new mm_struct.
+ */
+static inline
+void mm_inherit_memcg(struct mm_struct *newmm, struct mm_struct *oldmm)
+{
+	struct mem_cgroup *memcg = oldmm->memcg;
+
+	__mm_set_memcg(newmm, memcg);
+}
+
+/**
+ * mm_drop_memcg - drop mm_struct::memcg association
+ * @mm: mm struct
+ *
+ * Should be called after the mm has been removed from all tasks
+ * and before it is freed (e.g. from mmput)
+ */
+static inline void mm_drop_memcg(struct mm_struct *mm)
+{
+	if (mm->memcg)
+		css_put(&mm->memcg->css);
+	mm->memcg = NULL;
+}
+
+/**
  * mem_cgroup_events - count memory events against a cgroup
  * @memcg: the memory cgroup
  * @idx: the event index
@@ -302,7 +348,6 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
 bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
 
 struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
-struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 
 struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);
 static inline
@@ -332,7 +377,7 @@ static inline bool mm_match_cgroup(struct mm_struct *mm,
 	bool match = false;
 
 	rcu_read_lock();
-	task_memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
+	task_memcg = rcu_dereference(mm->memcg);
 	if (task_memcg)
 		match = mem_cgroup_is_descendant(task_memcg, memcg);
 	rcu_read_unlock();
@@ -469,7 +514,7 @@ static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
 		return;
 
 	rcu_read_lock();
-	memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
+	memcg = rcu_dereference(mm->memcg);
 	if (unlikely(!memcg))
 		goto out;
 
@@ -493,6 +538,12 @@ void mem_cgroup_split_huge_fixup(struct page *head);
 #else /* CONFIG_MEMCG */
 struct mem_cgroup;
 
+static inline void mm_inherit_memcg(struct mm_struct *newmm, struct mm_struct *oldmm)
+{
+}
+static inline void mm_drop_memcg(struct mm_struct *mm)
+{
+}
 static inline void mem_cgroup_events(struct mem_cgroup *memcg,
 				     enum mem_cgroup_events_index idx,
 				     unsigned int nr)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index f6266742ce1f..93dc8cb9c636 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -426,17 +426,7 @@ struct mm_struct {
 	struct kioctx_table __rcu	*ioctx_table;
 #endif
 #ifdef CONFIG_MEMCG
-	/*
-	 * "owner" points to a task that is regarded as the canonical
-	 * user/owner of this mm. All of the following must be true in
-	 * order for it to be changed:
-	 *
-	 * current == mm->owner
-	 * current->mm != mm
-	 * new_owner->mm == mm
-	 * new_owner->alloc_lock is held
-	 */
-	struct task_struct __rcu *owner;
+	struct mem_cgroup __rcu *memcg;
 #endif
 
 	/* store ref to file /proc/<pid>/exe symlink points to */
diff --git a/kernel/exit.c b/kernel/exit.c
index 4089c2fd373e..8f3e5b4c58ce 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -292,94 +292,6 @@ kill_orphaned_pgrp(struct task_struct *tsk, struct task_struct *parent)
 	}
 }
 
-#ifdef CONFIG_MEMCG
-/*
- * A task is exiting.   If it owned this mm, find a new owner for the mm.
- */
-void mm_update_next_owner(struct mm_struct *mm)
-{
-	struct task_struct *c, *g, *p = current;
-
-retry:
-	/*
-	 * If the exiting or execing task is not the owner, it's
-	 * someone else's problem.
-	 */
-	if (mm->owner != p)
-		return;
-	/*
-	 * The current owner is exiting/execing and there are no other
-	 * candidates.  Do not leave the mm pointing to a possibly
-	 * freed task structure.
-	 */
-	if (atomic_read(&mm->mm_users) <= 1) {
-		mm->owner = NULL;
-		return;
-	}
-
-	read_lock(&tasklist_lock);
-	/*
-	 * Search in the children
-	 */
-	list_for_each_entry(c, &p->children, sibling) {
-		if (c->mm == mm)
-			goto assign_new_owner;
-	}
-
-	/*
-	 * Search in the siblings
-	 */
-	list_for_each_entry(c, &p->real_parent->children, sibling) {
-		if (c->mm == mm)
-			goto assign_new_owner;
-	}
-
-	/*
-	 * Search through everything else, we should not get here often.
-	 */
-	for_each_process(g) {
-		if (g->flags & PF_KTHREAD)
-			continue;
-		for_each_thread(g, c) {
-			if (c->mm == mm)
-				goto assign_new_owner;
-			if (c->mm)
-				break;
-		}
-	}
-	read_unlock(&tasklist_lock);
-	/*
-	 * We found no owner yet mm_users > 1: this implies that we are
-	 * most likely racing with swapoff (try_to_unuse()) or /proc or
-	 * ptrace or page migration (get_task_mm()).  Mark owner as NULL.
-	 */
-	mm->owner = NULL;
-	return;
-
-assign_new_owner:
-	BUG_ON(c == p);
-	get_task_struct(c);
-	/*
-	 * The task_lock protects c->mm from changing.
-	 * We always want mm->owner->mm == mm
-	 */
-	task_lock(c);
-	/*
-	 * Delay read_unlock() till we have the task_lock()
-	 * to ensure that c does not slip away underneath us
-	 */
-	read_unlock(&tasklist_lock);
-	if (c->mm != mm) {
-		task_unlock(c);
-		put_task_struct(c);
-		goto retry;
-	}
-	mm->owner = c;
-	task_unlock(c);
-	put_task_struct(c);
-}
-#endif /* CONFIG_MEMCG */
-
 /*
  * Turn us into a lazy TLB process if we
  * aren't already..
@@ -433,7 +345,6 @@ static void exit_mm(struct task_struct *tsk)
 	up_read(&mm->mmap_sem);
 	enter_lazy_tlb(mm, current);
 	task_unlock(tsk);
-	mm_update_next_owner(mm);
 	mmput(mm);
 	if (test_thread_flag(TIF_MEMDIE))
 		exit_oom_victim();
diff --git a/kernel/fork.c b/kernel/fork.c
index 556cc64ae0c4..1541f90de227 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -570,13 +570,6 @@ static void mm_init_aio(struct mm_struct *mm)
 #endif
 }
 
-static void mm_init_owner(struct mm_struct *mm, struct task_struct *p)
-{
-#ifdef CONFIG_MEMCG
-	mm->owner = p;
-#endif
-}
-
 static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 {
 	mm->mmap = NULL;
@@ -596,7 +589,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
 	spin_lock_init(&mm->page_table_lock);
 	mm_init_cpumask(mm);
 	mm_init_aio(mm);
-	mm_init_owner(mm, p);
 	mmu_notifier_mm_init(mm);
 	clear_tlb_flush_pending(mm);
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
@@ -702,6 +694,7 @@ void mmput(struct mm_struct *mm)
 		}
 		if (mm->binfmt)
 			module_put(mm->binfmt->module);
+		mm_drop_memcg(mm);
 		mmdrop(mm);
 	}
 }
@@ -925,6 +918,7 @@ static struct mm_struct *dup_mm(struct task_struct *tsk)
 	if (mm->binfmt && !try_module_get(mm->binfmt->module))
 		goto free_pt;
 
+	mm_inherit_memcg(mm, oldmm);
 	return mm;
 
 free_pt:
diff --git a/mm/debug.c b/mm/debug.c
index 3eb3ac2fcee7..d0347a168651 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -184,7 +184,7 @@ void dump_mm(const struct mm_struct *mm)
 		"ioctx_table %p\n"
 #endif
 #ifdef CONFIG_MEMCG
-		"owner %p "
+		"memcg %p "
 #endif
 		"exe_file %p\n"
 #ifdef CONFIG_MMU_NOTIFIER
@@ -218,7 +218,7 @@ void dump_mm(const struct mm_struct *mm)
 		mm->ioctx_table,
 #endif
 #ifdef CONFIG_MEMCG
-		mm->owner,
+		mm->memcg,
 #endif
 		mm->exe_file,
 #ifdef CONFIG_MMU_NOTIFIER
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1f10f90da4ef..33d2ed086673 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -292,6 +292,18 @@ static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
 	return mem_cgroup_from_css(css);
 }
 
+static struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
+{
+	if (p->mm)
+		return rcu_dereference(p->mm->memcg);
+
+	/*
+	 * If the process doesn't have mm struct anymore we have to fallback
+	 * to the task_css.
+	 */
+	return mem_cgroup_from_css(task_css(p, memory_cgrp_id));
+}
+
 /* Writing them here to avoid exposing memcg's inner layout */
 #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
 
@@ -762,19 +774,6 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
 	}
 }
 
-struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
-{
-	/*
-	 * mm_update_next_owner() may clear mm->owner to NULL
-	 * if it races with swapoff, page migration, etc.
-	 * So this can be called with p == NULL.
-	 */
-	if (unlikely(!p))
-		return NULL;
-
-	return mem_cgroup_from_css(task_css(p, memory_cgrp_id));
-}
-
 static struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
 {
 	struct mem_cgroup *memcg = NULL;
@@ -789,7 +788,7 @@ static struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
 		if (unlikely(!mm))
 			memcg = root_mem_cgroup;
 		else {
-			memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
+			memcg = rcu_dereference(mm->memcg);
 			if (unlikely(!memcg))
 				memcg = root_mem_cgroup;
 		}
@@ -2284,7 +2283,7 @@ void __memcg_kmem_put_cache(struct kmem_cache *cachep)
 }
 
 /*
- * We need to verify if the allocation against current->mm->owner's memcg is
+ * We need to verify if the allocation against current->mm->memcg is
  * possible for the given order. But the page is not allocated yet, so we'll
  * need a further commit step to do the final arrangements.
  *
@@ -4611,7 +4610,7 @@ static void mem_cgroup_clear_mc(void)
 static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
 				 struct cgroup_taskset *tset)
 {
-	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+	struct mem_cgroup *to = mem_cgroup_from_css(css);
 	struct mem_cgroup *from;
 	struct task_struct *p;
 	struct mm_struct *mm;
@@ -4623,37 +4622,49 @@ static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
 	 * tunable will only affect upcoming migrations, not the current one.
 	 * So we need to save it, and keep it going.
 	 */
-	move_flags = READ_ONCE(memcg->move_charge_at_immigrate);
+	move_flags = READ_ONCE(to->move_charge_at_immigrate);
 	if (!move_flags)
 		return 0;
 
 	p = cgroup_taskset_first(tset);
-	from = mem_cgroup_from_task(p);
-
-	VM_BUG_ON(from == memcg);
+	if (!thread_group_leader(p))
+		return 0;
 
 	mm = get_task_mm(p);
 	if (!mm)
 		return 0;
-	/* We move charges only when we move a owner of the mm */
-	if (mm->owner == p) {
-		VM_BUG_ON(mc.from);
-		VM_BUG_ON(mc.to);
-		VM_BUG_ON(mc.precharge);
-		VM_BUG_ON(mc.moved_charge);
-		VM_BUG_ON(mc.moved_swap);
-
-		spin_lock(&mc.lock);
-		mc.from = from;
-		mc.to = memcg;
-		mc.flags = move_flags;
-		spin_unlock(&mc.lock);
-		/* We set mc.moving_task later */
-
-		ret = mem_cgroup_precharge_mc(mm);
-		if (ret)
-			mem_cgroup_clear_mc();
-	}
+
+	/*
+	 * The task's cgroup might be different from the one p->mm is associated
+	 * with because CLONE_VM is allowed without CLONE_THREAD. The task is
+	 * moving so we have to migrate from the memcg associated with its
+	 * address space.
+	 * No need to take a reference here because the memcg is pinned by the
+	 * mm_struct.
+	 */
+	from = READ_ONCE(mm->memcg);
+	if (!from)
+		from = root_mem_cgroup;
+	if (from == to)
+		goto out;
+
+	VM_BUG_ON(mc.from);
+	VM_BUG_ON(mc.to);
+	VM_BUG_ON(mc.precharge);
+	VM_BUG_ON(mc.moved_charge);
+	VM_BUG_ON(mc.moved_swap);
+
+	spin_lock(&mc.lock);
+	mc.from = from;
+	mc.to = to;
+	mc.flags = move_flags;
+	spin_unlock(&mc.lock);
+	/* We set mc.moving_task later */
+
+	ret = mem_cgroup_precharge_mc(mm);
+	if (ret)
+		mem_cgroup_clear_mc();
+out:
 	mmput(mm);
 	return ret;
 }
@@ -4806,14 +4817,26 @@ static void mem_cgroup_move_task(struct cgroup_subsys_state *css,
 {
 	struct task_struct *p = cgroup_taskset_first(tset);
 	struct mm_struct *mm = get_task_mm(p);
+	struct mem_cgroup *old_memcg = NULL;
 
 	if (mm) {
+		old_memcg = READ_ONCE(mm->memcg);
+		__mm_set_memcg(mm, mem_cgroup_from_css(css));
+
 		if (mc.to)
 			mem_cgroup_move_charge(mm);
 		mmput(mm);
 	}
 	if (mc.to)
 		mem_cgroup_clear_mc();
+
+	/*
+	 * Be careful and drop the reference only after we are done because
+	 * p's task_css memcg might be different from p->memcg and nothing else
+	 * might be pinning the old memcg.
+	 */
+	if (old_memcg)
+		css_put(&old_memcg->css);
 }
 #else	/* !CONFIG_MMU */
 static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
-- 
2.1.4

* [RFC -v2 7/7] memcg: get rid of mem_cgroup_from_task
  2015-05-29 11:57 [RFC 0/7 -v2] memcg cleanups + get rid of mm_struct::owner Michal Hocko
                   ` (5 preceding siblings ...)
  2015-05-29 11:57 ` [RFC -v2 6/7] memcg: get rid of mm_struct::owner Michal Hocko
@ 2015-05-29 11:57 ` Michal Hocko
  6 siblings, 0 replies; 8+ messages in thread
From: Michal Hocko @ 2015-05-29 11:57 UTC (permalink / raw)
  To: linux-mm
  Cc: Johannes Weiner, Andrew Morton, Tejun Heo, Oleg Nesterov,
	Vladimir Davydov, Greg Thelen, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro

mem_cgroup_from_task has always been a tricky API. It was added
by 78fb74669e80 ("Memory controller: accounting setup") for
mm_struct::mem_cgroup initialization. Later on it gained new callers,
mostly due to the mm_struct::mem_cgroup -> mm_struct::owner transition,
and most users had to do mem_cgroup_from_task(mm->owner) to get the
resulting memcg. Now that mm_struct::owner is gone this is no longer
necessary, yet the API is still confusing.

One tricky part has always been that the API sounds generic but it is
not really. mem_cgroup_from_task(current) doesn't necessarily mean the
same thing as current->mm->memcg (resp.
mem_cgroup_from_task(current->mm->owner) previously) because the mm
might be associated with a different cgroup than the task.

Another tricky part is that p->mm->memcg is unsafe if p != current, as
pointed out by Oleg, because nobody is holding a reference on that mm.
This is not a problem right now because we have only 2 callers in the
tree: sock_update_memcg operates on current, and task_in_mem_cgroup is
given a non-NULL task so it can always use task_css.
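
A hypothetical p != current user would have to pin the mm first; a
sketch (not code from this series):

struct mm_struct *mm = get_task_mm(p);	/* takes a reference on the mm */
struct mem_cgroup *memcg = NULL;

if (mm) {
	rcu_read_lock();
	memcg = rcu_dereference(mm->memcg);
	if (memcg && !css_tryget_online(&memcg->css))
		memcg = NULL;
	rcu_read_unlock();
	mmput(mm);
}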

Let's ditch this function and use current->mm->memcg in
sock_update_memcg and task_css in task_in_mem_cgroup. This doesn't have
any functional effect.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c | 24 +++++++-----------------
 1 file changed, 7 insertions(+), 17 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 33d2ed086673..7461a00cb3ee 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -292,18 +292,6 @@ static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
 	return mem_cgroup_from_css(css);
 }
 
-static struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
-{
-	if (p->mm)
-		return rcu_dereference(p->mm->memcg);
-
-	/*
-	 * If the process doesn't have mm struct anymore we have to fallback
-	 * to the task_css.
-	 */
-	return mem_cgroup_from_css(task_css(p, memory_cgrp_id));
-}
-
 /* Writing them here to avoid exposing memcg's inner layout */
 #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
 
@@ -330,7 +318,7 @@ void sock_update_memcg(struct sock *sk)
 		}
 
 		rcu_read_lock();
-		memcg = mem_cgroup_from_task(current);
+		memcg = rcu_dereference(current->mm->memcg);
 		cg_proto = sk->sk_prot->proto_cgroup(memcg);
 		if (cg_proto && memcg_proto_active(cg_proto) &&
 		    css_tryget_online(&memcg->css)) {
@@ -1070,12 +1058,14 @@ bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg)
 		task_unlock(p);
 	} else {
 		/*
-		 * All threads may have already detached their mm's, but the oom
-		 * killer still needs to detect if they have already been oom
-		 * killed to prevent needlessly killing additional tasks.
+		 * All threads have already detached their mm's but we should
+		 * still be able to at least guess the original memcg from the
+		 * task_css. The two will match most of the time but there are
+		 * corner cases where task->mm and task_css refer to different
+		 * cgroups.
 		 */
 		rcu_read_lock();
-		task_memcg = mem_cgroup_from_task(task);
+		task_memcg = mem_cgroup_from_css(task_css(task, memory_cgrp_id));
 		css_get(&task_memcg->css);
 		rcu_read_unlock();
 	}
-- 
2.1.4
