* [patch 0/3] mm: memcontrol: eliminate charge reparenting
@ 2014-09-20 20:00 Johannes Weiner
  2014-09-20 20:00 ` [patch 1/3] mm: memcontrol: take a css reference for each charged page Johannes Weiner
                   ` (4 more replies)
  0 siblings, 5 replies; 14+ messages in thread
From: Johannes Weiner @ 2014-09-20 20:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Michal Hocko, Vladimir Davydov, Greg Thelen, Tejun Heo, cgroups,
	linux-kernel

Hi,

we've come a looong way when it comes to the basic cgroups model, and
the recent changes there open up a lot of opportunity to make drastic
simplifications to memory cgroups as well.

The decoupling of css from the user-visible cgroup, word-sized per-cpu
css reference counters, and css iterators that include offlined groups
means we can take per-charge css references, continue to reclaim from
offlined groups, and so get rid of the error-prone charge reparenting.
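
To make the model concrete, here is a minimal user-space sketch of the
lifetime rule this enables (hypothetical names, not the kernel
interfaces): every charged page holds a reference on its group, so an
offlined group simply lingers, still visible to reclaim, until its last
charge is gone, and nothing ever has to be moved to the parent.

	#include <stdio.h>
	#include <stdlib.h>

	/* user-space analogy only, not kernel code */
	struct grp {
		long refcnt;		/* one base ref + one per charged page */
		long nr_charged;
	};

	static void grp_put_many(struct grp *g, long n)
	{
		g->refcnt -= n;
		if (g->refcnt == 0) {		/* release callback */
			printf("group freed after last uncharge\n");
			free(g);
		}
	}

	static void charge(struct grp *g, long nr_pages)
	{
		g->nr_charged += nr_pages;
		g->refcnt += nr_pages;		/* css_get_many() analogue */
	}

	static void uncharge(struct grp *g, long nr_pages)
	{
		g->nr_charged -= nr_pages;
		grp_put_many(g, nr_pages);	/* css_put_many() analogue */
	}

	int main(void)
	{
		struct grp *g = calloc(1, sizeof(*g));

		if (!g)
			return 1;
		g->refcnt = 1;		/* base reference, dropped at offline */
		charge(g, 3);
		grp_put_many(g, 1);	/* "rmdir": group outlives the cgroup */
		uncharge(g, 3);		/* last uncharge frees it */
		return 0;
	}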

Combined with the higher-order reclaim fixes, lockless page counters,
and memcg iterator simplification I sent on Friday, the memory cgroup
core code is finally no longer the biggest file in mm/.  Yay!

These patches are based on mmotm + the above-mentioned changes + Tj's
percpu-refcount conversion to atomic_long_t.

Thanks!

 include/linux/cgroup.h          |  26 +++
 include/linux/percpu-refcount.h |  43 ++++-
 mm/memcontrol.c                 | 337 ++------------------------------------
 3 files changed, 75 insertions(+), 331 deletions(-)



* [patch 1/3] mm: memcontrol: take a css reference for each charged page
  2014-09-20 20:00 [patch 0/3] mm: memcontrol: eliminate charge reparenting Johannes Weiner
@ 2014-09-20 20:00 ` Johannes Weiner
  2014-09-22  8:24   ` Vladimir Davydov
  2014-10-08 13:27   ` Michal Hocko
  2014-09-20 20:00 ` [patch 2/3] mm: memcontrol: remove obsolete kmemcg pinning tricks Johannes Weiner
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 14+ messages in thread
From: Johannes Weiner @ 2014-09-20 20:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Michal Hocko, Vladimir Davydov, Greg Thelen, Tejun Heo, cgroups,
	linux-kernel

Charges currently pin the css indirectly by playing tricks during
css_offline(): user pages stall the offlining process until all of
them have been reparented, whereas kmemcg acquires a keep-alive
reference if outstanding kernel pages are detected at that point.

In preparation for removing all this complexity, make the pinning
explicit and acquire a css reference for every charged page.
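
(Illustrative sketch of the pairing, not part of the patch; the actual
hunks follow below:)

	/* charge slow path in try_charge(), once the page counters
	 * have been charged successfully: */
	css_get_many(&memcg->css, batch);

	/* matching puts on the uncharge side, e.g. in drain_stock(),
	 * cancel_charge(), memcg_uncharge_kmem() and uncharge_batch(): */
	css_put_many(&memcg->css, nr_pages);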

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 include/linux/cgroup.h          | 26 +++++++++++++++++++++++++
 include/linux/percpu-refcount.h | 43 ++++++++++++++++++++++++++++++++---------
 mm/memcontrol.c                 | 17 +++++++++++++++-
 3 files changed, 76 insertions(+), 10 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index b5223c570eba..a9fe70d9c7c5 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -113,6 +113,19 @@ static inline void css_get(struct cgroup_subsys_state *css)
 }
 
 /**
+ * css_get_many - obtain references on the specified css
+ * @css: target css
+ * @n: number of references to get
+ *
+ * The caller must already have a reference.
+ */
+static inline void css_get_many(struct cgroup_subsys_state *css, unsigned int n)
+{
+	if (!(css->flags & CSS_NO_REF))
+		percpu_ref_get_many(&css->refcnt, n);
+}
+
+/**
  * css_tryget - try to obtain a reference on the specified css
  * @css: target css
  *
@@ -159,6 +172,19 @@ static inline void css_put(struct cgroup_subsys_state *css)
 		percpu_ref_put(&css->refcnt);
 }
 
+/**
+ * css_put_many - put css references
+ * @css: target css
+ * @n: number of references to put
+ *
+ * Put references obtained via css_get() and css_tryget_online().
+ */
+static inline void css_put_many(struct cgroup_subsys_state *css, unsigned int n)
+{
+	if (!(css->flags & CSS_NO_REF))
+		percpu_ref_put_many(&css->refcnt, n);
+}
+
 /* bits in struct cgroup flags field */
 enum {
 	/*
diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index b1973ba1d5f3..a4551456e06f 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -112,26 +112,38 @@ static inline bool __pcpu_ref_alive(struct percpu_ref *ref,
 }
 
 /**
- * percpu_ref_get - increment a percpu refcount
+ * percpu_ref_get_many - increment a percpu refcount
  * @ref: percpu_ref to get
+ * @nr: number of references to get
  *
- * Analagous to atomic_inc().
+ * Analogous to atomic_add().
   */
-static inline void percpu_ref_get(struct percpu_ref *ref)
+static inline void percpu_ref_get_many(struct percpu_ref *ref, unsigned long nr)
 {
 	unsigned long __percpu *pcpu_count;
 
 	rcu_read_lock_sched();
 
 	if (__pcpu_ref_alive(ref, &pcpu_count))
-		this_cpu_inc(*pcpu_count);
+		this_cpu_add(*pcpu_count, nr);
 	else
-		atomic_long_inc(&ref->count);
+		atomic_long_add(nr, &ref->count);
 
 	rcu_read_unlock_sched();
 }
 
 /**
+ * percpu_ref_get - increment a percpu refcount
+ * @ref: percpu_ref to get
+ *
+ * Analogous to atomic_inc().
+  */
+static inline void percpu_ref_get(struct percpu_ref *ref)
+{
+	percpu_ref_get_many(ref, 1);
+}
+
+/**
  * percpu_ref_tryget - try to increment a percpu refcount
  * @ref: percpu_ref to try-get
  *
@@ -191,27 +203,40 @@ static inline bool percpu_ref_tryget_live(struct percpu_ref *ref)
 }
 
 /**
- * percpu_ref_put - decrement a percpu refcount
+ * percpu_ref_put_many - decrement a percpu refcount
  * @ref: percpu_ref to put
+ * @nr: number of references to put
  *
  * Decrement the refcount, and if 0, call the release function (which was passed
  * to percpu_ref_init())
  */
-static inline void percpu_ref_put(struct percpu_ref *ref)
+static inline void percpu_ref_put_many(struct percpu_ref *ref, unsigned long nr)
 {
 	unsigned long __percpu *pcpu_count;
 
 	rcu_read_lock_sched();
 
 	if (__pcpu_ref_alive(ref, &pcpu_count))
-		this_cpu_dec(*pcpu_count);
-	else if (unlikely(atomic_long_dec_and_test(&ref->count)))
+		this_cpu_sub(*pcpu_count, nr);
+	else if (unlikely(atomic_long_sub_and_test(nr, &ref->count)))
 		ref->release(ref);
 
 	rcu_read_unlock_sched();
 }
 
 /**
+ * percpu_ref_put - decrement a percpu refcount
+ * @ref: percpu_ref to put
+ *
+ * Decrement the refcount, and if 0, call the release function (which was passed
+ * to percpu_ref_init())
+ */
+static inline void percpu_ref_put(struct percpu_ref *ref)
+{
+	percpu_ref_put_many(ref, 1);
+}
+
+/**
  * percpu_ref_is_zero - test whether a percpu refcount reached zero
  * @ref: percpu_ref to test
  *
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 154161bb7d4c..b832c87ec43b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2317,6 +2317,7 @@ static void drain_stock(struct memcg_stock_pcp *stock)
 		page_counter_uncharge(&old->memory, stock->nr_pages);
 		if (do_swap_account)
 			page_counter_uncharge(&old->memsw, stock->nr_pages);
+		css_put_many(&old->css, stock->nr_pages);
 		stock->nr_pages = 0;
 	}
 	stock->cached = NULL;
@@ -2573,6 +2574,7 @@ bypass:
 	return -EINTR;
 
 done_restock:
+	css_get_many(&memcg->css, batch);
 	if (batch > nr_pages)
 		refill_stock(memcg, batch - nr_pages);
 done:
@@ -2587,6 +2589,8 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
 	page_counter_uncharge(&memcg->memory, nr_pages);
 	if (do_swap_account)
 		page_counter_uncharge(&memcg->memsw, nr_pages);
+
+	css_put_many(&memcg->css, nr_pages);
 }
 
 /*
@@ -2788,6 +2792,7 @@ static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp,
 		page_counter_charge(&memcg->memory, nr_pages, NULL);
 		if (do_swap_account)
 			page_counter_charge(&memcg->memsw, nr_pages, NULL);
+		css_get_many(&memcg->css, nr_pages);
 		ret = 0;
 	} else if (ret)
 		page_counter_uncharge(&memcg->kmem, nr_pages);
@@ -2803,8 +2808,10 @@ static void memcg_uncharge_kmem(struct mem_cgroup *memcg,
 		page_counter_uncharge(&memcg->memsw, nr_pages);
 
 	/* Not down to 0 */
-	if (page_counter_uncharge(&memcg->kmem, nr_pages))
+	if (page_counter_uncharge(&memcg->kmem, nr_pages)) {
+		css_put_many(&memcg->css, nr_pages);
 		return;
+	}
 
 	/*
 	 * Releases a reference taken in kmem_cgroup_css_offline in case
@@ -2816,6 +2823,8 @@ static void memcg_uncharge_kmem(struct mem_cgroup *memcg,
 	 */
 	if (memcg_kmem_test_and_clear_dead(memcg))
 		css_put(&memcg->css);
+
+	css_put_many(&memcg->css, nr_pages);
 }
 
 /*
@@ -3444,10 +3453,13 @@ static int mem_cgroup_move_parent(struct page *page,
 	ret = mem_cgroup_move_account(page, nr_pages,
 				pc, child, parent);
 	if (!ret) {
+		if (!mem_cgroup_is_root(parent))
+			css_get_many(&parent->css, nr_pages);
 		/* Take charge off the local counters */
 		page_counter_cancel(&child->memory, nr_pages);
 		if (do_swap_account)
 			page_counter_cancel(&child->memsw, nr_pages);
+		css_put_many(&child->css, nr_pages);
 	}
 
 	if (nr_pages > 1)
@@ -6379,6 +6391,9 @@ static void uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
 	__this_cpu_add(memcg->stat->nr_page_events, nr_anon + nr_file);
 	memcg_check_events(memcg, dummy_page);
 	local_irq_restore(flags);
+
+	if (!mem_cgroup_is_root(memcg))
+		css_put_many(&memcg->css, max(nr_mem, nr_memsw));
 }
 
 static void uncharge_list(struct list_head *page_list)
-- 
2.1.0



* [patch 2/3] mm: memcontrol: remove obsolete kmemcg pinning tricks
  2014-09-20 20:00 [patch 0/3] mm: memcontrol: eliminate charge reparenting Johannes Weiner
  2014-09-20 20:00 ` [patch 1/3] mm: memcontrol: take a css reference for each charged page Johannes Weiner
@ 2014-09-20 20:00 ` Johannes Weiner
  2014-09-22  8:25   ` Vladimir Davydov
  2014-10-08 13:32   ` Michal Hocko
  2014-09-20 20:00 ` [patch 3/3] mm: memcontrol: continue cache reclaim from offlined groups Johannes Weiner
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 14+ messages in thread
From: Johannes Weiner @ 2014-09-20 20:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Michal Hocko, Vladimir Davydov, Greg Thelen, Tejun Heo, cgroups,
	linux-kernel

As charges now pin the css explicitly, there is no more need for
kmemcg to acquire a proxy reference for outstanding pages during
offlining, or maintain state to identify such "dead" groups.
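
With the tricks gone, the kmem uncharge path reduces to a plain counter
and reference drop; roughly (sketch, see the full hunk below):

	static void memcg_uncharge_kmem(struct mem_cgroup *memcg,
					unsigned long nr_pages)
	{
		/* ... memory/memsw page_counter uncharges unchanged ... */

		page_counter_uncharge(&memcg->kmem, nr_pages);

		css_put_many(&memcg->css, nr_pages);
	}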

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 74 +--------------------------------------------------------
 1 file changed, 1 insertion(+), 73 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b832c87ec43b..019a44ac25d6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -471,7 +471,6 @@ struct mem_cgroup {
 /* internal only representation about the status of kmem accounting. */
 enum {
 	KMEM_ACCOUNTED_ACTIVE, /* accounted by this cgroup itself */
-	KMEM_ACCOUNTED_DEAD, /* dead memcg with pending kmem charges */
 };
 
 #ifdef CONFIG_MEMCG_KMEM
@@ -485,22 +484,6 @@ static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
 	return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
 }
 
-static void memcg_kmem_mark_dead(struct mem_cgroup *memcg)
-{
-	/*
-	 * Our caller must use css_get() first, because memcg_uncharge_kmem()
-	 * will call css_put() if it sees the memcg is dead.
-	 */
-	smp_wmb();
-	if (test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags))
-		set_bit(KMEM_ACCOUNTED_DEAD, &memcg->kmem_account_flags);
-}
-
-static bool memcg_kmem_test_and_clear_dead(struct mem_cgroup *memcg)
-{
-	return test_and_clear_bit(KMEM_ACCOUNTED_DEAD,
-				  &memcg->kmem_account_flags);
-}
 #endif
 
 /* Stuffs for move charges at task migration. */
@@ -2807,22 +2790,7 @@ static void memcg_uncharge_kmem(struct mem_cgroup *memcg,
 	if (do_swap_account)
 		page_counter_uncharge(&memcg->memsw, nr_pages);
 
-	/* Not down to 0 */
-	if (page_counter_uncharge(&memcg->kmem, nr_pages)) {
-		css_put_many(&memcg->css, nr_pages);
-		return;
-	}
-
-	/*
-	 * Releases a reference taken in kmem_cgroup_css_offline in case
-	 * this last uncharge is racing with the offlining code or it is
-	 * outliving the memcg existence.
-	 *
-	 * The memory barrier imposed by test&clear is paired with the
-	 * explicit one in memcg_kmem_mark_dead().
-	 */
-	if (memcg_kmem_test_and_clear_dead(memcg))
-		css_put(&memcg->css);
+	page_counter_uncharge(&memcg->kmem, nr_pages);
 
 	css_put_many(&memcg->css, nr_pages);
 }
@@ -4805,40 +4773,6 @@ static void memcg_destroy_kmem(struct mem_cgroup *memcg)
 {
 	mem_cgroup_sockets_destroy(memcg);
 }
-
-static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
-{
-	if (!memcg_kmem_is_active(memcg))
-		return;
-
-	/*
-	 * kmem charges can outlive the cgroup. In the case of slab
-	 * pages, for instance, a page contain objects from various
-	 * processes. As we prevent from taking a reference for every
-	 * such allocation we have to be careful when doing uncharge
-	 * (see memcg_uncharge_kmem) and here during offlining.
-	 *
-	 * The idea is that that only the _last_ uncharge which sees
-	 * the dead memcg will drop the last reference. An additional
-	 * reference is taken here before the group is marked dead
-	 * which is then paired with css_put during uncharge resp. here.
-	 *
-	 * Although this might sound strange as this path is called from
-	 * css_offline() when the referencemight have dropped down to 0 and
-	 * shouldn't be incremented anymore (css_tryget_online() would
-	 * fail) we do not have other options because of the kmem
-	 * allocations lifetime.
-	 */
-	css_get(&memcg->css);
-
-	memcg_kmem_mark_dead(memcg);
-
-	if (atomic_long_read(&memcg->kmem.count))
-		return;
-
-	if (memcg_kmem_test_and_clear_dead(memcg))
-		css_put(&memcg->css);
-}
 #else
 static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 {
@@ -4848,10 +4782,6 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 static void memcg_destroy_kmem(struct mem_cgroup *memcg)
 {
 }
-
-static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
-{
-}
 #endif
 
 /*
@@ -5443,8 +5373,6 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	}
 	spin_unlock(&memcg->event_list_lock);
 
-	kmem_cgroup_css_offline(memcg);
-
 	/*
 	 * This requires that offlining is serialized.  Right now that is
 	 * guaranteed because css_killed_work_fn() holds the cgroup_mutex.
-- 
2.1.0



* [patch 3/3] mm: memcontrol: continue cache reclaim from offlined groups
  2014-09-20 20:00 [patch 0/3] mm: memcontrol: eliminate charge reparenting Johannes Weiner
  2014-09-20 20:00 ` [patch 1/3] mm: memcontrol: take a css reference for each charged page Johannes Weiner
  2014-09-20 20:00 ` [patch 2/3] mm: memcontrol: remove obsolete kmemcg pinning tricks Johannes Weiner
@ 2014-09-20 20:00 ` Johannes Weiner
  2014-09-22  8:32   ` Vladimir Davydov
  2014-10-08 14:03   ` Michal Hocko
  2014-09-21 15:50 ` [patch 0/3] mm: memcontrol: eliminate charge reparenting Vladimir Davydov
  2014-10-08 12:48 ` Michal Hocko
  4 siblings, 2 replies; 14+ messages in thread
From: Johannes Weiner @ 2014-09-20 20:00 UTC (permalink / raw)
  To: linux-mm
  Cc: Michal Hocko, Vladimir Davydov, Greg Thelen, Tejun Heo, cgroups,
	linux-kernel

On cgroup deletion, outstanding page cache charges are moved to the
parent group so that they're not lost and can be reclaimed during
pressure on/inside said parent.  But this reparenting is fairly tricky
and its synchronous nature has led to several lock-ups in the past.

Since css iterators now also include offlined css, memcg iterators can
be changed to include offlined children during reclaim of a group, and
leftover cache can just stay put.

There is a slight change of behavior in that charges of deleted groups
no longer show up as local charges in the parent.  But they are still
included in the parent's hierarchical statistics.
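
The crux is a one-line change in mem_cgroup_iter(): children are now
pinned with css_tryget() instead of css_tryget_online(), so offlined
groups stay visible to the reclaim iterator (sketch of the hunk below):

	if (css == &root->css || css_tryget(css)) {
		memcg = mem_cgroup_from_css(css);
		break;
	}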

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 260 ++------------------------------------------------------
 1 file changed, 5 insertions(+), 255 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 019a44ac25d6..48531433a2fc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -736,8 +736,6 @@ static void disarm_static_keys(struct mem_cgroup *memcg)
 	disarm_kmem_keys(memcg);
 }
 
-static void drain_all_stock_async(struct mem_cgroup *memcg);
-
 static struct mem_cgroup_per_zone *
 mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct zone *zone)
 {
@@ -1208,7 +1206,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 				goto out_unlock;
 			continue;
 		}
-		if (css == &root->css || css_tryget_online(css)) {
+		if (css == &root->css || css_tryget(css)) {
 			memcg = mem_cgroup_from_css(css);
 			break;
 		}
@@ -2349,10 +2347,12 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
  * of the hierarchy under it. sync flag says whether we should block
  * until the work is done.
  */
-static void drain_all_stock(struct mem_cgroup *root_memcg, bool sync)
+static void drain_all_stock(struct mem_cgroup *root_memcg)
 {
 	int cpu, curcpu;
 
+	if (!mutex_trylock(&percpu_charge_mutex))
+		return;
 	/* Notify other cpus that system-wide "drain" is running */
 	get_online_cpus();
 	curcpu = get_cpu();
@@ -2373,41 +2373,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg, bool sync)
 		}
 	}
 	put_cpu();
-
-	if (!sync)
-		goto out;
-
-	for_each_online_cpu(cpu) {
-		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
-		if (test_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
-			flush_work(&stock->work);
-	}
-out:
 	put_online_cpus();
-}
-
-/*
- * Tries to drain stocked charges in other cpus. This function is asynchronous
- * and just put a work per cpu for draining localy on each cpu. Caller can
- * expects some charges will be back later but cannot wait for it.
- */
-static void drain_all_stock_async(struct mem_cgroup *root_memcg)
-{
-	/*
-	 * If someone calls draining, avoid adding more kworker runs.
-	 */
-	if (!mutex_trylock(&percpu_charge_mutex))
-		return;
-	drain_all_stock(root_memcg, false);
-	mutex_unlock(&percpu_charge_mutex);
-}
-
-/* This is a synchronous drain interface. */
-static void drain_all_stock_sync(struct mem_cgroup *root_memcg)
-{
-	/* called when force_empty is called */
-	mutex_lock(&percpu_charge_mutex);
-	drain_all_stock(root_memcg, true);
 	mutex_unlock(&percpu_charge_mutex);
 }
 
@@ -2515,7 +2481,7 @@ retry:
 		goto retry;
 
 	if (!drained) {
-		drain_all_stock_async(mem_over_limit);
+		drain_all_stock(mem_over_limit);
 		drained = true;
 		goto retry;
 	}
@@ -3366,79 +3332,6 @@ out:
 	return ret;
 }
 
-/**
- * mem_cgroup_move_parent - moves page to the parent group
- * @page: the page to move
- * @pc: page_cgroup of the page
- * @child: page's cgroup
- *
- * move charges to its parent or the root cgroup if the group has no
- * parent (aka use_hierarchy==0).
- * Although this might fail (get_page_unless_zero, isolate_lru_page or
- * mem_cgroup_move_account fails) the failure is always temporary and
- * it signals a race with a page removal/uncharge or migration. In the
- * first case the page is on the way out and it will vanish from the LRU
- * on the next attempt and the call should be retried later.
- * Isolation from the LRU fails only if page has been isolated from
- * the LRU since we looked at it and that usually means either global
- * reclaim or migration going on. The page will either get back to the
- * LRU or vanish.
- * Finaly mem_cgroup_move_account fails only if the page got uncharged
- * (!PageCgroupUsed) or moved to a different group. The page will
- * disappear in the next attempt.
- */
-static int mem_cgroup_move_parent(struct page *page,
-				  struct page_cgroup *pc,
-				  struct mem_cgroup *child)
-{
-	struct mem_cgroup *parent;
-	unsigned int nr_pages;
-	unsigned long uninitialized_var(flags);
-	int ret;
-
-	VM_BUG_ON(mem_cgroup_is_root(child));
-
-	ret = -EBUSY;
-	if (!get_page_unless_zero(page))
-		goto out;
-	if (isolate_lru_page(page))
-		goto put;
-
-	nr_pages = hpage_nr_pages(page);
-
-	parent = parent_mem_cgroup(child);
-	/*
-	 * If no parent, move charges to root cgroup.
-	 */
-	if (!parent)
-		parent = root_mem_cgroup;
-
-	if (nr_pages > 1) {
-		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-		flags = compound_lock_irqsave(page);
-	}
-
-	ret = mem_cgroup_move_account(page, nr_pages,
-				pc, child, parent);
-	if (!ret) {
-		if (!mem_cgroup_is_root(parent))
-			css_get_many(&parent->css, nr_pages);
-		/* Take charge off the local counters */
-		page_counter_cancel(&child->memory, nr_pages);
-		if (do_swap_account)
-			page_counter_cancel(&child->memsw, nr_pages);
-		css_put_many(&child->css, nr_pages);
-	}
-
-	if (nr_pages > 1)
-		compound_unlock_irqrestore(page, flags);
-	putback_lru_page(page);
-put:
-	put_page(page);
-out:
-	return ret;
-}
-
 #ifdef CONFIG_MEMCG_SWAP
 static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
 					 bool charge)
@@ -3732,105 +3625,6 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 	return nr_reclaimed;
 }
 
-/**
- * mem_cgroup_force_empty_list - clears LRU of a group
- * @memcg: group to clear
- * @node: NUMA node
- * @zid: zone id
- * @lru: lru to to clear
- *
- * Traverse a specified page_cgroup list and try to drop them all.  This doesn't
- * reclaim the pages page themselves - pages are moved to the parent (or root)
- * group.
- */
-static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
-				int node, int zid, enum lru_list lru)
-{
-	struct lruvec *lruvec;
-	unsigned long flags;
-	struct list_head *list;
-	struct page *busy;
-	struct zone *zone;
-
-	zone = &NODE_DATA(node)->node_zones[zid];
-	lruvec = mem_cgroup_zone_lruvec(zone, memcg);
-	list = &lruvec->lists[lru];
-
-	busy = NULL;
-	do {
-		struct page_cgroup *pc;
-		struct page *page;
-
-		spin_lock_irqsave(&zone->lru_lock, flags);
-		if (list_empty(list)) {
-			spin_unlock_irqrestore(&zone->lru_lock, flags);
-			break;
-		}
-		page = list_entry(list->prev, struct page, lru);
-		if (busy == page) {
-			list_move(&page->lru, list);
-			busy = NULL;
-			spin_unlock_irqrestore(&zone->lru_lock, flags);
-			continue;
-		}
-		spin_unlock_irqrestore(&zone->lru_lock, flags);
-
-		pc = lookup_page_cgroup(page);
-
-		if (mem_cgroup_move_parent(page, pc, memcg)) {
-			/* found lock contention or "pc" is obsolete. */
-			busy = page;
-		} else
-			busy = NULL;
-		cond_resched();
-	} while (!list_empty(list));
-}
-
-/*
- * make mem_cgroup's charge to be 0 if there is no task by moving
- * all the charges and pages to the parent.
- * This enables deleting this mem_cgroup.
- *
- * Caller is responsible for holding css reference on the memcg.
- */
-static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
-{
-	int node, zid;
-
-	do {
-		/* This is for making all *used* pages to be on LRU. */
-		lru_add_drain_all();
-		drain_all_stock_sync(memcg);
-		mem_cgroup_start_move(memcg);
-		for_each_node_state(node, N_MEMORY) {
-			for (zid = 0; zid < MAX_NR_ZONES; zid++) {
-				enum lru_list lru;
-				for_each_lru(lru) {
-					mem_cgroup_force_empty_list(memcg,
-							node, zid, lru);
-				}
-			}
-		}
-		mem_cgroup_end_move(memcg);
-		memcg_oom_recover(memcg);
-		cond_resched();
-
-		/*
-		 * Kernel memory may not necessarily be trackable to a specific
-		 * process. So they are not migrated, and therefore we can't
-		 * expect their value to drop to 0 here.
-		 * Having res filled up with kmem only is enough.
-		 *
-		 * This is a safety check because mem_cgroup_force_empty_list
-		 * could have raced with mem_cgroup_replace_page_cache callers
-		 * so the lru seemed empty but the page could have been added
-		 * right after the check. RES_USAGE should be safe as we always
-		 * charge before adding to the LRU.
-		 */
-	} while (atomic_long_read(&memcg->memory.count) -
-		 atomic_long_read(&memcg->kmem.count) > 0);
-}
-
 /*
  * Test whether @memcg has children, dead or alive.  Note that this
  * function doesn't care whether @memcg has use_hierarchy enabled and
@@ -5359,7 +5153,6 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 	struct mem_cgroup_event *event, *tmp;
-	struct cgroup_subsys_state *iter;
 
 	/*
 	 * Unregister events and notify userspace.
@@ -5373,13 +5166,6 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	}
 	spin_unlock(&memcg->event_list_lock);
 
-	/*
-	 * This requires that offlining is serialized.  Right now that is
-	 * guaranteed because css_killed_work_fn() holds the cgroup_mutex.
-	 */
-	css_for_each_descendant_post(iter, css)
-		mem_cgroup_reparent_charges(mem_cgroup_from_css(iter));
-
 	memcg_unregister_all_caches(memcg);
 	vmpressure_cleanup(&memcg->vmpressure);
 }
@@ -5387,42 +5173,6 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
-	/*
-	 * XXX: css_offline() would be where we should reparent all
-	 * memory to prepare the cgroup for destruction.  However,
-	 * memcg does not do css_tryget_online() and page_counter charging
-	 * under the same RCU lock region, which means that charging
-	 * could race with offlining.  Offlining only happens to
-	 * cgroups with no tasks in them but charges can show up
-	 * without any tasks from the swapin path when the target
-	 * memcg is looked up from the swapout record and not from the
-	 * current task as it usually is.  A race like this can leak
-	 * charges and put pages with stale cgroup pointers into
-	 * circulation:
-	 *
-	 * #0                        #1
-	 *                           lookup_swap_cgroup_id()
-	 *                           rcu_read_lock()
-	 *                           mem_cgroup_lookup()
-	 *                           css_tryget_online()
-	 *                           rcu_read_unlock()
-	 * disable css_tryget_online()
-	 * call_rcu()
-	 *   offline_css()
-	 *     reparent_charges()
-	 *                           page_counter_charge()
-	 *                           css_put()
-	 *                             css_free()
-	 *                           pc->mem_cgroup = dead memcg
-	 *                           add page to lru
-	 *
-	 * The bulk of the charges are still moved in offline_css() to
-	 * avoid pinning a lot of pages in case a long-term reference
-	 * like a swapout record is deferring the css_free() to long
-	 * after offlining.  But this makes sure we catch any charges
-	 * made after offlining:
-	 */
-	mem_cgroup_reparent_charges(memcg);
 
 	memcg_destroy_kmem(memcg);
 	__mem_cgroup_free(memcg);
-- 
2.1.0



* Re: [patch 0/3] mm: memcontrol: eliminate charge reparenting
  2014-09-20 20:00 [patch 0/3] mm: memcontrol: eliminate charge reparenting Johannes Weiner
                   ` (2 preceding siblings ...)
  2014-09-20 20:00 ` [patch 3/3] mm: memcontrol: continue cache reclaim from offlined groups Johannes Weiner
@ 2014-09-21 15:50 ` Vladimir Davydov
  2014-10-08 12:48 ` Michal Hocko
  4 siblings, 0 replies; 14+ messages in thread
From: Vladimir Davydov @ 2014-09-21 15:50 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Michal Hocko, Greg Thelen, Tejun Heo, cgroups, linux-kernel

Hi Johannes,

On Sat, Sep 20, 2014 at 04:00:32PM -0400, Johannes Weiner wrote:
> The decoupling of css from the user-visible cgroup, word-sized per-cpu
> css reference counters, and css iterators that include offlined groups
> means we can take per-charge css references, continue to reclaim from
> offlined groups, and so get rid of the error-prone charge reparenting.

I haven't reviewed this set yet, but I agree that zapping user memory
reparenting sounds like a sane idea, because reparenting won't let the
css go in most cases anyway due to swap and kmem charges.

However, I think we must reparent list_lru items, otherwise per-memcg
arrays (kmem_caches, list_lrus) will grow uncontrollably due to dead
css's, which is unacceptable. Note it isn't the same as the user memory
reparenting, because we don't need to reparent kmem_cache objects or
charges - they can stay where they are, pinning the css, until they are
freed, because the memcg_cache_id, which I want to free on offline, is
not used for kmem allocations/frees after css offline. Actually we only
need to empty the list_lru corresponding to the dead memory cgroup,
which is relatively easy to implement. This is what patch 13 of the "Per
memcg slab shrinkers" patch set, which I sent recently, does (see
https://lkml.org/lkml/2014/9/21/64).
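
Roughly what I have in mind, with hypothetical types rather than the
in-tree list_lru interface and with locking ignored:

	#include <linux/list.h>

	/* illustration only: splice a dead memcg's LRU items to its parent
	 * so the per-memcg array slot (memcg_cache_id) can be freed on
	 * offline; the objects themselves stay charged and keep pinning
	 * the css until they are freed */
	struct memcg_lru {
		struct list_head list;
		long nr_items;
	};

	static void reparent_lru_items(struct memcg_lru *child,
				       struct memcg_lru *parent)
	{
		list_splice_init(&child->list, &parent->list);
		parent->nr_items += child->nr_items;
		child->nr_items = 0;
	}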

What do you think about it?

Thanks,
Vladimir


* Re: [patch 1/3] mm: memcontrol: take a css reference for each charged page
  2014-09-20 20:00 ` [patch 1/3] mm: memcontrol: take a css reference for each charged page Johannes Weiner
@ 2014-09-22  8:24   ` Vladimir Davydov
  2014-10-08 13:27   ` Michal Hocko
  1 sibling, 0 replies; 14+ messages in thread
From: Vladimir Davydov @ 2014-09-22  8:24 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Michal Hocko, Greg Thelen, Tejun Heo, cgroups, linux-kernel

On Sat, Sep 20, 2014 at 04:00:33PM -0400, Johannes Weiner wrote:
> Charges currently pin the css indirectly by playing tricks during
> css_offline(): user pages stall the offlining process until all of
> them have been reparented, whereas kmemcg acquires a keep-alive
> reference if outstanding kernel pages are detected at that point.
> 
> In preparation for removing all this complexity, make the pinning
> explicit and acquire a css reference for every charged page.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>

Nit: In __mem_cgroup_clear_mc, we have the following hunk:

>		for (i = 0; i < mc.moved_swap; i++)
>			css_put(&mc.from->css);

Now we can simplify it using css_put_many.
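
i.e. something like (sketch):

		css_put_many(&mc.from->css, mc.moved_swap);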

Thanks,
Vladimir


* Re: [patch 2/3] mm: memcontrol: remove obsolete kmemcg pinning tricks
  2014-09-20 20:00 ` [patch 2/3] mm: memcontrol: remove obsolete kmemcg pinning tricks Johannes Weiner
@ 2014-09-22  8:25   ` Vladimir Davydov
  2014-10-08 13:32   ` Michal Hocko
  1 sibling, 0 replies; 14+ messages in thread
From: Vladimir Davydov @ 2014-09-22  8:25 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Michal Hocko, Greg Thelen, Tejun Heo, cgroups, linux-kernel

On Sat, Sep 20, 2014 at 04:00:34PM -0400, Johannes Weiner wrote:
> As charges now pin the css explicitly, there is no more need for
> kmemcg to acquire a proxy reference for outstanding pages during
> offlining, or maintain state to identify such "dead" groups.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>


* Re: [patch 3/3] mm: memcontrol: continue cache reclaim from offlined groups
  2014-09-20 20:00 ` [patch 3/3] mm: memcontrol: continue cache reclaim from offlined groups Johannes Weiner
@ 2014-09-22  8:32   ` Vladimir Davydov
  2014-10-08 14:03   ` Michal Hocko
  1 sibling, 0 replies; 14+ messages in thread
From: Vladimir Davydov @ 2014-09-22  8:32 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Michal Hocko, Greg Thelen, Tejun Heo, cgroups, linux-kernel

On Sat, Sep 20, 2014 at 04:00:35PM -0400, Johannes Weiner wrote:
> On cgroup deletion, outstanding page cache charges are moved to the
> parent group so that they're not lost and can be reclaimed during
> pressure on/inside said parent.  But this reparenting is fairly tricky
> and its synchronous nature has led to several lock-ups in the past.
> 
> Since css iterators now also include offlined css, memcg iterators can
> be changed to include offlined children during reclaim of a group, and
> leftover cache can just stay put.
> 
> There is a slight change of behavior in that charges of deleted groups
> no longer show up as local charges in the parent.  But they are still
> included in the parent's hierarchical statistics.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/memcontrol.c | 260 ++------------------------------------------------------
>  1 file changed, 5 insertions(+), 255 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 019a44ac25d6..48531433a2fc 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -736,8 +736,6 @@ static void disarm_static_keys(struct mem_cgroup *memcg)
>  	disarm_kmem_keys(memcg);
>  }
>  
> -static void drain_all_stock_async(struct mem_cgroup *memcg);
> -
>  static struct mem_cgroup_per_zone *
>  mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct zone *zone)
>  {
> @@ -1208,7 +1206,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  				goto out_unlock;
>  			continue;
>  		}
> -		if (css == &root->css || css_tryget_online(css)) {
> +		if (css == &root->css || css_tryget(css)) {
>  			memcg = mem_cgroup_from_css(css);
>  			break;
>  		}
> @@ -2349,10 +2347,12 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>   * of the hierarchy under it. sync flag says whether we should block

Please update the comment.

>   * until the work is done.
>   */
> -static void drain_all_stock(struct mem_cgroup *root_memcg, bool sync)
> +static void drain_all_stock(struct mem_cgroup *root_memcg)
>  {
>  	int cpu, curcpu;
>  
> +	if (!mutex_trylock(&percpu_charge_mutex))
> +		return;

It's not obvious why we need it here. The old code has an explanatory
comment. Could you please add one?
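
Something along the lines of the old comment would do, e.g.:

	/* If someone is already draining, avoid adding more kworker runs. */
	if (!mutex_trylock(&percpu_charge_mutex))
		return;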

Thanks,
Vladimir


* Re: [patch 0/3] mm: memcontrol: eliminate charge reparenting
  2014-09-20 20:00 [patch 0/3] mm: memcontrol: eliminate charge reparenting Johannes Weiner
                   ` (3 preceding siblings ...)
  2014-09-21 15:50 ` [patch 0/3] mm: memcontrol: eliminate charge reparenting Vladimir Davydov
@ 2014-10-08 12:48 ` Michal Hocko
  2014-10-08 14:17   ` Johannes Weiner
  4 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2014-10-08 12:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Vladimir Davydov, Greg Thelen, Tejun Heo, cgroups,
	linux-kernel

On Sat 20-09-14 16:00:32, Johannes Weiner wrote:
> Hi,
> 
> we've come a looong way when it comes to the basic cgroups model, and
> the recent changes there open up a lot of opportunity to make drastic
> simplifications to memory cgroups as well.
> 
> The decoupling of css from the user-visible cgroup, word-sized per-cpu
> css reference counters, and css iterators that include offlined groups
> means we can take per-charge css references, continue to reclaim from
> offlined groups, and so get rid of the error-prone charge reparenting.
> 
> Combined with the higher-order reclaim fixes, lockless page counters,
> and memcg iterator simplification I sent on Friday, the memory cgroup
> core code is finally no longer the biggest file in mm/.  Yay!

Yeah, the code reduction (as per the diffstat - I didn't get to the code
yet) seems really promising.

> These patches are based on mmotm + the above-mentioned changes

> + Tj's percpu-refcount conversion to atomic_long_t.

This is https://lkml.org/lkml/2014/9/20/11 right?

> Thanks!
> 
>  include/linux/cgroup.h          |  26 +++
>  include/linux/percpu-refcount.h |  43 ++++-
>  mm/memcontrol.c                 | 337 ++------------------------------------
>  3 files changed, 75 insertions(+), 331 deletions(-)
> 

-- 
Michal Hocko
SUSE Labs


* Re: [patch 1/3] mm: memcontrol: take a css reference for each charged page
  2014-09-20 20:00 ` [patch 1/3] mm: memcontrol: take a css reference for each charged page Johannes Weiner
  2014-09-22  8:24   ` Vladimir Davydov
@ 2014-10-08 13:27   ` Michal Hocko
  2014-10-08 13:29     ` Michal Hocko
  1 sibling, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2014-10-08 13:27 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Vladimir Davydov, Greg Thelen, Tejun Heo, cgroups,
	linux-kernel

On Sat 20-09-14 16:00:33, Johannes Weiner wrote:
> Charges currently pin the css indirectly by playing tricks during
> css_offline(): user pages stall the offlining process until all of
> them have been reparented, whereas kmemcg acquires a keep-alive
> reference if outstanding kernel pages are detected at that point.
> 
> In preparation for removing all this complexity, make the pinning
> explicit and acquire a css reference for every charged page.

OK, all the added {un}charges/atomics happen in page_counter paths so
there shouldn't be any noticeable overhead.

I cannot judge the percpu counter part but the rest seems OK to me.
Two minor suggestions below.

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

For the memcg part
Acked-by: Michal Hocko <mhocko@suse.cz>

> ---
>  include/linux/cgroup.h          | 26 +++++++++++++++++++++++++
>  include/linux/percpu-refcount.h | 43 ++++++++++++++++++++++++++++++++---------
>  mm/memcontrol.c                 | 17 +++++++++++++++-
>  3 files changed, 76 insertions(+), 10 deletions(-)
> 
[...]
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 154161bb7d4c..b832c87ec43b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2317,6 +2317,7 @@ static void drain_stock(struct memcg_stock_pcp *stock)
>  		page_counter_uncharge(&old->memory, stock->nr_pages);
>  		if (do_swap_account)
>  			page_counter_uncharge(&old->memsw, stock->nr_pages);

		/* pairs with css_get_many in try_charge */
> +		css_put_many(&old->css, stock->nr_pages);
>  		stock->nr_pages = 0;
>  	}
>  	stock->cached = NULL;
[...]
> @@ -2803,8 +2808,10 @@ static void memcg_uncharge_kmem(struct mem_cgroup *memcg,
>  		page_counter_uncharge(&memcg->memsw, nr_pages);
>  

Wouldn't a single out_css_put be more readable? I was quite confused
when I started reading the patch before I saw the next hunk.

>  	/* Not down to 0 */
> -	if (page_counter_uncharge(&memcg->kmem, nr_pages))
		goto out_css_put;

> +	if (page_counter_uncharge(&memcg->kmem, nr_pages)) {
> +		css_put_many(&memcg->css, nr_pages);
>  		return;
> +	}
>  
>  	/*
>  	 * Releases a reference taken in kmem_cgroup_css_offline in case
> @@ -2816,6 +2823,8 @@ static void memcg_uncharge_kmem(struct mem_cgroup *memcg,
>  	 */
>  	if (memcg_kmem_test_and_clear_dead(memcg))
>  		css_put(&memcg->css);
> +

out_css_put:
> +	css_put_many(&memcg->css, nr_pages);
>  }
>  
>  /*
-- 
Michal Hocko
SUSE Labs


* Re: [patch 1/3] mm: memcontrol: take a css reference for each charged page
  2014-10-08 13:27   ` Michal Hocko
@ 2014-10-08 13:29     ` Michal Hocko
  0 siblings, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2014-10-08 13:29 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Vladimir Davydov, Greg Thelen, Tejun Heo, cgroups,
	linux-kernel

On Wed 08-10-14 15:27:54, Michal Hocko wrote:
> On Sat 20-09-14 16:00:33, Johannes Weiner wrote:
[...]
> > @@ -2803,8 +2808,10 @@ static void memcg_uncharge_kmem(struct mem_cgroup *memcg,
> >  		page_counter_uncharge(&memcg->memsw, nr_pages);
> >  
> 
> Wouldn't a single out_css_put be more readable? I was quite confused
> when I started reading the patch before I saw the next hunk.

Ohh, this will go away in the next patch. Ignore this.

> 
> >  	/* Not down to 0 */
> > -	if (page_counter_uncharge(&memcg->kmem, nr_pages))
> 		goto out_css_put;
> 
> > +	if (page_counter_uncharge(&memcg->kmem, nr_pages)) {
> > +		css_put_many(&memcg->css, nr_pages);
> >  		return;
> > +	}
> >  
> >  	/*
> >  	 * Releases a reference taken in kmem_cgroup_css_offline in case
> > @@ -2816,6 +2823,8 @@ static void memcg_uncharge_kmem(struct mem_cgroup *memcg,
> >  	 */
> >  	if (memcg_kmem_test_and_clear_dead(memcg))
> >  		css_put(&memcg->css);
> > +
> 
> out_css_put:
> > +	css_put_many(&memcg->css, nr_pages);
> >  }
> >  
> >  /*
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs


* Re: [patch 2/3] mm: memcontrol: remove obsolete kmemcg pinning tricks
  2014-09-20 20:00 ` [patch 2/3] mm: memcontrol: remove obsolete kmemcg pinning tricks Johannes Weiner
  2014-09-22  8:25   ` Vladimir Davydov
@ 2014-10-08 13:32   ` Michal Hocko
  1 sibling, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2014-10-08 13:32 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Vladimir Davydov, Greg Thelen, Tejun Heo, cgroups,
	linux-kernel

On Sat 20-09-14 16:00:34, Johannes Weiner wrote:
> As charges now pin the css explicitly, there is no more need for
> kmemcg to acquire a proxy reference for outstanding pages during
> offlining, or maintain state to identify such "dead" groups.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

More than happy to get rid of this trickery.
Acked-by: Michal Hocko <mhocko@suse.cz>

> ---
>  mm/memcontrol.c | 74 +--------------------------------------------------------
>  1 file changed, 1 insertion(+), 73 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b832c87ec43b..019a44ac25d6 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -471,7 +471,6 @@ struct mem_cgroup {
>  /* internal only representation about the status of kmem accounting. */
>  enum {
>  	KMEM_ACCOUNTED_ACTIVE, /* accounted by this cgroup itself */
> -	KMEM_ACCOUNTED_DEAD, /* dead memcg with pending kmem charges */
>  };
>  
>  #ifdef CONFIG_MEMCG_KMEM
> @@ -485,22 +484,6 @@ static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
>  	return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
>  }
>  
> -static void memcg_kmem_mark_dead(struct mem_cgroup *memcg)
> -{
> -	/*
> -	 * Our caller must use css_get() first, because memcg_uncharge_kmem()
> -	 * will call css_put() if it sees the memcg is dead.
> -	 */
> -	smp_wmb();
> -	if (test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags))
> -		set_bit(KMEM_ACCOUNTED_DEAD, &memcg->kmem_account_flags);
> -}
> -
> -static bool memcg_kmem_test_and_clear_dead(struct mem_cgroup *memcg)
> -{
> -	return test_and_clear_bit(KMEM_ACCOUNTED_DEAD,
> -				  &memcg->kmem_account_flags);
> -}
>  #endif
>  
>  /* Stuffs for move charges at task migration. */
> @@ -2807,22 +2790,7 @@ static void memcg_uncharge_kmem(struct mem_cgroup *memcg,
>  	if (do_swap_account)
>  		page_counter_uncharge(&memcg->memsw, nr_pages);
>  
> -	/* Not down to 0 */
> -	if (page_counter_uncharge(&memcg->kmem, nr_pages)) {
> -		css_put_many(&memcg->css, nr_pages);
> -		return;
> -	}
> -
> -	/*
> -	 * Releases a reference taken in kmem_cgroup_css_offline in case
> -	 * this last uncharge is racing with the offlining code or it is
> -	 * outliving the memcg existence.
> -	 *
> -	 * The memory barrier imposed by test&clear is paired with the
> -	 * explicit one in memcg_kmem_mark_dead().
> -	 */
> -	if (memcg_kmem_test_and_clear_dead(memcg))
> -		css_put(&memcg->css);
> +	page_counter_uncharge(&memcg->kmem, nr_pages);
>  
>  	css_put_many(&memcg->css, nr_pages);
>  }
> @@ -4805,40 +4773,6 @@ static void memcg_destroy_kmem(struct mem_cgroup *memcg)
>  {
>  	mem_cgroup_sockets_destroy(memcg);
>  }
> -
> -static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
> -{
> -	if (!memcg_kmem_is_active(memcg))
> -		return;
> -
> -	/*
> -	 * kmem charges can outlive the cgroup. In the case of slab
> -	 * pages, for instance, a page contain objects from various
> -	 * processes. As we prevent from taking a reference for every
> -	 * such allocation we have to be careful when doing uncharge
> -	 * (see memcg_uncharge_kmem) and here during offlining.
> -	 *
> -	 * The idea is that that only the _last_ uncharge which sees
> -	 * the dead memcg will drop the last reference. An additional
> -	 * reference is taken here before the group is marked dead
> -	 * which is then paired with css_put during uncharge resp. here.
> -	 *
> -	 * Although this might sound strange as this path is called from
> -	 * css_offline() when the referencemight have dropped down to 0 and
> -	 * shouldn't be incremented anymore (css_tryget_online() would
> -	 * fail) we do not have other options because of the kmem
> -	 * allocations lifetime.
> -	 */
> -	css_get(&memcg->css);
> -
> -	memcg_kmem_mark_dead(memcg);
> -
> -	if (atomic_long_read(&memcg->kmem.count))
> -		return;
> -
> -	if (memcg_kmem_test_and_clear_dead(memcg))
> -		css_put(&memcg->css);
> -}
>  #else
>  static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
>  {
> @@ -4848,10 +4782,6 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
>  static void memcg_destroy_kmem(struct mem_cgroup *memcg)
>  {
>  }
> -
> -static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
> -{
> -}
>  #endif
>  
>  /*
> @@ -5443,8 +5373,6 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
>  	}
>  	spin_unlock(&memcg->event_list_lock);
>  
> -	kmem_cgroup_css_offline(memcg);
> -
>  	/*
>  	 * This requires that offlining is serialized.  Right now that is
>  	 * guaranteed because css_killed_work_fn() holds the cgroup_mutex.
> -- 
> 2.1.0
> 

-- 
Michal Hocko
SUSE Labs


* Re: [patch 3/3] mm: memcontrol: continue cache reclaim from offlined groups
  2014-09-20 20:00 ` [patch 3/3] mm: memcontrol: continue cache reclaim from offlined groups Johannes Weiner
  2014-09-22  8:32   ` Vladimir Davydov
@ 2014-10-08 14:03   ` Michal Hocko
  1 sibling, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2014-10-08 14:03 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Vladimir Davydov, Greg Thelen, Tejun Heo, cgroups,
	linux-kernel

On Sat 20-09-14 16:00:35, Johannes Weiner wrote:
> On cgroup deletion, outstanding page cache charges are moved to the
> parent group so that they're not lost and can be reclaimed during
> pressure on/inside said parent.  But this reparenting is fairly tricky
> and its synchronous nature has led to several lock-ups in the past.
> 
> Since css iterators now also include offlined css, memcg iterators can

This is since 3.16 and the closest match seems to be c2931b70a32c
(cgroup: iterate cgroup_subsys_states directly) right?

> be changed to include offlined children during reclaim of a group, and
> leftover cache can just stay put.
> 
> There is a slight change of behavior in that charges of deleted groups
> no longer show up as local charges in the parent.  But they are still
> included in the parent's hierarchical statistics.

This also addresses http://marc.info/?l=linux-mm&m=140027762114865&w=2
after force_empty is gone.

> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

I would prefer if the drain_stock code cleanup was in a separate patch
to make this one smaller and easier to review.

Anyway
Acked-by: Michal Hocko <mhocko@suse.cz>

> ---
>  mm/memcontrol.c | 260 ++------------------------------------------------------
>  1 file changed, 5 insertions(+), 255 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 019a44ac25d6..48531433a2fc 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -736,8 +736,6 @@ static void disarm_static_keys(struct mem_cgroup *memcg)
>  	disarm_kmem_keys(memcg);
>  }
>  
> -static void drain_all_stock_async(struct mem_cgroup *memcg);
> -
>  static struct mem_cgroup_per_zone *
>  mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct zone *zone)
>  {
> @@ -1208,7 +1206,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  				goto out_unlock;
>  			continue;
>  		}
> -		if (css == &root->css || css_tryget_online(css)) {
> +		if (css == &root->css || css_tryget(css)) {
>  			memcg = mem_cgroup_from_css(css);
>  			break;
>  		}
> @@ -2349,10 +2347,12 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>   * of the hierarchy under it. sync flag says whether we should block
>   * until the work is done.
>   */
> -static void drain_all_stock(struct mem_cgroup *root_memcg, bool sync)
> +static void drain_all_stock(struct mem_cgroup *root_memcg)
>  {
>  	int cpu, curcpu;
>  
> +	if (!mutex_trylock(&percpu_charge_mutex))
> +		return;
>  	/* Notify other cpus that system-wide "drain" is running */
>  	get_online_cpus();
>  	curcpu = get_cpu();
> @@ -2373,41 +2373,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg, bool sync)
>  		}
>  	}
>  	put_cpu();
> -
> -	if (!sync)
> -		goto out;
> -
> -	for_each_online_cpu(cpu) {
> -		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> -		if (test_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
> -			flush_work(&stock->work);
> -	}
> -out:
>  	put_online_cpus();
> -}
> -
> -/*
> - * Tries to drain stocked charges in other cpus. This function is asynchronous
> - * and just put a work per cpu for draining localy on each cpu. Caller can
> - * expects some charges will be back later but cannot wait for it.
> - */
> -static void drain_all_stock_async(struct mem_cgroup *root_memcg)
> -{
> -	/*
> -	 * If someone calls draining, avoid adding more kworker runs.
> -	 */
> -	if (!mutex_trylock(&percpu_charge_mutex))
> -		return;
> -	drain_all_stock(root_memcg, false);
> -	mutex_unlock(&percpu_charge_mutex);
> -}
> -
> -/* This is a synchronous drain interface. */
> -static void drain_all_stock_sync(struct mem_cgroup *root_memcg)
> -{
> -	/* called when force_empty is called */
> -	mutex_lock(&percpu_charge_mutex);
> -	drain_all_stock(root_memcg, true);
>  	mutex_unlock(&percpu_charge_mutex);
>  }
>  
> @@ -2515,7 +2481,7 @@ retry:
>  		goto retry;
>  
>  	if (!drained) {
> -		drain_all_stock_async(mem_over_limit);
> +		drain_all_stock(mem_over_limit);
>  		drained = true;
>  		goto retry;
>  	}
> @@ -3366,79 +3332,6 @@ out:
>  	return ret;
>  }
>  
> -/**
> - * mem_cgroup_move_parent - moves page to the parent group
> - * @page: the page to move
> - * @pc: page_cgroup of the page
> - * @child: page's cgroup
> - *
> - * move charges to its parent or the root cgroup if the group has no
> - * parent (aka use_hierarchy==0).
> - * Although this might fail (get_page_unless_zero, isolate_lru_page or
> - * mem_cgroup_move_account fails) the failure is always temporary and
> - * it signals a race with a page removal/uncharge or migration. In the
> - * first case the page is on the way out and it will vanish from the LRU
> - * on the next attempt and the call should be retried later.
> - * Isolation from the LRU fails only if page has been isolated from
> - * the LRU since we looked at it and that usually means either global
> - * reclaim or migration going on. The page will either get back to the
> - * LRU or vanish.
> - * Finaly mem_cgroup_move_account fails only if the page got uncharged
> - * (!PageCgroupUsed) or moved to a different group. The page will
> - * disappear in the next attempt.
> - */
> -static int mem_cgroup_move_parent(struct page *page,
> -				  struct page_cgroup *pc,
> -				  struct mem_cgroup *child)
> -{
> -	struct mem_cgroup *parent;
> -	unsigned int nr_pages;
> -	unsigned long uninitialized_var(flags);
> -	int ret;
> -
> -	VM_BUG_ON(mem_cgroup_is_root(child));
> -
> -	ret = -EBUSY;
> -	if (!get_page_unless_zero(page))
> -		goto out;
> -	if (isolate_lru_page(page))
> -		goto put;
> -
> -	nr_pages = hpage_nr_pages(page);
> -
> -	parent = parent_mem_cgroup(child);
> -	/*
> -	 * If no parent, move charges to root cgroup.
> -	 */
> -	if (!parent)
> -		parent = root_mem_cgroup;
> -
> -	if (nr_pages > 1) {
> -		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> -		flags = compound_lock_irqsave(page);
> -	}
> -
> -	ret = mem_cgroup_move_account(page, nr_pages,
> -				pc, child, parent);
> -	if (!ret) {
> -		if (!mem_cgroup_is_root(parent))
> -			css_get_many(&parent->css, nr_pages);
> -		/* Take charge off the local counters */
> -		page_counter_cancel(&child->memory, nr_pages);
> -		if (do_swap_account)
> -			page_counter_cancel(&child->memsw, nr_pages);
> -		css_put_many(&child->css, nr_pages);
> -	}
> -
> -	if (nr_pages > 1)
> -		compound_unlock_irqrestore(page, flags);
> -	putback_lru_page(page);
> -put:
> -	put_page(page);
> -out:
> -	return ret;
> -}
> -
>  #ifdef CONFIG_MEMCG_SWAP
>  static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
>  					 bool charge)
> @@ -3732,105 +3625,6 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>  	return nr_reclaimed;
>  }
>  
> -/**
> - * mem_cgroup_force_empty_list - clears LRU of a group
> - * @memcg: group to clear
> - * @node: NUMA node
> - * @zid: zone id
> - * @lru: lru to to clear
> - *
> - * Traverse a specified page_cgroup list and try to drop them all.  This doesn't
> - * reclaim the pages page themselves - pages are moved to the parent (or root)
> - * group.
> - */
> -static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
> -				int node, int zid, enum lru_list lru)
> -{
> -	struct lruvec *lruvec;
> -	unsigned long flags;
> -	struct list_head *list;
> -	struct page *busy;
> -	struct zone *zone;
> -
> -	zone = &NODE_DATA(node)->node_zones[zid];
> -	lruvec = mem_cgroup_zone_lruvec(zone, memcg);
> -	list = &lruvec->lists[lru];
> -
> -	busy = NULL;
> -	do {
> -		struct page_cgroup *pc;
> -		struct page *page;
> -
> -		spin_lock_irqsave(&zone->lru_lock, flags);
> -		if (list_empty(list)) {
> -			spin_unlock_irqrestore(&zone->lru_lock, flags);
> -			break;
> -		}
> -		page = list_entry(list->prev, struct page, lru);
> -		if (busy == page) {
> -			list_move(&page->lru, list);
> -			busy = NULL;
> -			spin_unlock_irqrestore(&zone->lru_lock, flags);
> -			continue;
> -		}
> -		spin_unlock_irqrestore(&zone->lru_lock, flags);
> -
> -		pc = lookup_page_cgroup(page);
> -
> -		if (mem_cgroup_move_parent(page, pc, memcg)) {
> -			/* found lock contention or "pc" is obsolete. */
> -			busy = page;
> -		} else
> -			busy = NULL;
> -		cond_resched();
> -	} while (!list_empty(list));
> -}
> -
> -/*
> - * make mem_cgroup's charge to be 0 if there is no task by moving
> - * all the charges and pages to the parent.
> - * This enables deleting this mem_cgroup.
> - *
> - * Caller is responsible for holding css reference on the memcg.
> - */
> -static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
> -{
> -	int node, zid;
> -
> -	do {
> -		/* This is for making all *used* pages to be on LRU. */
> -		lru_add_drain_all();
> -		drain_all_stock_sync(memcg);
> -		mem_cgroup_start_move(memcg);
> -		for_each_node_state(node, N_MEMORY) {
> -			for (zid = 0; zid < MAX_NR_ZONES; zid++) {
> -				enum lru_list lru;
> -				for_each_lru(lru) {
> -					mem_cgroup_force_empty_list(memcg,
> -							node, zid, lru);
> -				}
> -			}
> -		}
> -		mem_cgroup_end_move(memcg);
> -		memcg_oom_recover(memcg);
> -		cond_resched();
> -
> -		/*
> -		 * Kernel memory may not necessarily be trackable to a specific
> -		 * process. So they are not migrated, and therefore we can't
> -		 * expect their value to drop to 0 here.
> -		 * Having res filled up with kmem only is enough.
> -		 *
> -		 * This is a safety check because mem_cgroup_force_empty_list
> -		 * could have raced with mem_cgroup_replace_page_cache callers
> -		 * so the lru seemed empty but the page could have been added
> -		 * right after the check. RES_USAGE should be safe as we always
> -		 * charge before adding to the LRU.
> -		 */
> -	} while (atomic_long_read(&memcg->memory.count) -
> -		 atomic_long_read(&memcg->kmem.count) > 0);
> -}
> -
>  /*
>   * Test whether @memcg has children, dead or alive.  Note that this
>   * function doesn't care whether @memcg has use_hierarchy enabled and
> @@ -5359,7 +5153,6 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
>  	struct mem_cgroup_event *event, *tmp;
> -	struct cgroup_subsys_state *iter;
>  
>  	/*
>  	 * Unregister events and notify userspace.
> @@ -5373,13 +5166,6 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
>  	}
>  	spin_unlock(&memcg->event_list_lock);
>  
> -	/*
> -	 * This requires that offlining is serialized.  Right now that is
> -	 * guaranteed because css_killed_work_fn() holds the cgroup_mutex.
> -	 */
> -	css_for_each_descendant_post(iter, css)
> -		mem_cgroup_reparent_charges(mem_cgroup_from_css(iter));
> -
>  	memcg_unregister_all_caches(memcg);
>  	vmpressure_cleanup(&memcg->vmpressure);
>  }
> @@ -5387,42 +5173,6 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
>  static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> -	/*
> -	 * XXX: css_offline() would be where we should reparent all
> -	 * memory to prepare the cgroup for destruction.  However,
> -	 * memcg does not do css_tryget_online() and page_counter charging
> -	 * under the same RCU lock region, which means that charging
> -	 * could race with offlining.  Offlining only happens to
> -	 * cgroups with no tasks in them but charges can show up
> -	 * without any tasks from the swapin path when the target
> -	 * memcg is looked up from the swapout record and not from the
> -	 * current task as it usually is.  A race like this can leak
> -	 * charges and put pages with stale cgroup pointers into
> -	 * circulation:
> -	 *
> -	 * #0                        #1
> -	 *                           lookup_swap_cgroup_id()
> -	 *                           rcu_read_lock()
> -	 *                           mem_cgroup_lookup()
> -	 *                           css_tryget_online()
> -	 *                           rcu_read_unlock()
> -	 * disable css_tryget_online()
> -	 * call_rcu()
> -	 *   offline_css()
> -	 *     reparent_charges()
> -	 *                           page_counter_charge()
> -	 *                           css_put()
> -	 *                             css_free()
> -	 *                           pc->mem_cgroup = dead memcg
> -	 *                           add page to lru
> -	 *
> -	 * The bulk of the charges are still moved in offline_css() to
> -	 * avoid pinning a lot of pages in case a long-term reference
> -	 * like a swapout record is deferring the css_free() to long
> -	 * after offlining.  But this makes sure we catch any charges
> -	 * made after offlining:
> -	 */
> -	mem_cgroup_reparent_charges(memcg);
>  
>  	memcg_destroy_kmem(memcg);
>  	__mem_cgroup_free(memcg);
> -- 
> 2.1.0
> 

-- 
Michal Hocko
SUSE Labs


* Re: [patch 0/3] mm: memcontrol: eliminate charge reparenting
  2014-10-08 12:48 ` Michal Hocko
@ 2014-10-08 14:17   ` Johannes Weiner
  0 siblings, 0 replies; 14+ messages in thread
From: Johannes Weiner @ 2014-10-08 14:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Vladimir Davydov, Greg Thelen, Tejun Heo, cgroups,
	linux-kernel

On Wed, Oct 08, 2014 at 02:48:23PM +0200, Michal Hocko wrote:
> On Sat 20-09-14 16:00:32, Johannes Weiner wrote:
> > Hi,
> > 
> > we've come a looong way when it comes to the basic cgroups model, and
> > the recent changes there open up a lot of opportunity to make drastic
> > simplifications to memory cgroups as well.
> > 
> > The decoupling of css from the user-visible cgroup, word-sized per-cpu
> > css reference counters, and css iterators that include offlined groups
> > means we can take per-charge css references, continue to reclaim from
> > offlined groups, and so get rid of the error-prone charge reparenting.
> > 
> > Combined with the higher-order reclaim fixes, lockless page counters,
> > and memcg iterator simplification I sent on Friday, the memory cgroup
> > core code is finally no longer the biggest file in mm/.  Yay!
> 
> Yeah, the code reduction (as per the diffstat - I didn't get to the code
> yet) seems really promising.

:)

> > These patches are based on mmotm + the above-mentioned changes
> 
> > + Tj's percpu-refcount conversion to atomic_long_t.
> 
> This is https://lkml.org/lkml/2014/9/20/11 right?

Yep, exactly.  All these moving parts are now in -next, though, so as
soon as Andrew flushes his tree for 3.18, I'll rebase and resubmit.

Thanks!

