* [RFC] memcg: some charge path cleanups + css offline vs. charge race fix
@ 2013-12-17 15:45 Michal Hocko
  2013-12-17 15:45 ` [RFC 1/5] memcg: cleanup charge routines Michal Hocko
                   ` (4 more replies)
  0 siblings, 5 replies; 14+ messages in thread
From: Michal Hocko @ 2013-12-17 15:45 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, LKML, linux-mm

Hi,
the first three patches are an attempt to clean up the memcg charging path
a bit. I am already fed up with all the different combinations of mm vs.
memcgp parameters, so I have split the function into two parts:
	* charge an mm
	* charge a known memcg
More details are in patch 1. I think this makes more sense.
It was also quite surprising that the mere code reordering, without any
functional changes, made the code smaller by 600B.

The second patch is just a trivial follow-up and shouldn't be controversial.

The third one removes an exception (bypass) path which has been there from
the early days but never made any sense to me. It has always confused me,
so I would be more than happy to ditch it.

Finally, patch #4 addresses the memcg charge vs. css offline race, and
patch #5 reverts the workaround which has been merged as first aid.

What do you think?

Based on the current mmotm (mmotm-2013-12-16-14-29-6)
Michal Hocko (5):
      memcg: cleanup charge routines
      memcg: move stock charge into __mem_cgroup_try_charge_memcg
      memcg: mm == NULL is not allowed for mem_cgroup_try_charge_mm
      memcg: make sure that memcg is not offline when charging
      Revert "mm: memcg: fix race condition between memcg teardown and swapin"

Diffstat:
 mm/memcontrol.c | 361 +++++++++++++++++++++++++++++---------------------------
 1 file changed, 185 insertions(+), 176 deletions(-)


* [RFC 1/5] memcg: cleanup charge routines
  2013-12-17 15:45 [RFC] memcg: some charge path cleanups + css offline vs. charge race fix Michal Hocko
@ 2013-12-17 15:45 ` Michal Hocko
  2014-01-30 17:18   ` Johannes Weiner
  2013-12-17 15:45 ` [RFC 2/5] memcg: move stock charge into __mem_cgroup_try_charge_memcg Michal Hocko
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2013-12-17 15:45 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, LKML, linux-mm

The current core of memcg charging is wild, to say the least.
__mem_cgroup_try_charge, which sits at the center, tries to be too clever
and handles two independent cases:
	* when the memcg to be charged is known in advance
	* when the given mm_struct is charged
The resulting callchains are quite complex:

memcg_charge_kmem(mm=NULL, memcg)  mem_cgroup_newpage_charge(mm)
 |                                | _________________________________________ mem_cgroup_cache_charge(current->mm)
 |                                |/                                            |
 | ______________________________ mem_cgroup_charge_common(mm, memcg=NULL)      |
 |/                                                                             /
 |                                                                             /
 | ____________________________ mem_cgroup_try_charge_swapin(mm, memcg=NULL)  /
 |/                               | _________________________________________/
 |                                |/
 |                                |                         /* swap accounting */   /* no swap accounting */
 | _____________________________  __mem_cgroup_try_charge_swapin(mm=NULL, memcg) || (mm, memcg=NULL)
 |/
 | ____________________________ mem_cgroup_do_precharge(mm=NULL, memcg)
 |/
__mem_cgroup_try_charge
  mem_cgroup_do_charge
    res_counter_charge
    mem_cgroup_reclaim
    mem_cgroup_wait_acct_move
    mem_cgroup_oom

This patch splits __mem_cgroup_try_charge into two logical parts:
mem_cgroup_try_charge_mm, which is responsible for charges against the
given mm_struct and returns the charged memcg (or NULL under OOM), and
mem_cgroup_try_charge_memcg, which charges a known memcg and returns an
error code.
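
For illustration, the two resulting entry points look roughly like this
(just a sketch; the full signatures and implementation are in the diff
below):

	/* charge a known memcg; the caller has to hold a css reference */
	static int mem_cgroup_try_charge_memcg(gfp_t gfp_mask,
					       unsigned int nr_pages,
					       struct mem_cgroup *memcg,
					       bool oom);

	/*
	 * charge the memcg associated with the given mm; returns the charged
	 * memcg, root_mem_cgroup when the charge is bypassed or NULL under OOM
	 */
	static struct mem_cgroup *mem_cgroup_try_charge_mm(struct mm_struct *mm,
							   gfp_t gfp_mask,
							   unsigned int nr_pages,
							   bool oom);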

The only remaining tricky part is __mem_cgroup_try_charge_swapin because
it can return 0 when PageCgroupUsed is already set, in which case we do
not want to commit the charge. This is signalled by the magic combination
of memcg = NULL and ret = 0: the function keeps its memcgp parameter and
sets the given memcg to NULL when it sees PageCgroupUsed
(__mem_cgroup_commit_charge_swapin then ignores such a commit).

Not only is the code easier to follow, the change reduces the code size
too:
$ size mm/built-in.o.before
   text	   data	    bss	    dec	    hex	filename
 457463	  83162	  49824	 590449	  90271	mm/built-in.o.before

$ size mm/built-in.o.after
   text	   data	    bss	    dec	    hex	filename
 456794	  83162	  49824	 589780	  8ffd4	mm/built-in.o.after

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c | 256 +++++++++++++++++++++++++++++---------------------------
 1 file changed, 134 insertions(+), 122 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0ded63f1cc1e..509bb59f4744 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2581,7 +2581,7 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
 }
 
 
-/* See __mem_cgroup_try_charge() for details */
+/* See mem_cgroup_do_charge() for details */
 enum {
 	CHARGE_OK,		/* success */
 	CHARGE_RETRY,		/* need to retry but retry is not bad */
@@ -2655,37 +2655,68 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 }
 
 /*
- * __mem_cgroup_try_charge() does
- * 1. detect memcg to be charged against from passed *mm and *ptr,
- * 2. update res_counter
- * 3. call memory reclaim if necessary.
+ * __mem_cgroup_try_charge_memcg - core of the memcg charging code. The caller
+ * keeps a css reference to the given memcg. We do not charge root_mem_cgroup.
+ * OOM is triggered only if allowed by the given oom parameter (except for
+ * __GFP_NOFAIL when it is ignored).
  *
- * In some special case, if the task is fatal, fatal_signal_pending() or
- * has TIF_MEMDIE, this function returns -EINTR while writing root_mem_cgroup
- * to *ptr. There are two reasons for this. 1: fatal threads should quit as soon
- * as possible without any hazards. 2: all pages should have a valid
- * pc->mem_cgroup. If mm is NULL and the caller doesn't pass a valid memcg
- * pointer, that is treated as a charge to root_mem_cgroup.
- *
- * So __mem_cgroup_try_charge() will return
- *  0       ...  on success, filling *ptr with a valid memcg pointer.
- *  -ENOMEM ...  charge failure because of resource limits.
- *  -EINTR  ...  if thread is fatal. *ptr is filled with root_mem_cgroup.
- *
- * Unlike the exported interface, an "oom" parameter is added. if oom==true,
- * the oom-killer can be invoked.
+ * Returns 0 on success, -ENOMEM when the given memcg is under OOM and -EINTR
+ * when the charge is bypassed (either when fatal signals are pending or
+ * __GFP_NOFAIL allocation cannot be charged).
  */
-static int __mem_cgroup_try_charge(struct mm_struct *mm,
-				   gfp_t gfp_mask,
+static int __mem_cgroup_try_charge_memcg(gfp_t gfp_mask,
 				   unsigned int nr_pages,
-				   struct mem_cgroup **ptr,
+				   struct mem_cgroup *memcg,
 				   bool oom)
 {
 	unsigned int batch = max(CHARGE_BATCH, nr_pages);
 	int nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
-	struct mem_cgroup *memcg = NULL;
 	int ret;
 
+	VM_BUG_ON(!memcg || memcg == root_mem_cgroup);
+
+	if (unlikely(task_in_memcg_oom(current)))
+		goto nomem;
+
+	if (gfp_mask & __GFP_NOFAIL)
+		oom = false;
+
+	do {
+		bool invoke_oom = oom && !nr_oom_retries;
+
+		/* If killed, bypass charge */
+		if (fatal_signal_pending(current))
+			goto bypass;
+
+		ret = mem_cgroup_do_charge(memcg, gfp_mask, batch,
+					   nr_pages, invoke_oom);
+		switch (ret) {
+		case CHARGE_RETRY: /* not in OOM situation but retry */
+			batch = nr_pages;
+			break;
+		case CHARGE_WOULDBLOCK: /* !__GFP_WAIT */
+			goto nomem;
+		case CHARGE_NOMEM: /* OOM routine works */
+			if (!oom || invoke_oom)
+				goto nomem;
+			nr_oom_retries--;
+			break;
+		}
+	} while (ret != CHARGE_OK);
+
+	if (batch > nr_pages)
+		refill_stock(memcg, batch - nr_pages);
+
+	return 0;
+nomem:
+	if (!(gfp_mask & __GFP_NOFAIL))
+		return -ENOMEM;
+bypass:
+	return -EINTR;
+}
+
+static bool mem_cgroup_bypass_charge(void)
+{
 	/*
 	 * Unlike gloval-vm's OOM-kill, we're not in memory shortage
 	 * in system level. So, allow to go ahead dying process in addition to
@@ -2693,13 +2724,23 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 	 */
 	if (unlikely(test_thread_flag(TIF_MEMDIE)
 		     || fatal_signal_pending(current)))
-		goto bypass;
+		return true;
 
-	if (unlikely(task_in_memcg_oom(current)))
-		goto nomem;
+	return false;
+}
 
-	if (gfp_mask & __GFP_NOFAIL)
-		oom = false;
+/*
+ * Charges and returns memcg associated with the given mm (or root_mem_cgroup
+ * if mm is NULL). Returns NULL if memcg is under OOM.
+ */
+static struct mem_cgroup *mem_cgroup_try_charge_mm(struct mm_struct *mm,
+				   gfp_t gfp_mask,
+				   unsigned int nr_pages,
+				   bool oom)
+{
+	struct mem_cgroup *memcg;
+	struct task_struct *p;
+	int ret;
 
 	/*
 	 * We always charge the cgroup the mm_struct belongs to.
@@ -2707,18 +2748,12 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 	 * thread group leader migrates. It's possible that mm is not
 	 * set, if so charge the root memcg (happens for pagecache usage).
 	 */
-	if (!*ptr && !mm)
-		*ptr = root_mem_cgroup;
-again:
-	if (*ptr) { /* css should be a valid one */
-		memcg = *ptr;
-		if (mem_cgroup_is_root(memcg))
-			goto done;
-		if (consume_stock(memcg, nr_pages))
-			goto done;
-		css_get(&memcg->css);
-	} else {
-		struct task_struct *p;
+	if (!mm)
+		goto bypass;
+
+	do {
+		if (mem_cgroup_bypass_charge())
+			goto bypass;
 
 		rcu_read_lock();
 		p = rcu_dereference(mm->owner);
@@ -2733,11 +2768,9 @@ again:
 		 * task-struct. So, mm->owner can be NULL.
 		 */
 		memcg = mem_cgroup_from_task(p);
-		if (!memcg)
-			memcg = root_mem_cgroup;
-		if (mem_cgroup_is_root(memcg)) {
+		if (!memcg || mem_cgroup_is_root(memcg)) {
 			rcu_read_unlock();
-			goto done;
+			goto bypass;
 		}
 		if (consume_stock(memcg, nr_pages)) {
 			/*
@@ -2752,59 +2785,37 @@ again:
 			goto done;
 		}
 		/* after here, we may be blocked. we need to get refcnt */
-		if (!css_tryget(&memcg->css)) {
-			rcu_read_unlock();
-			goto again;
-		}
-		rcu_read_unlock();
-	}
-
-	do {
-		bool invoke_oom = oom && !nr_oom_retries;
-
-		/* If killed, bypass charge */
-		if (fatal_signal_pending(current)) {
-			css_put(&memcg->css);
-			goto bypass;
-		}
-
-		ret = mem_cgroup_do_charge(memcg, gfp_mask, batch,
-					   nr_pages, invoke_oom);
-		switch (ret) {
-		case CHARGE_OK:
-			break;
-		case CHARGE_RETRY: /* not in OOM situation but retry */
-			batch = nr_pages;
-			css_put(&memcg->css);
-			memcg = NULL;
-			goto again;
-		case CHARGE_WOULDBLOCK: /* !__GFP_WAIT */
-			css_put(&memcg->css);
-			goto nomem;
-		case CHARGE_NOMEM: /* OOM routine works */
-			if (!oom || invoke_oom) {
-				css_put(&memcg->css);
-				goto nomem;
-			}
-			nr_oom_retries--;
-			break;
-		}
-	} while (ret != CHARGE_OK);
+	} while(!css_tryget(&memcg->css));
+	rcu_read_unlock();
 
-	if (batch > nr_pages)
-		refill_stock(memcg, batch - nr_pages);
+	ret = __mem_cgroup_try_charge_memcg(gfp_mask, nr_pages, memcg, oom);
 	css_put(&memcg->css);
+	if (ret == -EINTR)
+		goto bypass;
+	else if (ret == -ENOMEM)
+		memcg = NULL;
 done:
-	*ptr = memcg;
-	return 0;
-nomem:
-	if (!(gfp_mask & __GFP_NOFAIL)) {
-		*ptr = NULL;
-		return -ENOMEM;
-	}
+	return memcg;
 bypass:
-	*ptr = root_mem_cgroup;
-	return -EINTR;
+	return root_mem_cgroup;
+}
+
+/*
+ * charge the given memcg. The caller has to hold a css reference for
+ * the given memcg.
+ */
+static int mem_cgroup_try_charge_memcg(gfp_t gfp_mask,
+				   unsigned int nr_pages,
+				   struct mem_cgroup *memcg,
+				   bool oom)
+{
+	if (mem_cgroup_is_root(memcg) || mem_cgroup_bypass_charge())
+		return -EINTR;
+
+	if (consume_stock(memcg, nr_pages))
+		return 0;
+
+	return __mem_cgroup_try_charge_memcg(gfp_mask, nr_pages, memcg, oom);
 }
 
 /*
@@ -3002,21 +3013,19 @@ static int mem_cgroup_slabinfo_read(struct cgroup_subsys_state *css,
 static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
 {
 	struct res_counter *fail_res;
-	struct mem_cgroup *_memcg;
 	int ret = 0;
 
 	ret = res_counter_charge(&memcg->kmem, size, &fail_res);
 	if (ret)
 		return ret;
 
-	_memcg = memcg;
-	ret = __mem_cgroup_try_charge(NULL, gfp, size >> PAGE_SHIFT,
-				      &_memcg, oom_gfp_allowed(gfp));
+	ret = mem_cgroup_try_charge_memcg(gfp, size >> PAGE_SHIFT,
+				      memcg, oom_gfp_allowed(gfp));
 
 	if (ret == -EINTR)  {
 		/*
-		 * __mem_cgroup_try_charge() chosed to bypass to root due to
-		 * OOM kill or fatal signal.  Since our only options are to
+		 * __mem_cgroup_try_charge_memcg() chosed to bypass to root due
+		 * to OOM kill or fatal signal.  Since our only options are to
 		 * either fail the allocation or charge it to this cgroup, do
 		 * it as a temporary condition. But we can't fail. From a
 		 * kmem/slab perspective, the cache has already been selected,
@@ -3025,7 +3034,7 @@ static int memcg_charge_kmem(struct mem_cgroup *memcg, gfp_t gfp, u64 size)
 		 *
 		 * This condition will only trigger if the task entered
 		 * memcg_charge_kmem in a sane state, but was OOM-killed during
-		 * __mem_cgroup_try_charge() above. Tasks that were already
+		 * __mem_cgroup_try_charge_memcg() above. Tasks that were already
 		 * dying when the allocation triggers should have been already
 		 * directed to the root cgroup in memcontrol.h
 		 */
@@ -3946,10 +3955,9 @@ out:
 static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask, enum charge_type ctype)
 {
-	struct mem_cgroup *memcg = NULL;
+	struct mem_cgroup *memcg;
 	unsigned int nr_pages = 1;
 	bool oom = true;
-	int ret;
 
 	if (PageTransHuge(page)) {
 		nr_pages <<= compound_order(page);
@@ -3961,9 +3969,9 @@ static int mem_cgroup_charge_common(struct page *page, struct mm_struct *mm,
 		oom = false;
 	}
 
-	ret = __mem_cgroup_try_charge(mm, gfp_mask, nr_pages, &memcg, oom);
-	if (ret == -ENOMEM)
-		return ret;
+	memcg = mem_cgroup_try_charge_mm(mm, gfp_mask, nr_pages, oom);
+	if (!memcg)
+		return -ENOMEM;
 	__mem_cgroup_commit_charge(memcg, page, nr_pages, ctype, false);
 	return 0;
 }
@@ -3987,8 +3995,7 @@ int mem_cgroup_newpage_charge(struct page *page,
  * "commit()" or removed by "cancel()"
  */
 static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm,
-					  struct page *page,
-					  gfp_t mask,
+					  struct page *page, gfp_t mask,
 					  struct mem_cgroup **memcgp)
 {
 	struct mem_cgroup *memcg;
@@ -4002,31 +4009,36 @@ static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 	 * already charged pages, too.  The USED bit is protected by
 	 * the page lock, which serializes swap cache removal, which
 	 * in turn serializes uncharging.
+	 * Have to set memcg to NULL so that __mem_cgroup_commit_charge_swapin
+	 * will ignore such a page.
 	 */
-	if (PageCgroupUsed(pc))
+	if (PageCgroupUsed(pc)) {
+		*memcgp = NULL;
 		return 0;
+	}
 	if (!do_swap_account)
 		goto charge_cur_mm;
 	memcg = try_get_mem_cgroup_from_page(page);
 	if (!memcg)
 		goto charge_cur_mm;
 	*memcgp = memcg;
-	ret = __mem_cgroup_try_charge(NULL, mask, 1, memcgp, true);
+	ret = mem_cgroup_try_charge_memcg(mask, 1, memcg, true);
 	css_put(&memcg->css);
-	if (ret == -EINTR)
+	if (ret == -EINTR) {
+		*memcgp = root_mem_cgroup;
 		ret = 0;
+	}
 	return ret;
 charge_cur_mm:
-	ret = __mem_cgroup_try_charge(mm, mask, 1, memcgp, true);
-	if (ret == -EINTR)
-		ret = 0;
-	return ret;
+	*memcgp = mem_cgroup_try_charge_mm(mm, mask, 1, true);
+	if (!*memcgp)
+		return -ENOMEM;
+	return 0;
 }
 
 int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page,
 				 gfp_t gfp_mask, struct mem_cgroup **memcgp)
 {
-	*memcgp = NULL;
 	if (mem_cgroup_disabled())
 		return 0;
 	/*
@@ -4036,13 +4048,14 @@ int mem_cgroup_try_charge_swapin(struct mm_struct *mm, struct page *page,
 	 * there's also a KSM case which does need to charge the page.
 	 */
 	if (!PageSwapCache(page)) {
-		int ret;
+		int ret = 0;
 
-		ret = __mem_cgroup_try_charge(mm, gfp_mask, 1, memcgp, true);
-		if (ret == -EINTR)
-			ret = 0;
+		*memcgp = mem_cgroup_try_charge_mm(mm, gfp_mask, 1, true);
+		if (!*memcgp)
+			ret = -ENOMEM;
 		return ret;
 	}
+
 	return __mem_cgroup_try_charge_swapin(mm, page, gfp_mask, memcgp);
 }
 
@@ -4088,7 +4101,6 @@ void mem_cgroup_commit_charge_swapin(struct page *page,
 int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask)
 {
-	struct mem_cgroup *memcg = NULL;
 	enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE;
 	int ret;
 
@@ -4100,6 +4112,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 	if (!PageSwapCache(page))
 		ret = mem_cgroup_charge_common(page, mm, gfp_mask, type);
 	else { /* page is swapcache/shmem */
+		struct mem_cgroup *memcg;
 		ret = __mem_cgroup_try_charge_swapin(mm, page,
 						     gfp_mask, &memcg);
 		if (!ret)
@@ -6442,8 +6455,7 @@ one_by_one:
 			batch_count = PRECHARGE_COUNT_AT_ONCE;
 			cond_resched();
 		}
-		ret = __mem_cgroup_try_charge(NULL,
-					GFP_KERNEL, 1, &memcg, false);
+		ret = mem_cgroup_try_charge_memcg(GFP_KERNEL, 1, memcg, false);
 		if (ret)
 			/* mem_cgroup_clear_mc() will do uncharge later */
 			return ret;
-- 
1.8.4.4



* [RFC 2/5] memcg: move stock charge into __mem_cgroup_try_charge_memcg
  2013-12-17 15:45 [RFC] memcg: some charge path cleanups + css offline vs. charge race fix Michal Hocko
  2013-12-17 15:45 ` [RFC 1/5] memcg: cleanup charge routines Michal Hocko
@ 2013-12-17 15:45 ` Michal Hocko
  2013-12-17 15:45 ` [RFC 3/5] memcg: mm == NULL is not allowed for mem_cgroup_try_charge_mm Michal Hocko
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2013-12-17 15:45 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, LKML, linux-mm

because both mem_cgroup_try_charge_mm and mem_cgroup_try_charge_memcg
do the same thing. mem_cgroup_try_charge_mm tries to save one css_tryget
because it relies on the fact that the stock consumption disables
preemption while checking the memcg, so it either sees an alive memcg or
NULL.
The css_tryget doesn't seem to be a bottleneck anymore (after
per-cpu reference counting has been merged), so let's make the
code simpler and easier to understand and move consume_stock into
__mem_cgroup_try_charge_memcg where it logically belongs.
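
The net effect is that the per-cpu stock is now consulted right at the top
of the core charging function (a condensed sketch; see the diff below):

	static int __mem_cgroup_try_charge_memcg(gfp_t gfp_mask,
						 unsigned int nr_pages,
						 struct mem_cgroup *memcg,
						 bool oom)
	{
		[...]
		if (consume_stock(memcg, nr_pages))
			return 0;	/* charge satisfied from the per-cpu stock */
		[...]
	}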

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c | 20 ++++----------------
 1 file changed, 4 insertions(+), 16 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 509bb59f4744..3f01dc9aa101 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2678,6 +2678,9 @@ static int __mem_cgroup_try_charge_memcg(gfp_t gfp_mask,
 	if (unlikely(task_in_memcg_oom(current)))
 		goto nomem;
 
+	if (consume_stock(memcg, nr_pages))
+		return 0;
+
 	if (gfp_mask & __GFP_NOFAIL)
 		oom = false;
 
@@ -2772,18 +2775,6 @@ static struct mem_cgroup *mem_cgroup_try_charge_mm(struct mm_struct *mm,
 			rcu_read_unlock();
 			goto bypass;
 		}
-		if (consume_stock(memcg, nr_pages)) {
-			/*
-			 * It seems dagerous to access memcg without css_get().
-			 * But considering how consume_stok works, it's not
-			 * necessary. If consume_stock success, some charges
-			 * from this memcg are cached on this cpu. So, we
-			 * don't need to call css_get()/css_tryget() before
-			 * calling consume_stock().
-			 */
-			rcu_read_unlock();
-			goto done;
-		}
 		/* after here, we may be blocked. we need to get refcnt */
 	} while(!css_tryget(&memcg->css));
 	rcu_read_unlock();
@@ -2794,7 +2785,7 @@ static struct mem_cgroup *mem_cgroup_try_charge_mm(struct mm_struct *mm,
 		goto bypass;
 	else if (ret == -ENOMEM)
 		memcg = NULL;
-done:
+
 	return memcg;
 bypass:
 	return root_mem_cgroup;
@@ -2812,9 +2803,6 @@ static int mem_cgroup_try_charge_memcg(gfp_t gfp_mask,
 	if (mem_cgroup_is_root(memcg) || mem_cgroup_bypass_charge())
 		return -EINTR;
 
-	if (consume_stock(memcg, nr_pages))
-		return 0;
-
 	return __mem_cgroup_try_charge_memcg(gfp_mask, nr_pages, memcg, oom);
 }
 
-- 
1.8.4.4



* [RFC 3/5] memcg: mm == NULL is not allowed for mem_cgroup_try_charge_mm
  2013-12-17 15:45 [RFC] memcg: some charge path cleanups + css offline vs. charge race fix Michal Hocko
  2013-12-17 15:45 ` [RFC 1/5] memcg: cleanup charge routines Michal Hocko
  2013-12-17 15:45 ` [RFC 2/5] memcg: move stock charge into __mem_cgroup_try_charge_memcg Michal Hocko
@ 2013-12-17 15:45 ` Michal Hocko
  2013-12-17 15:45 ` [RFC 4/5] memcg: make sure that memcg is not offline when charging Michal Hocko
  2013-12-17 15:45 ` [RFC 5/5] Revert "mm: memcg: fix race condition between memcg teardown and swapin" Michal Hocko
  4 siblings, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2013-12-17 15:45 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, LKML, linux-mm

An ancient comment tries to explain that the given mm might be NULL when
a task is migrated. It was introduced by 8a9f3ccd (Memory controller:
memory accounting) along with other, bigger changes, so it is not any more
specific about the conditions.

Anyway, even if the task is migrated to another memcg, there is no way we
can see a NULL mm_struct. So either this was not correct from the very
beginning or it is not true anymore.
The only remaining case would be seeing charges after exit_mm but that
would be a bug on its own as the task doesn't have an address space
anymore.

This patch replaces the check with a VM_BUG_ON to make it obvious that we
really expect a non-NULL mm_struct.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3f01dc9aa101..a122fde6cd54 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2745,14 +2745,7 @@ static struct mem_cgroup *mem_cgroup_try_charge_mm(struct mm_struct *mm,
 	struct task_struct *p;
 	int ret;
 
-	/*
-	 * We always charge the cgroup the mm_struct belongs to.
-	 * The mm_struct's mem_cgroup changes on task migration if the
-	 * thread group leader migrates. It's possible that mm is not
-	 * set, if so charge the root memcg (happens for pagecache usage).
-	 */
-	if (!mm)
-		goto bypass;
+	VM_BUG_ON(!mm);
 
 	do {
 		if (mem_cgroup_bypass_charge())
-- 
1.8.4.4



* [RFC 4/5] memcg: make sure that memcg is not offline when charging
  2013-12-17 15:45 [RFC] memcg: some charge path cleanups + css offline vs. charge race fix Michal Hocko
                   ` (2 preceding siblings ...)
  2013-12-17 15:45 ` [RFC 3/5] memcg: mm == NULL is not allowed for mem_cgroup_try_charge_mm Michal Hocko
@ 2013-12-17 15:45 ` Michal Hocko
  2014-01-30 17:29   ` Johannes Weiner
  2013-12-17 15:45 ` [RFC 5/5] Revert "mm: memcg: fix race condition between memcg teardown and swapin" Michal Hocko
  4 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2013-12-17 15:45 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, LKML, linux-mm

The current charge path might race with memcg offlining because holding a
css reference doesn't stop css offline. As a result the res counter might
be charged after mem_cgroup_reparent_charges (called from the memcg
css_offline callback) and the charge would never be freed. This has been
worked around by 96f1c58d8534 (mm: memcg: fix race condition between memcg
teardown and swapin), which tries to catch such leaked charges later
during css_free. In the long term, though, it is better to close this race
properly.

In order to make this raceless we would need to hold rcu_read_lock from
css_tryget until res_counter_charge. Unfortunately this is not so easy
because mem_cgroup_do_charge might sleep, so we would need to drop the rcu
lock and redo the css_tryget tricks after each reclaim.

This patch addresses the issue by introducing a memcg->offline flag which
is set from the mem_cgroup_css_offline callback before the pages are
reparented. mem_cgroup_do_charge checks the flag, inside an rcu read
section, before the res_counter is charged. mem_cgroup_css_offline uses
synchronize_rcu to let all preceding chargers finish, while all new ones
will already see the group offline and back out.

Callers are then updated to retry with a new memcg, falling back to
mem_cgroup_from_task(current).

The only exception is mem_cgroup_do_precharge, which should never see this
race because it is called from the cgroup {can_}attach callbacks and so
the whole cgroup cannot go away.
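
In a nutshell the scheme looks as follows (a condensed sketch only; the
diff below has the real code and error handling):

	/* charge side (mem_cgroup_do_charge) */
	rcu_read_lock();
	if (memcg->offline) {
		rcu_read_unlock();
		return CHARGE_OFFLINE;	/* caller falls back to another memcg */
	}
	ret = res_counter_charge(&memcg->res, csize, &fail_res);
	[...]
	rcu_read_unlock();

	/* offline side (mem_cgroup_css_offline) */
	memcg->offline = true;
	synchronize_rcu();	/* wait for chargers already past the flag check */
	/* reparenting of the remaining charges follows */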

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 55 insertions(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a122fde6cd54..2904b2a6805a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -324,6 +324,9 @@ struct mem_cgroup {
 	int kmemcg_id;
 #endif
 
+	/* Is memcg marked for offlining? */
+	bool 		offline;
+
 	int last_scanned_node;
 #if MAX_NUMNODES > 1
 	nodemask_t	scan_nodes;
@@ -2587,6 +2590,9 @@ enum {
 	CHARGE_RETRY,		/* need to retry but retry is not bad */
 	CHARGE_NOMEM,		/* we can't do more. return -ENOMEM */
 	CHARGE_WOULDBLOCK,	/* GFP_WAIT wasn't set and no enough res. */
+	CHARGE_OFFLINE,		/* memcg is offline already so no further
+				 * charges are allowed
+				 */
 };
 
 static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
@@ -2599,20 +2605,36 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	unsigned long flags = 0;
 	int ret;
 
+	/*
+	 * Although the holder keeps a css reference for memcg this doesn't
+	 * prevent it from being offlined in the meantime. We have to make sure
+	 * that res counter is charged before css_offline reparents its pages
+	 * otherwise the charge might leak.
+	 */
+	rcu_read_lock();
+	if (memcg->offline) {
+		rcu_read_unlock();
+		return CHARGE_OFFLINE;
+	}
 	ret = res_counter_charge(&memcg->res, csize, &fail_res);
-
 	if (likely(!ret)) {
-		if (!do_swap_account)
+		if (!do_swap_account) {
+			rcu_read_unlock();
 			return CHARGE_OK;
+		}
 		ret = res_counter_charge(&memcg->memsw, csize, &fail_res);
-		if (likely(!ret))
+		if (likely(!ret)) {
+			rcu_read_unlock();
 			return CHARGE_OK;
+		}
 
 		res_counter_uncharge(&memcg->res, csize);
 		mem_over_limit = mem_cgroup_from_res_counter(fail_res, memsw);
 		flags |= MEM_CGROUP_RECLAIM_NOSWAP;
 	} else
 		mem_over_limit = mem_cgroup_from_res_counter(fail_res, res);
+	rcu_read_unlock();
+
 	/*
 	 * Never reclaim on behalf of optional batching, retry with a
 	 * single page instead.
@@ -2704,6 +2726,12 @@ static int __mem_cgroup_try_charge_memcg(gfp_t gfp_mask,
 				goto nomem;
 			nr_oom_retries--;
 			break;
+		/*
+		 * memcg went offline, the caller should fallback to
+		 * a different group.
+		 */
+		case CHARGE_OFFLINE:
+			return -EAGAIN;
 		}
 	} while (ret != CHARGE_OK);
 
@@ -2747,6 +2775,7 @@ static struct mem_cgroup *mem_cgroup_try_charge_mm(struct mm_struct *mm,
 
 	VM_BUG_ON(!mm);
 
+again:
 	do {
 		if (mem_cgroup_bypass_charge())
 			goto bypass;
@@ -2778,6 +2807,8 @@ static struct mem_cgroup *mem_cgroup_try_charge_mm(struct mm_struct *mm,
 		goto bypass;
 	else if (ret == -ENOMEM)
 		memcg = NULL;
+	else if (ret == -EAGAIN)
+		goto again;
 
 	return memcg;
 bypass:
@@ -3666,6 +3697,7 @@ __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **_memcg, int order)
 	if (!current->mm || current->memcg_kmem_skip_account)
 		return true;
 
+again:
 	memcg = try_get_mem_cgroup_from_mm(current->mm);
 
 	/*
@@ -3684,6 +3716,8 @@ __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **_memcg, int order)
 	ret = memcg_charge_kmem(memcg, gfp, PAGE_SIZE << order);
 	if (!ret)
 		*_memcg = memcg;
+	else if (ret == -EAGAIN)
+		goto again;
 
 	css_put(&memcg->css);
 	return (ret == 0);
@@ -4008,6 +4042,8 @@ static int __mem_cgroup_try_charge_swapin(struct mm_struct *mm,
 	if (ret == -EINTR) {
 		*memcgp = root_mem_cgroup;
 		ret = 0;
+	} else if (ret == -EAGAIN) {
+		goto charge_cur_mm;
 	}
 	return ret;
 charge_cur_mm:
@@ -6340,6 +6376,14 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 
+	/*
+	 * Mark the memcg offline and wait for the charger which uses
+	 * RCU read lock to make sure no charges will leak.
+	 * See mem_cgroup_do_charge for more details.
+	 */
+	memcg->offline = true;
+	synchronize_rcu();
+
 	kmem_cgroup_css_offline(memcg);
 
 	mem_cgroup_invalidate_reclaim_iterators(memcg);
@@ -6437,6 +6481,14 @@ one_by_one:
 			cond_resched();
 		}
 		ret = mem_cgroup_try_charge_memcg(GFP_KERNEL, 1, memcg, false);
+
+		/*
+		 * The target memcg cannot go offline because we are in
+		 * move path and cgroup core doesn't allow to offline
+		 * such groups.
+		 */
+		BUG_ON(ret == -EAGAIN);
+
 		if (ret)
 			/* mem_cgroup_clear_mc() will do uncharge later */
 			return ret;
-- 
1.8.4.4



* [RFC 5/5] Revert "mm: memcg: fix race condition between memcg teardown and swapin"
  2013-12-17 15:45 [RFC] memcg: some charge path cleanups + css offline vs. charge race fix Michal Hocko
                   ` (3 preceding siblings ...)
  2013-12-17 15:45 ` [RFC 4/5] memcg: make sure that memcg is not offline when charging Michal Hocko
@ 2013-12-17 15:45 ` Michal Hocko
  4 siblings, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2013-12-17 15:45 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, LKML, linux-mm

This reverts commit 96f1c58d853497a757463e0b57fed140d6858f3a
because it is no longer needed after "memcg: make sure that memcg is not
offline when charging" which makes sure that no charges will be accepted
after mem_cgroup_reparent_charges has started.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c | 36 ------------------------------------
 1 file changed, 36 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2904b2a6805a..591ced342036 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6395,42 +6395,6 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
-	/*
-	 * XXX: css_offline() would be where we should reparent all
-	 * memory to prepare the cgroup for destruction.  However,
-	 * memcg does not do css_tryget() and res_counter charging
-	 * under the same RCU lock region, which means that charging
-	 * could race with offlining.  Offlining only happens to
-	 * cgroups with no tasks in them but charges can show up
-	 * without any tasks from the swapin path when the target
-	 * memcg is looked up from the swapout record and not from the
-	 * current task as it usually is.  A race like this can leak
-	 * charges and put pages with stale cgroup pointers into
-	 * circulation:
-	 *
-	 * #0                        #1
-	 *                           lookup_swap_cgroup_id()
-	 *                           rcu_read_lock()
-	 *                           mem_cgroup_lookup()
-	 *                           css_tryget()
-	 *                           rcu_read_unlock()
-	 * disable css_tryget()
-	 * call_rcu()
-	 *   offline_css()
-	 *     reparent_charges()
-	 *                           res_counter_charge()
-	 *                           css_put()
-	 *                             css_free()
-	 *                           pc->mem_cgroup = dead memcg
-	 *                           add page to lru
-	 *
-	 * The bulk of the charges are still moved in offline_css() to
-	 * avoid pinning a lot of pages in case a long-term reference
-	 * like a swapout record is deferring the css_free() to long
-	 * after offlining.  But this makes sure we catch any charges
-	 * made after offlining:
-	 */
-	mem_cgroup_reparent_charges(memcg);
 
 	memcg_destroy_kmem(memcg);
 	__mem_cgroup_free(memcg);
-- 
1.8.4.4



* Re: [RFC 1/5] memcg: cleanup charge routines
  2013-12-17 15:45 ` [RFC 1/5] memcg: cleanup charge routines Michal Hocko
@ 2014-01-30 17:18   ` Johannes Weiner
  2014-02-03 13:20     ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: Johannes Weiner @ 2014-01-30 17:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, LKML, linux-mm

On Tue, Dec 17, 2013 at 04:45:26PM +0100, Michal Hocko wrote:
> The current core of memcg charging is wild to say the least.
> __mem_cgroup_try_charge which is in the center tries to be too clever
> and it handles two independent cases
> 	* when the memcg to be charged is known in advance
> 	* when the given mm_struct is charged
> The resulting callchains are quite complex:
> 
> memcg_charge_kmem(mm=NULL, memcg)  mem_cgroup_newpage_charge(mm)
>  |                                | _________________________________________ mem_cgroup_cache_charge(current->mm)
>  |                                |/                                            |
>  | ______________________________ mem_cgroup_charge_common(mm, memcg=NULL)      |
>  |/                                                                             /
>  |                                                                             /
>  | ____________________________ mem_cgroup_try_charge_swapin(mm, memcg=NULL)  /
>  |/                               | _________________________________________/
>  |                                |/
>  |                                |                         /* swap accounting */   /* no swap accounting */
>  | _____________________________  __mem_cgroup_try_charge_swapin(mm=NULL, memcg) || (mm, memcg=NULL)
>  |/
>  | ____________________________ mem_cgroup_do_precharge(mm=NULL, memcg)
>  |/
> __mem_cgroup_try_charge
>   mem_cgroup_do_charge
>     res_counter_charge
>     mem_cgroup_reclaim
>     mem_cgroup_wait_acct_move
>     mem_cgroup_oom
> 
> This patch splits __mem_cgroup_try_charge into two logical parts.
> mem_cgroup_try_charge_mm which is responsible for charges for the given
> mm_struct and it returns the charged memcg or NULL under OOM while
> mem_cgroup_try_charge_memcg charges a known memcg and returns an error
> code.
> 
> The only tricky part which remains is __mem_cgroup_try_charge_swapin
> because it can return 0 if PageCgroupUsed is already set and then we do
> not want to commit the charge. This is done with a magic combination of
> memcg = NULL and ret = 0. So the function preserves its memcgp parameter
> and sets the given memcg to NULL when it sees PageCgroupUsed
> (__mem_cgroup_commit_charge_swapin then ignores such a commit).
> 
> Not only the code is easier to follow the change reduces the code size
> too:
> $ size mm/built-in.o.before
>    text	   data	    bss	    dec	    hex	filename
>  457463	  83162	  49824	 590449	  90271	mm/built-in.o.before
> 
> $ size mm/built-in.o.after
>    text	   data	    bss	    dec	    hex	filename
>  456794	  83162	  49824	 589780	  8ffd4	mm/built-in.o.after

Nice!

> @@ -2655,37 +2655,68 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  }
>  
>  /*
> - * __mem_cgroup_try_charge() does
> - * 1. detect memcg to be charged against from passed *mm and *ptr,
> - * 2. update res_counter
> - * 3. call memory reclaim if necessary.
> + * __mem_cgroup_try_charge_memcg - core of the memcg charging code. The caller
> + * keeps a css reference to the given memcg. We do not charge root_mem_cgroup.
> + * OOM is triggered only if allowed by the given oom parameter (except for
> + * __GFP_NOFAIL when it is ignored).
>   *
> - * In some special case, if the task is fatal, fatal_signal_pending() or
> - * has TIF_MEMDIE, this function returns -EINTR while writing root_mem_cgroup
> - * to *ptr. There are two reasons for this. 1: fatal threads should quit as soon
> - * as possible without any hazards. 2: all pages should have a valid
> - * pc->mem_cgroup. If mm is NULL and the caller doesn't pass a valid memcg
> - * pointer, that is treated as a charge to root_mem_cgroup.
> - *
> - * So __mem_cgroup_try_charge() will return
> - *  0       ...  on success, filling *ptr with a valid memcg pointer.
> - *  -ENOMEM ...  charge failure because of resource limits.
> - *  -EINTR  ...  if thread is fatal. *ptr is filled with root_mem_cgroup.
> - *
> - * Unlike the exported interface, an "oom" parameter is added. if oom==true,
> - * the oom-killer can be invoked.
> + * Returns 0 on success, -ENOMEM when the given memcg is under OOM and -EINTR
> + * when the charge is bypassed (either when fatal signals are pending or
> + * __GFP_NOFAIL allocation cannot be charged).
>   */
> -static int __mem_cgroup_try_charge(struct mm_struct *mm,
> -				   gfp_t gfp_mask,
> +static int __mem_cgroup_try_charge_memcg(gfp_t gfp_mask,
>  				   unsigned int nr_pages,
> -				   struct mem_cgroup **ptr,
> +				   struct mem_cgroup *memcg,
>  				   bool oom)

Why not keep the __mem_cgroup_try_charge() name?  It's shorter and
just as descriptive.

>  {
>  	unsigned int batch = max(CHARGE_BATCH, nr_pages);
>  	int nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
> -	struct mem_cgroup *memcg = NULL;
>  	int ret;
>  
> +	VM_BUG_ON(!memcg || memcg == root_mem_cgroup);
> +
> +	if (unlikely(task_in_memcg_oom(current)))
> +		goto nomem;
> +
> +	if (gfp_mask & __GFP_NOFAIL)
> +		oom = false;
> +
> +	do {
> +		bool invoke_oom = oom && !nr_oom_retries;
> +
> +		/* If killed, bypass charge */
> +		if (fatal_signal_pending(current))
> +			goto bypass;
> +
> +		ret = mem_cgroup_do_charge(memcg, gfp_mask, batch,
> +					   nr_pages, invoke_oom);
> +		switch (ret) {
> +		case CHARGE_RETRY: /* not in OOM situation but retry */
> +			batch = nr_pages;
> +			break;
> +		case CHARGE_WOULDBLOCK: /* !__GFP_WAIT */
> +			goto nomem;
> +		case CHARGE_NOMEM: /* OOM routine works */
> +			if (!oom || invoke_oom)
> +				goto nomem;
> +			nr_oom_retries--;
> +			break;
> +		}
> +	} while (ret != CHARGE_OK);
> +
> +	if (batch > nr_pages)
> +		refill_stock(memcg, batch - nr_pages);
> +
> +	return 0;
> +nomem:
> +	if (!(gfp_mask & __GFP_NOFAIL))
> +		return -ENOMEM;
> +bypass:
> +	return -EINTR;
> +}
> +
> +static bool mem_cgroup_bypass_charge(void)

The name and parameter list suggest this consults some global memory
cgroup state.  current_bypass_charge()?  I think ultimately we want to
move away from all these mem_cgroup prefixes of static functions in
there; they add nothing of value.

> +{
>  	/*
>  	 * Unlike gloval-vm's OOM-kill, we're not in memory shortage
>  	 * in system level. So, allow to go ahead dying process in addition to
> @@ -2693,13 +2724,23 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>  	 */
>  	if (unlikely(test_thread_flag(TIF_MEMDIE)
>  		     || fatal_signal_pending(current)))
> -		goto bypass;
> +		return true;
>  
> -	if (unlikely(task_in_memcg_oom(current)))
> -		goto nomem;
> +	return false;
> +}
>  
> -	if (gfp_mask & __GFP_NOFAIL)
> -		oom = false;
> +/*
> + * Charges and returns memcg associated with the given mm (or root_mem_cgroup
> + * if mm is NULL). Returns NULL if memcg is under OOM.
> + */
> +static struct mem_cgroup *mem_cgroup_try_charge_mm(struct mm_struct *mm,
> +				   gfp_t gfp_mask,
> +				   unsigned int nr_pages,
> +				   bool oom)

We already have a try_get_mem_cgroup_from_mm().  After this series,
this function basically duplicates that and it would be much cleaner
if we only had one try_charge() function and let all the callers use
the appropriate try_get_mem_cgroup_from_wherever() themselves.

If you pull the patch that moves consume_stock() back into
try_charge() up front, I think this cleanup would be more obvious and
the result even better.


* Re: [RFC 4/5] memcg: make sure that memcg is not offline when charging
  2013-12-17 15:45 ` [RFC 4/5] memcg: make sure that memcg is not offline when charging Michal Hocko
@ 2014-01-30 17:29   ` Johannes Weiner
  2014-02-03 13:33     ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: Johannes Weiner @ 2014-01-30 17:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, LKML, linux-mm

On Tue, Dec 17, 2013 at 04:45:29PM +0100, Michal Hocko wrote:
> The current charge path might race with memcg offlining because holding
> css reference doesn't stop css offline. As a result res counter might be
> charged after mem_cgroup_reparent_charges (called from memcg css_offline
> callback) and so the charge would never be freed. This has been worked
> around by 96f1c58d8534 (mm: memcg: fix race condition between memcg
> teardown and swapin) which tries to catch such a leaked charges later
> during css_free. It is more optimal to heal this race in the long term
> though.

We already deal with the race, so IMO the only outstanding improvement
is to take advantage of the teardown synchronization provided by the
cgroup core and get rid of our one-liner workaround in .css_free.

> In order to make this raceless we would need to hold rcu_read_lock since
> css_tryget until res_counter_charge. This is not so easy unfortunately
> because mem_cgroup_do_charge might sleep so we would need to do drop rcu
> lock and do css_tryget tricks after each reclaim.

Yes, why not?

> This patch addresses the issue by introducing memcg->offline flag
> which is set from mem_cgroup_css_offline callback before the pages are
> reparented. mem_cgroup_do_charge checks the flag before res_counter
> is charged inside rcu read section. mem_cgroup_css_offline uses
> synchronize_rcu to let all preceding chargers finish while all the new
> ones will see the group offline already and back out.
>
> Callers are then updated to retry with a new memcg which is fallback to
> mem_cgroup_from_task(current).
> 
> The only exception is mem_cgroup_do_precharge which should never see
> this race because it is called from cgroup {can_}attach callbacks and so
> the whole cgroup cannot go away.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> ---
>  mm/memcontrol.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 55 insertions(+), 3 deletions(-)

That makes no sense to me.  It's a lateral move in functionality and
cgroup integration, but more complicated.


* Re: [RFC 1/5] memcg: cleanup charge routines
  2014-01-30 17:18   ` Johannes Weiner
@ 2014-02-03 13:20     ` Michal Hocko
  2014-02-03 15:51       ` Johannes Weiner
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2014-02-03 13:20 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, LKML, linux-mm

On Thu 30-01-14 12:18:37, Johannes Weiner wrote:
> On Tue, Dec 17, 2013 at 04:45:26PM +0100, Michal Hocko wrote:
[...]
> > -static int __mem_cgroup_try_charge(struct mm_struct *mm,
> > -				   gfp_t gfp_mask,
> > +static int __mem_cgroup_try_charge_memcg(gfp_t gfp_mask,
> >  				   unsigned int nr_pages,
> > -				   struct mem_cgroup **ptr,
> > +				   struct mem_cgroup *memcg,
> >  				   bool oom)
> 
> Why not keep the __mem_cgroup_try_charge() name?  It's shorter and
> just as descriptive.

I wanted to have 2 different names with clear reference to _what_ is
going to be charged. But I am always open to naming suggestions.

[...]
> > +static bool mem_cgroup_bypass_charge(void)
> 
> The name and parameter list suggests this consults some global memory
> cgroup state.  current_bypass_charge()?

OK, that sounds better.

> I think ultimately we want to move away from all these mem_cgroup
> prefixes of static functions in there, they add nothing of value.

Yes, I agree that mem_cgroup prefix is clumsy and we should drop it.

[...]
> > +/*
> > + * Charges and returns memcg associated with the given mm (or root_mem_cgroup
> > + * if mm is NULL). Returns NULL if memcg is under OOM.
> > + */
> > +static struct mem_cgroup *mem_cgroup_try_charge_mm(struct mm_struct *mm,
> > +				   gfp_t gfp_mask,
> > +				   unsigned int nr_pages,
> > +				   bool oom)
> 
> We already have a try_get_mem_cgroup_from_mm().
>
> After this series, this function basically duplicates that and it
> would be much cleaner if we only had one try_charge() function and let
> all the callers use the appropriate try_get_mem_cgroup_from_wherever()
> themselves.

try_get_mem_cgroup_from_mm doesn't charge memory itself. It just tries
to get the memcg from the given mm. It is also called from a context which
doesn't charge any memory (task_in_mem_cgroup). Or have I misunderstood
you?

> If you pull the patch that moves consume_stock() back into
> try_charge() up front, I think this cleanup would be more obvious and
> the result even better.

OK, I can move it.

Thanks!

-- 
Michal Hocko
SUSE Labs


* Re: [RFC 4/5] memcg: make sure that memcg is not offline when charging
  2014-01-30 17:29   ` Johannes Weiner
@ 2014-02-03 13:33     ` Michal Hocko
  2014-02-03 16:18       ` Johannes Weiner
  0 siblings, 1 reply; 14+ messages in thread
From: Michal Hocko @ 2014-02-03 13:33 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, LKML, linux-mm

On Thu 30-01-14 12:29:06, Johannes Weiner wrote:
> On Tue, Dec 17, 2013 at 04:45:29PM +0100, Michal Hocko wrote:
> > The current charge path might race with memcg offlining because holding
> > css reference doesn't stop css offline. As a result res counter might be
> > charged after mem_cgroup_reparent_charges (called from memcg css_offline
> > callback) and so the charge would never be freed. This has been worked
> > around by 96f1c58d8534 (mm: memcg: fix race condition between memcg
> > teardown and swapin) which tries to catch such a leaked charges later
> > during css_free. It is more optimal to heal this race in the long term
> > though.
> 
> We already deal with the race, so IMO the only outstanding improvement
> is to take advantage of the teardown synchronization provided by the
> cgroup core and get rid of our one-liner workaround in .css_free.

I am not sure I am following you here. Which teardown synchronization do
you have in mind? rcu_read_lock & css_tryget?

> > In order to make this raceless we would need to hold rcu_read_lock since
> > css_tryget until res_counter_charge. This is not so easy unfortunately
> > because mem_cgroup_do_charge might sleep so we would need to do drop rcu
> > lock and do css_tryget tricks after each reclaim.
> 
> Yes, why not?

Although css_tryget is cheap these days, I thought that a simple flag
check would be even cheaper in this hot path. Changing the patch to use
css_tryget rather than the offline check is trivial if you really think it
is better.

Btw. I plan to repost the series as soon as Andrew releases his mmotm
tree.

Thanks
-- 
Michal Hocko
SUSE Labs


* Re: [RFC 1/5] memcg: cleanup charge routines
  2014-02-03 13:20     ` Michal Hocko
@ 2014-02-03 15:51       ` Johannes Weiner
  2014-02-03 16:41         ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: Johannes Weiner @ 2014-02-03 15:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, LKML, linux-mm

On Mon, Feb 03, 2014 at 02:20:01PM +0100, Michal Hocko wrote:
> On Thu 30-01-14 12:18:37, Johannes Weiner wrote:
> > On Tue, Dec 17, 2013 at 04:45:26PM +0100, Michal Hocko wrote:
> [...]
> > > -static int __mem_cgroup_try_charge(struct mm_struct *mm,
> > > -				   gfp_t gfp_mask,
> > > +static int __mem_cgroup_try_charge_memcg(gfp_t gfp_mask,
> > >  				   unsigned int nr_pages,
> > > -				   struct mem_cgroup **ptr,
> > > +				   struct mem_cgroup *memcg,
> > >  				   bool oom)
> > 
> > Why not keep the __mem_cgroup_try_charge() name?  It's shorter and
> > just as descriptive.
> 
> I wanted to have 2 different names with clear reference to _what_ is
> going to be charged. But I am always open to naming suggestions.
>
> > > +/*
> > > + * Charges and returns memcg associated with the given mm (or root_mem_cgroup
> > > + * if mm is NULL). Returns NULL if memcg is under OOM.
> > > + */
> > > +static struct mem_cgroup *mem_cgroup_try_charge_mm(struct mm_struct *mm,
> > > +				   gfp_t gfp_mask,
> > > +				   unsigned int nr_pages,
> > > +				   bool oom)
> > 
> > We already have a try_get_mem_cgroup_from_mm().
> >
> > After this series, this function basically duplicates that and it
> > would be much cleaner if we only had one try_charge() function and let
> > all the callers use the appropriate try_get_mem_cgroup_from_wherever()
> > themselves.
> 
> try_get_mem_cgroup_from_mm doesn't charge memory itself. It just tries
> to get memcg from the given mm. It is called also from a context which
> doesn't charge any memory (task_in_mem_cgroup). Or have I misunderstood
> you?

Your mem_cgroup_try_charge_mm() looks up a memcg from mm and calls
try_charge().  But we have try_get_mem_cgroup_from_mm() to do the
first half, so why not have the current callers of
mem_cgroup_try_charge_mm() just use try_get_mem_cgroup_from_mm() and
try_charge()?  Why is charging through an mm - as opposed to through a
page or through the task - special?

Most callsites already do the lookups themselves:

try_charge_swapin:	uses try_get_mem_cgroup_from_page()
kmem_newpage_charge:	uses try_get_mem_cgroup_from_mm()
precharge:		uses mem_cgroup_from_task()

So just let these two do the same:

newpage_charge:		could use try_get_mem_cgroup_from_mm()
cache_charge:		could use try_get_mem_cgroup_from_mm()

And then provide one try_charge() that always takes a non-null memcg.
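
Roughly, so that every charge site ends up with the same pattern (a sketch
only; the exact try_charge() signature is up for grabs and NULL/error
handling is omitted):

	memcg = try_get_mem_cgroup_from_mm(current->mm);
	ret = try_charge(gfp_mask, nr_pages, memcg, oom);
	css_put(&memcg->css);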


* Re: [RFC 4/5] memcg: make sure that memcg is not offline when charging
  2014-02-03 13:33     ` Michal Hocko
@ 2014-02-03 16:18       ` Johannes Weiner
  2014-02-03 16:44         ` Michal Hocko
  0 siblings, 1 reply; 14+ messages in thread
From: Johannes Weiner @ 2014-02-03 16:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, LKML, linux-mm

On Mon, Feb 03, 2014 at 02:33:13PM +0100, Michal Hocko wrote:
> On Thu 30-01-14 12:29:06, Johannes Weiner wrote:
> > On Tue, Dec 17, 2013 at 04:45:29PM +0100, Michal Hocko wrote:
> > > The current charge path might race with memcg offlining because holding
> > > css reference doesn't stop css offline. As a result res counter might be
> > > charged after mem_cgroup_reparent_charges (called from memcg css_offline
> > > callback) and so the charge would never be freed. This has been worked
> > > around by 96f1c58d8534 (mm: memcg: fix race condition between memcg
> > > teardown and swapin) which tries to catch such a leaked charges later
> > > during css_free. It is more optimal to heal this race in the long term
> > > though.
> > 
> > We already deal with the race, so IMO the only outstanding improvement
> > is to take advantage of the teardown synchronization provided by the
> > cgroup core and get rid of our one-liner workaround in .css_free.
> 
> I am not sure I am following you here. Which teardown synchronization do
> you have in mind? rcu_read_lock & css_tryget?

Yes.  It provides rcu synchronization between establishing new
references and offlining, as long as you establish references
atomically in one RCU read-side section:

repeat:
  rcu_read_lock()
  css_tryget()
  res_counter_charge()
  rcu_read_unlock()
  if retries++ < RECLAIM_RETRIES:
    reclaim
    goto repeat

> > > In order to make this raceless we would need to hold rcu_read_lock since
> > > css_tryget until res_counter_charge. This is not so easy unfortunately
> > > because mem_cgroup_do_charge might sleep so we would need to do drop rcu
> > > lock and do css_tryget tricks after each reclaim.
> > 
> > Yes, why not?
> 
> Although css_tryget is cheap these days I thought that a simple flag
> check would be even heaper in this hot path. Changing the patch to use
> css_tryget rather than offline check is trivial if you really think it
> is better?

You already changed it to do css_tryget() on every single charge.

Direct reclaim is invoked only from a fraction of all charges, it's
already a slowpath, I don't think another percpu counter op will be
the final straw that makes this path too fat.


* Re: [RFC 1/5] memcg: cleanup charge routines
  2014-02-03 15:51       ` Johannes Weiner
@ 2014-02-03 16:41         ` Michal Hocko
  0 siblings, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2014-02-03 16:41 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, LKML, linux-mm

On Mon 03-02-14 10:51:27, Johannes Weiner wrote:
> On Mon, Feb 03, 2014 at 02:20:01PM +0100, Michal Hocko wrote:
> > On Thu 30-01-14 12:18:37, Johannes Weiner wrote:
> > > On Tue, Dec 17, 2013 at 04:45:26PM +0100, Michal Hocko wrote:
> > [...]
> > > > -static int __mem_cgroup_try_charge(struct mm_struct *mm,
> > > > -				   gfp_t gfp_mask,
> > > > +static int __mem_cgroup_try_charge_memcg(gfp_t gfp_mask,
> > > >  				   unsigned int nr_pages,
> > > > -				   struct mem_cgroup **ptr,
> > > > +				   struct mem_cgroup *memcg,
> > > >  				   bool oom)
> > > 
> > > Why not keep the __mem_cgroup_try_charge() name?  It's shorter and
> > > just as descriptive.
> > 
> > I wanted to have 2 different names with clear reference to _what_ is
> > going to be charged. But I am always open to naming suggestions.
> >
> > > > +/*
> > > > + * Charges and returns memcg associated with the given mm (or root_mem_cgroup
> > > > + * if mm is NULL). Returns NULL if memcg is under OOM.
> > > > + */
> > > > +static struct mem_cgroup *mem_cgroup_try_charge_mm(struct mm_struct *mm,
> > > > +				   gfp_t gfp_mask,
> > > > +				   unsigned int nr_pages,
> > > > +				   bool oom)
> > > 
> > > We already have a try_get_mem_cgroup_from_mm().
> > >
> > > After this series, this function basically duplicates that and it
> > > would be much cleaner if we only had one try_charge() function and let
> > > all the callers use the appropriate try_get_mem_cgroup_from_wherever()
> > > themselves.
> > 
> > try_get_mem_cgroup_from_mm doesn't charge memory itself. It just tries
> > to get memcg from the given mm. It is called also from a context which
> > doesn't charge any memory (task_in_mem_cgroup). Or have I misunderstood
> > you?
> 
> Your mem_cgroup_try_charge_mm() looks up a memcg from mm and calls
> try_charge().  But we have try_get_mem_cgroup_from_mm() to do the
> first half, so why not have the current callers of
> mem_cgroup_try_charge_mm() just use try_get_mem_cgroup_from_mm() and
> try_charge()?  Why is charging through an mm - as opposed to through a
> page or through the task - special?

OK, I thought it would be easier to follow those two different ways of
charging (known memcg vs. current mm - I needed that when tracking down
some issue and ended up in quite some despair over all the different
combinations), but that is probably not that interesting, and what you are
proposing reduces the code even more. OK, I will play with this some more.

> Most callsites already do the lookups themselves:
> 
> try_charge_swapin:	uses try_get_mem_cgroup_from_page()
> kmem_newpage_charge:	uses try_get_mem_cgroup_from_mm()
> precharge:		uses mem_cgroup_from_task()
> 
> So just let these two do the same:
> 
> newpage_charge:		could use try_get_mem_cgroup_from_mm()
> cache_charge:		could use try_get_mem_cgroup_from_mm()
> 
> And then provide one try_charge() that always takes a non-null memcg.

Thanks!

-- 
Michal Hocko
SUSE Labs


* Re: [RFC 4/5] memcg: make sure that memcg is not offline when charging
  2014-02-03 16:18       ` Johannes Weiner
@ 2014-02-03 16:44         ` Michal Hocko
  0 siblings, 0 replies; 14+ messages in thread
From: Michal Hocko @ 2014-02-03 16:44 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, KOSAKI Motohiro, LKML, linux-mm

On Mon 03-02-14 11:18:23, Johannes Weiner wrote:
> On Mon, Feb 03, 2014 at 02:33:13PM +0100, Michal Hocko wrote:
> > On Thu 30-01-14 12:29:06, Johannes Weiner wrote:
> > > On Tue, Dec 17, 2013 at 04:45:29PM +0100, Michal Hocko wrote:
[...]
> > > > In order to make this raceless we would need to hold rcu_read_lock since
> > > > css_tryget until res_counter_charge. This is not so easy unfortunately
> > > > because mem_cgroup_do_charge might sleep so we would need to do drop rcu
> > > > lock and do css_tryget tricks after each reclaim.
> > > 
> > > Yes, why not?
> > 
> > Although css_tryget is cheap these days I thought that a simple flag
> > check would be even heaper in this hot path. Changing the patch to use
> > css_tryget rather than offline check is trivial if you really think it
> > is better?
> 
> You already changed it to do css_tryget() on every single charge.

Fair point

Thanks!
-- 
Michal Hocko
SUSE Labs

