* [PATCH 1/4] memcg: do not try to drain per-cpu caches without pages
  2011-07-22 12:43 [PATCH 0/4 v2] memcg: cleanup per-cpu charge caches Michal Hocko
@ 2011-07-21  7:38 ` Michal Hocko
  2011-07-25  1:16     ` KAMEZAWA Hiroyuki
  2011-07-22 10:24 ` [PATCH 2/4] memcg: unify sync and async per-cpu charge cache draining Michal Hocko
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 41+ messages in thread
From: Michal Hocko @ 2011-07-21  7:38 UTC (permalink / raw)
  To: linux-mm; +Cc: Balbir Singh, KAMEZAWA Hiroyuki, Andrew Morton, linux-kernel

drain_all_stock_async tries to reduce the work done on the work queue
by excluding the current CPU, because it assumes that the calling
context has already tried to charge from that cache and failed, so it
must be empty already.
While the assumption is correct, we can optimize it even more by
checking the current number of pages in the cache. This will also spare
the work on other CPUs whose stock is empty.
For the current CPU we can simply call drain_local_stock rather than
deferring it to the work queue.
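
For reference, the per-cpu cache this is all about currently looks
roughly like this (simplified, the comments are mine):

	struct memcg_stock_pcp {
		struct mem_cgroup *cached;	/* memcg the cached charges belong to */
		unsigned int nr_pages;		/* pre-charged pages stocked on this cpu */
		struct work_struct work;	/* used to drain the stock from a worker */
		unsigned long flags;
	#define FLUSHING_CACHED_CHARGE	(0)	/* a drain of this stock is in flight */
	};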

[KAMEZAWA Hiroyuki - use drain_local_stock for current CPU optimization]
Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c |   13 +++++++------
 1 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f11f198..c012ffe 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2159,11 +2159,8 @@ static void drain_all_stock_async(struct mem_cgroup *root_mem)
 		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
 		struct mem_cgroup *mem;
 
-		if (cpu == curcpu)
-			continue;
-
 		mem = stock->cached;
-		if (!mem)
+		if (!mem || !stock->nr_pages)
 			continue;
 		if (mem != root_mem) {
 			if (!root_mem->use_hierarchy)
@@ -2172,8 +2169,12 @@ static void drain_all_stock_async(struct mem_cgroup *root_mem)
 			if (!css_is_ancestor(&mem->css, &root_mem->css))
 				continue;
 		}
-		if (!test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
-			schedule_work_on(cpu, &stock->work);
+		if (!test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
+			if (cpu == curcpu)
+				drain_local_stock(&stock->work);
+			else
+				schedule_work_on(cpu, &stock->work);
+		}
 	}
  	put_online_cpus();
 	mutex_unlock(&percpu_charge_mutex);
-- 
1.7.5.4



* [PATCH 2/4] memcg: unify sync and async per-cpu charge cache draining
  2011-07-22 12:43 [PATCH 0/4 v2] memcg: cleanup per-cpu charge caches Michal Hocko
  2011-07-21  7:38 ` [PATCH 1/4] memcg: do not try to drain per-cpu caches without pages Michal Hocko
@ 2011-07-22 10:24 ` Michal Hocko
  2011-07-22 10:27 ` [PATCH 3/4] memcg: add mem_cgroup_same_or_subtree helper Michal Hocko
  2011-07-22 11:20 ` [PATCH 4/4] memcg: get rid of percpu_charge_mutex lock Michal Hocko
  3 siblings, 0 replies; 41+ messages in thread
From: Michal Hocko @ 2011-07-22 10:24 UTC (permalink / raw)
  To: linux-mm; +Cc: Balbir Singh, KAMEZAWA Hiroyuki, Andrew Morton, linux-kernel

Currently we have two ways to drain the per-CPU charge caches.
drain_all_stock_sync synchronously drains all caches while
drain_all_stock_async asynchronously drains only those that refer to a
given memory cgroup or its subtree in the hierarchy.
Targeted async draining was introduced by 26fe6168 (memcg: fix percpu
cached charge draining frequency) to reduce the number of cpu workers.

Sync draining is currently triggered only from mem_cgroup_force_empty,
which is invoked either from userspace (mem_cgroup_force_empty_write) or
when a cgroup is removed (mem_cgroup_pre_destroy). Although these are not
usually frequent operations, it still makes sense to do targeted draining
there as well, especially if the box has many CPUs.

This patch unifies both methods into a single function (drain_all_stock)
which relies on the original async implementation and, for the sync mode,
just adds flush_work to wait on all caches that are still being drained.
The FLUSHING_CACHED_CHARGE bit check prevents us from waiting on a work
that we haven't triggered.
Please note that both the sync and async functions are currently
protected by percpu_charge_mutex so we cannot race with other drainers.
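
After this patch both entry points just take percpu_charge_mutex and call
the common helper (sketch only, see the diff below):

	/* async: fire and forget, charges trickle back to the res_counter later */
	drain_all_stock(root_mem, false);

	/* sync: additionally flush_work() on every cache that is being drained */
	drain_all_stock(root_mem, true);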

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   48 ++++++++++++++++++++++++++++++++++--------------
 1 files changed, 34 insertions(+), 14 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c012ffe..8852bed 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2133,19 +2133,14 @@ static void refill_stock(struct mem_cgroup *mem, unsigned int nr_pages)
 }
 
 /*
- * Tries to drain stocked charges in other cpus. This function is asynchronous
- * and just put a work per cpu for draining localy on each cpu. Caller can
- * expects some charges will be back to res_counter later but cannot wait for
- * it.
+ * Drains all per-CPU charge caches for given root_mem resp. subtree
+ * of the hierarchy under it. sync flag says whether we should block
+ * until the work is done.
  */
-static void drain_all_stock_async(struct mem_cgroup *root_mem)
+static void drain_all_stock(struct mem_cgroup *root_mem, bool sync)
 {
 	int cpu, curcpu;
-	/*
-	 * If someone calls draining, avoid adding more kworker runs.
-	 */
-	if (!mutex_trylock(&percpu_charge_mutex))
-		return;
+
 	/* Notify other cpus that system-wide "drain" is running */
 	get_online_cpus();
 	/*
@@ -2176,17 +2171,42 @@ static void drain_all_stock_async(struct mem_cgroup *root_mem)
 				schedule_work_on(cpu, &stock->work);
 		}
 	}
+
+	if (!sync)
+		goto out;
+
+	for_each_online_cpu(cpu) {
+		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
+		if (test_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
+			flush_work(&stock->work);
+	}
+out:
  	put_online_cpus();
+}
+
+/*
+ * Tries to drain stocked charges in other cpus. This function is asynchronous
+ * and just put a work per cpu for draining localy on each cpu. Caller can
+ * expects some charges will be back to res_counter later but cannot wait for
+ * it.
+ */
+static void drain_all_stock_async(struct mem_cgroup *root_mem)
+{
+	/*
+	 * If someone calls draining, avoid adding more kworker runs.
+	 */
+	if (!mutex_trylock(&percpu_charge_mutex))
+		return;
+	drain_all_stock(root_mem, false);
 	mutex_unlock(&percpu_charge_mutex);
-	/* We don't wait for flush_work */
 }
 
 /* This is a synchronous drain interface. */
-static void drain_all_stock_sync(void)
+static void drain_all_stock_sync(struct mem_cgroup *root_mem)
 {
 	/* called when force_empty is called */
 	mutex_lock(&percpu_charge_mutex);
-	schedule_on_each_cpu(drain_local_stock);
+	drain_all_stock(root_mem, true);
 	mutex_unlock(&percpu_charge_mutex);
 }
 
@@ -3835,7 +3855,7 @@ move_account:
 			goto out;
 		/* This is for making all *used* pages to be on LRU. */
 		lru_add_drain_all();
-		drain_all_stock_sync();
+		drain_all_stock_sync(mem);
 		ret = 0;
 		mem_cgroup_start_move(mem);
 		for_each_node_state(node, N_HIGH_MEMORY) {
-- 
1.7.5.4



* [PATCH 3/4] memcg: add mem_cgroup_same_or_subtree helper
  2011-07-22 12:43 [PATCH 0/4 v2] memcg: cleanup per-cpu charge caches Michal Hocko
  2011-07-21  7:38 ` [PATCH 1/4] memcg: do not try to drain per-cpu caches without pages Michal Hocko
  2011-07-22 10:24 ` [PATCH 2/4] memcg: unify sync and async per-cpu charge cache draining Michal Hocko
@ 2011-07-22 10:27 ` Michal Hocko
  2011-07-22 11:20 ` [PATCH 4/4] memcg: get rid of percpu_charge_mutex lock Michal Hocko
  3 siblings, 0 replies; 41+ messages in thread
From: Michal Hocko @ 2011-07-22 10:27 UTC (permalink / raw)
  To: linux-mm; +Cc: Balbir Singh, KAMEZAWA Hiroyuki, Andrew Morton, linux-kernel

We check whether two given groups are the same or at least in the same
subtree of a hierarchy in several places. Let's add a helper for it to
make the code easier to read.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   51 ++++++++++++++++++++++++++-------------------------
 1 files changed, 26 insertions(+), 25 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8852bed..3685107 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1046,6 +1046,21 @@ void mem_cgroup_move_lists(struct page *page,
 	mem_cgroup_add_lru_list(page, to);
 }
 
+/*
+ * Checks whether given mem is same or in the root_mem's
+ * hierarchy subtree
+ */
+static bool mem_cgroup_same_or_subtree(const struct mem_cgroup *root_mem,
+		struct mem_cgroup *mem)
+{
+	if (root_mem != mem) {
+		return (root_mem->use_hierarchy &&
+			css_is_ancestor(&mem->css, &root_mem->css));
+	}
+
+	return true;
+}
+
 int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
 {
 	int ret;
@@ -1065,10 +1080,7 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
 	 * enabled in "curr" and "curr" is a child of "mem" in *cgroup*
 	 * hierarchy(even if use_hierarchy is disabled in "mem").
 	 */
-	if (mem->use_hierarchy)
-		ret = css_is_ancestor(&curr->css, &mem->css);
-	else
-		ret = (curr == mem);
+	ret = mem_cgroup_same_or_subtree(mem, curr);
 	css_put(&curr->css);
 	return ret;
 }
@@ -1404,10 +1416,9 @@ static bool mem_cgroup_under_move(struct mem_cgroup *mem)
 	to = mc.to;
 	if (!from)
 		goto unlock;
-	if (from == mem || to == mem
-	    || (mem->use_hierarchy && css_is_ancestor(&from->css, &mem->css))
-	    || (mem->use_hierarchy && css_is_ancestor(&to->css,	&mem->css)))
-		ret = true;
+
+	ret = mem_cgroup_same_or_subtree(mem, from)
+		|| mem_cgroup_same_or_subtree(mem, to);
 unlock:
 	spin_unlock(&mc.lock);
 	return ret;
@@ -1894,25 +1905,20 @@ struct oom_wait_info {
 static int memcg_oom_wake_function(wait_queue_t *wait,
 	unsigned mode, int sync, void *arg)
 {
-	struct mem_cgroup *wake_mem = (struct mem_cgroup *)arg;
+	struct mem_cgroup *wake_mem = (struct mem_cgroup *)arg,
+			  *oom_wait_mem;
 	struct oom_wait_info *oom_wait_info;
 
 	oom_wait_info = container_of(wait, struct oom_wait_info, wait);
+	oom_wait_mem = oom_wait_info->mem;
 
-	if (oom_wait_info->mem == wake_mem)
-		goto wakeup;
-	/* if no hierarchy, no match */
-	if (!oom_wait_info->mem->use_hierarchy || !wake_mem->use_hierarchy)
-		return 0;
 	/*
 	 * Both of oom_wait_info->mem and wake_mem are stable under us.
 	 * Then we can use css_is_ancestor without taking care of RCU.
 	 */
-	if (!css_is_ancestor(&oom_wait_info->mem->css, &wake_mem->css) &&
-	    !css_is_ancestor(&wake_mem->css, &oom_wait_info->mem->css))
+	if (!mem_cgroup_same_or_subtree(oom_wait_mem, wake_mem)
+			&& !mem_cgroup_same_or_subtree(wake_mem, oom_wait_mem))
 		return 0;
-
-wakeup:
 	return autoremove_wake_function(wait, mode, sync, arg);
 }
 
@@ -2157,13 +2163,8 @@ static void drain_all_stock(struct mem_cgroup *root_mem, bool sync)
 		mem = stock->cached;
 		if (!mem || !stock->nr_pages)
 			continue;
-		if (mem != root_mem) {
-			if (!root_mem->use_hierarchy)
-				continue;
-			/* check whether "mem" is under tree of "root_mem" */
-			if (!css_is_ancestor(&mem->css, &root_mem->css))
-				continue;
-		}
+		if (!mem_cgroup_same_or_subtree(root_mem, mem))
+			continue;
 		if (!test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
 			if (cpu == curcpu)
 				drain_local_stock(&stock->work);
-- 
1.7.5.4



* [PATCH 4/4] memcg: get rid of percpu_charge_mutex lock
  2011-07-22 12:43 [PATCH 0/4 v2] memcg: cleanup per-cpu charge caches Michal Hocko
                   ` (2 preceding siblings ...)
  2011-07-22 10:27 ` [PATCH 3/4] memcg: add mem_cgroup_same_or_subtree helper Michal Hocko
@ 2011-07-22 11:20 ` Michal Hocko
  2011-07-25  1:18     ` KAMEZAWA Hiroyuki
  2011-08-08 18:47     ` Johannes Weiner
  3 siblings, 2 replies; 41+ messages in thread
From: Michal Hocko @ 2011-07-22 11:20 UTC (permalink / raw)
  To: linux-mm; +Cc: Balbir Singh, KAMEZAWA Hiroyuki, Andrew Morton, linux-kernel

percpu_charge_mutex protects against multiple simultaneous drainings of
the per-cpu charge caches because we might otherwise end up with too many
work items.
At least this was the case until 26fe6168 (memcg: fix percpu cached
charge draining frequency), when we introduced more targeted draining for
the async mode.
Now that the sync draining is targeted as well, we can safely remove the
mutex because we will never send more work items than the current number
of CPUs.
The FLUSHING_CACHED_CHARGE bit protects against sending the same work
multiple times, and the stock->nr_pages == 0 check protects against
pointlessly sending a work item when there is obviously nothing to be
done. This is of course racy, but we can live with it as the race window
is really small (we would have to see FLUSHING_CACHED_CHARGE cleared
while nr_pages is still non-zero).
The only remaining place where we can race is the synchronous mode, where
we rely on a FLUSHING_CACHED_CHARGE test which might have been set by
another drainer on the same group, but we should wait in that case as
well.
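
To illustrate how small the race window really is (illustration only, not
part of the patch), the worst case is a single pointless work item:

	drain_all_stock()                     drain_local_stock() on cpu X
	-----------------                     ----------------------------
	mem = stock->cached;
	if (!mem || !stock->nr_pages)
		/* still sees nr_pages != 0 */
	                                      drain_stock() empties the stock
	                                      clear_bit(FLUSHING_CACHED_CHARGE)
	!test_and_set_bit(...)  /* succeeds */
	schedule_work_on(X, &stock->work)     /* the extra drain finds nothing to do */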

Signed-off-by: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c |   12 ++----------
 1 files changed, 2 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3685107..f8463a0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2071,7 +2071,6 @@ struct memcg_stock_pcp {
 #define FLUSHING_CACHED_CHARGE	(0)
 };
 static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
-static DEFINE_MUTEX(percpu_charge_mutex);
 
 /*
  * Try to consume stocked charge on this cpu. If success, one page is consumed
@@ -2178,7 +2177,8 @@ static void drain_all_stock(struct mem_cgroup *root_mem, bool sync)
 
 	for_each_online_cpu(cpu) {
 		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
-		if (test_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
+		if (mem_cgroup_same_or_subtree(root_mem, stock->cached) &&
+				test_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
 			flush_work(&stock->work);
 	}
 out:
@@ -2193,22 +2193,14 @@ out:
  */
 static void drain_all_stock_async(struct mem_cgroup *root_mem)
 {
-	/*
-	 * If someone calls draining, avoid adding more kworker runs.
-	 */
-	if (!mutex_trylock(&percpu_charge_mutex))
-		return;
 	drain_all_stock(root_mem, false);
-	mutex_unlock(&percpu_charge_mutex);
 }
 
 /* This is a synchronous drain interface. */
 static void drain_all_stock_sync(struct mem_cgroup *root_mem)
 {
 	/* called when force_empty is called */
-	mutex_lock(&percpu_charge_mutex);
 	drain_all_stock(root_mem, true);
-	mutex_unlock(&percpu_charge_mutex);
 }
 
 /*
-- 
1.7.5.4


* [PATCH 0/4 v2] memcg: cleanup per-cpu charge caches
@ 2011-07-22 12:43 Michal Hocko
  2011-07-21  7:38 ` [PATCH 1/4] memcg: do not try to drain per-cpu caches without pages Michal Hocko
                   ` (3 more replies)
  0 siblings, 4 replies; 41+ messages in thread
From: Michal Hocko @ 2011-07-22 12:43 UTC (permalink / raw)
  To: linux-mm; +Cc: Balbir Singh, KAMEZAWA Hiroyuki, Andrew Morton, linux-kernel

Hi,
this is the second version of the per-cpu charge draining code cleanup.
I have dropped the "fix unnecessary reclaim if there are still cached
charges" part because it seems to have some issues and it is not
critical at the moment.

I think the cleanup makes sense on its own.

Changes since v1:
- memcg: do not try to drain per-cpu caches without pages uses
  drain_local_stock for the current CPU
- added memcg: add mem_cgroup_same_or_subtree helper
- dropped "memcg: prevent from reclaiming if there are per-cpu cached
  charges" patch

Michal Hocko (4):
  memcg: do not try to drain per-cpu caches without pages
  memcg: unify sync and async per-cpu charge cache draining
  memcg: add mem_cgroup_same_or_subtree helper
  memcg: get rid of percpu_charge_mutex lock

 mm/memcontrol.c |  110 +++++++++++++++++++++++++++++++------------------------
 1 files changed, 62 insertions(+), 48 deletions(-)

-- 
1.7.5.4


* Re: [PATCH 1/4] memcg: do not try to drain per-cpu caches without pages
  2011-07-21  7:38 ` [PATCH 1/4] memcg: do not try to drain per-cpu caches without pages Michal Hocko
@ 2011-07-25  1:16     ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 41+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-07-25  1:16 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Balbir Singh, Andrew Morton, linux-kernel

On Thu, 21 Jul 2011 09:38:00 +0200
Michal Hocko <mhocko@suse.cz> wrote:

> drain_all_stock_async tries to reduce the work done on the work queue
> by excluding the current CPU, because it assumes that the calling
> context has already tried to charge from that cache and failed, so it
> must be empty already.
> While the assumption is correct, we can optimize it even more by
> checking the current number of pages in the cache. This will also spare
> the work on other CPUs whose stock is empty.
> For the current CPU we can simply call drain_local_stock rather than
> deferring it to the work queue.
> 
> [KAMEZAWA Hiroyuki - use drain_local_stock for current CPU optimization]
> Signed-off-by: Michal Hocko <mhocko@suse.cz>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>



* Re: [PATCH 4/4] memcg: get rid of percpu_charge_mutex lock
  2011-07-22 11:20 ` [PATCH 4/4] memcg: get rid of percpu_charge_mutex lock Michal Hocko
@ 2011-07-25  1:18     ` KAMEZAWA Hiroyuki
  2011-08-08 18:47     ` Johannes Weiner
  1 sibling, 0 replies; 41+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-07-25  1:18 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Balbir Singh, Andrew Morton, linux-kernel

On Fri, 22 Jul 2011 13:20:25 +0200
Michal Hocko <mhocko@suse.cz> wrote:

> percpu_charge_mutex protects against multiple simultaneous drainings of
> the per-cpu charge caches because we might otherwise end up with too many
> work items.
> At least this was the case until 26fe6168 (memcg: fix percpu cached
> charge draining frequency), when we introduced more targeted draining for
> the async mode.
> Now that the sync draining is targeted as well, we can safely remove the
> mutex because we will never send more work items than the current number
> of CPUs.
> The FLUSHING_CACHED_CHARGE bit protects against sending the same work
> multiple times, and the stock->nr_pages == 0 check protects against
> pointlessly sending a work item when there is obviously nothing to be
> done. This is of course racy, but we can live with it as the race window
> is really small (we would have to see FLUSHING_CACHED_CHARGE cleared
> while nr_pages is still non-zero).
> The only remaining place where we can race is the synchronous mode, where
> we rely on a FLUSHING_CACHED_CHARGE test which might have been set by
> another drainer on the same group, but we should wait in that case as
> well.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>




* Re: [PATCH 4/4] memcg: get rid of percpu_charge_mutex lock
  2011-07-22 11:20 ` [PATCH 4/4] memcg: get rid of percpu_charge_mutex lock Michal Hocko
@ 2011-08-08 18:47     ` Johannes Weiner
  2011-08-08 18:47     ` Johannes Weiner
  1 sibling, 0 replies; 41+ messages in thread
From: Johannes Weiner @ 2011-08-08 18:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Balbir Singh, KAMEZAWA Hiroyuki, Andrew Morton, linux-kernel

On Fri, Jul 22, 2011 at 01:20:25PM +0200, Michal Hocko wrote:
> percpu_charge_mutex protects against multiple simultaneous drainings of
> the per-cpu charge caches because we might otherwise end up with too many
> work items.
> At least this was the case until 26fe6168 (memcg: fix percpu cached
> charge draining frequency), when we introduced more targeted draining for
> the async mode.
> Now that the sync draining is targeted as well, we can safely remove the
> mutex because we will never send more work items than the current number
> of CPUs.
> The FLUSHING_CACHED_CHARGE bit protects against sending the same work
> multiple times, and the stock->nr_pages == 0 check protects against
> pointlessly sending a work item when there is obviously nothing to be
> done. This is of course racy, but we can live with it as the race window
> is really small (we would have to see FLUSHING_CACHED_CHARGE cleared
> while nr_pages is still non-zero).
> The only remaining place where we can race is the synchronous mode, where
> we rely on a FLUSHING_CACHED_CHARGE test which might have been set by
> another drainer on the same group, but we should wait in that case as
> well.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> ---
>  mm/memcontrol.c |   12 ++----------
>  1 files changed, 2 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 3685107..f8463a0 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2071,7 +2071,6 @@ struct memcg_stock_pcp {
>  #define FLUSHING_CACHED_CHARGE	(0)
>  };
>  static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
> -static DEFINE_MUTEX(percpu_charge_mutex);
>  
>  /*
>   * Try to consume stocked charge on this cpu. If success, one page is consumed
> @@ -2178,7 +2177,8 @@ static void drain_all_stock(struct mem_cgroup *root_mem, bool sync)
>  
>  	for_each_online_cpu(cpu) {
>  		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> -		if (test_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
> +		if (mem_cgroup_same_or_subtree(root_mem, stock->cached) &&
> +				test_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
>  			flush_work(&stock->work);
>  	}
>  out:

This hunk triggers a crash for me, as the draining is already done and
stock->cached reset to NULL when dereferenced here.  Oops is attached.

We have this loop in drain_all_stock():

	for_each_online_cpu(cpu) {
		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
		struct mem_cgroup *mem;

		mem = stock->cached;
		if (!mem || !stock->nr_pages)
			continue;
		if (!mem_cgroup_same_or_subtree(root_mem, mem))
			continue;
		if (!test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
			if (cpu == curcpu)
				drain_local_stock(&stock->work);
			else
				schedule_work_on(cpu, &stock->work);
		}
	}

The only thing that stabilizes stock->cached is the knowledge that
there are still pages accounted to the memcg.

Without the mutex serializing this code, can't there be a concurrent
execution that leads to stock->cached being drained, becoming empty
and freed by someone else between the stock->nr_pages check and the
ancestor check, resulting in use after free?

What makes stock->cached safe to dereference?
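
One possible interleaving, sketched only for illustration (I did not
trace this exact sequence):

	drain_all_stock()                     cpu X: local drain, then rmdir
	-----------------                     ------------------------------
	mem = stock->cached;
	if (!mem || !stock->nr_pages)
		/* passes, pages still there */
	                                      drain_stock(): stock->cached = NULL
	                                      last reference dropped, memcg freed
	mem_cgroup_same_or_subtree(root_mem, mem)
	  -> css_is_ancestor(&mem->css, ...)  /* mem points to freed memory */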

[ 2313.442944] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
[ 2313.443935] IP: [<ffffffff81083b70>] css_is_ancestor+0x20/0x70
[ 2313.443935] PGD 4ae7a067 PUD 4adc4067 PMD 0
[ 2313.443935] Oops: 0000 [#1] PREEMPT SMP
[ 2313.443935] CPU 0
[ 2313.443935] Pid: 19677, comm: rmdir Tainted: G        W   3.0.0-mm1-00188-gf38d32b #35 ECS MCP61M-M3/MCP61M-M3
[ 2313.443935] RIP: 0010:[<ffffffff81083b70>]  [<ffffffff81083b70>] css_is_ancestor+0x20/0x70
[ 2313.443935] RSP: 0018:ffff880077b09c88  EFLAGS: 00010202
[ 2313.443935] RAX: ffff8800781bb310 RBX: 0000000000000000 RCX: 000000000000003e
[ 2313.443935] RDX: 0000000000000000 RSI: ffff8800779f7c00 RDI: 0000000000000000
[ 2313.443935] RBP: ffff880077b09c98 R08: ffffffff818a4e88 R09: 0000000000000000
[ 2313.443935] R10: 0000000000000000 R11: dead000000100100 R12: ffff8800779f7c00
[ 2313.443935] R13: ffff8800779f7c00 R14: 0000000000000000 R15: ffff88007bc0eb80
[ 2313.443935] FS:  00007f5d689ec720(0000) GS:ffff88007bc00000(0000) knlGS:0000000000000000
[ 2313.443935] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2313.443935] CR2: 0000000000000018 CR3: 000000004ad57000 CR4: 00000000000006f0
[ 2313.443935] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2313.443935] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2313.443935] Process rmdir (pid: 19677, threadinfo ffff880077b08000, task ffff8800781bb310)
[ 2313.443935] Stack:
[ 2313.443935]  ffffffff818a4e88 000000000000eb80 ffff880077b09ca8 ffffffff810feba3
[ 2313.443935]  ffff880077b09d08 ffffffff810feccf ffff880077b09cf8 0000000000000001
[ 2313.443935]  ffff88007bd0eb80 0000000000000001 ffff880077af2000 0000000000000000
[ 2313.443935] Call Trace:
[ 2313.443935]  [<ffffffff810feba3>] mem_cgroup_same_or_subtree+0x33/0x40
[ 2313.443935]  [<ffffffff810feccf>] drain_all_stock+0x11f/0x170
[ 2313.443935]  [<ffffffff81103211>] mem_cgroup_force_empty+0x231/0x6d0
[ 2313.443935]  [<ffffffff81111872>] ? path_put+0x22/0x30
[ 2313.443935]  [<ffffffff8111c925>] ? __d_lookup+0xb5/0x170
[ 2313.443935]  [<ffffffff811036c4>] mem_cgroup_pre_destroy+0x14/0x20
[ 2313.443935]  [<ffffffff81080559>] cgroup_rmdir+0xb9/0x500
[ 2313.443935]  [<ffffffff81063990>] ? abort_exclusive_wait+0xb0/0xb0
[ 2313.443935]  [<ffffffff81114d26>] vfs_rmdir+0x86/0xe0
[ 2313.443935]  [<ffffffff811233d3>] ? mnt_want_write+0x43/0x80
[ 2313.443935]  [<ffffffff81114e7b>] do_rmdir+0xfb/0x110
[ 2313.443935]  [<ffffffff81114ea6>] sys_rmdir+0x16/0x20
[ 2313.443935]  [<ffffffff8154d76b>] system_call_fastpath+0x16/0x1b
[ 2313.443935] Code: b7 42 0a 5d c3 66 0f 1f 44 00 00 55 48 89 e5 48 83 ec 10 48 89 5d f0 4c 89 65 f8 66 66 66 66 90 48 89 fb 49 89 f4 e8 10 85 00 00
[ 2313.443935]  8b 43 18 49 8b 54 24 18 48 85 d2 74 05 48 85 c0 75 15 31 db


* Re: [PATCH 4/4] memcg: get rid of percpu_charge_mutex lock
  2011-08-08 18:47     ` Johannes Weiner
@ 2011-08-08 21:47       ` Michal Hocko
  -1 siblings, 0 replies; 41+ messages in thread
From: Michal Hocko @ 2011-08-08 21:47 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Balbir Singh, KAMEZAWA Hiroyuki, Andrew Morton, linux-kernel

On Mon 08-08-11 20:47:38, Johannes Weiner wrote:
[...]
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2071,7 +2071,6 @@ struct memcg_stock_pcp {
> >  #define FLUSHING_CACHED_CHARGE	(0)
> >  };
> >  static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
> > -static DEFINE_MUTEX(percpu_charge_mutex);
> >  
> >  /*
> >   * Try to consume stocked charge on this cpu. If success, one page is consumed
> > @@ -2178,7 +2177,8 @@ static void drain_all_stock(struct mem_cgroup *root_mem, bool sync)
> >  
> >  	for_each_online_cpu(cpu) {
> >  		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> > -		if (test_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
> > +		if (mem_cgroup_same_or_subtree(root_mem, stock->cached) &&
> > +				test_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
> >  			flush_work(&stock->work);
> >  	}
> >  out:
> 
> This hunk triggers a crash for me, as the draining is already done and
> stock->cached reset to NULL when dereferenced here.  Oops is attached.

Thanks for catching this. We are racing a synchronous drain from
force_empty against an async drain from reclaim, I guess. The sync side
checked whether it should wait for the work while the cache got drained
and set to NULL.
First of all we must not dereference the cached mem without
FLUSHING_CACHED_CHARGE bit test. We have to be sure that there is some
draining on that cache. stock->cached is set to NULL before we clear the
bit (I guess we need to add a barrier into drain_local_stock). So we should
see mem either as NULL or still valid (I have to think some more about
"still valid" part - maybe we will need rcu_read_lock).

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f4ec4e7..626c916 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2197,8 +2197,10 @@ static void drain_all_stock(struct mem_cgroup *root_mem, bool sync)
 
 	for_each_online_cpu(cpu) {
 		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
-		if (mem_cgroup_same_or_subtree(root_mem, stock->cached) &&
-				test_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
+		struct mem_cgroup *mem = stock->cached;
+		if (test_bit(FLUSHING_CACHED_CHARGE, &stock->flags) &&
+				 mem && mem_cgroup_same_or_subtree(root_mem, mem)
+				)
 			flush_work(&stock->work);
 	}
 out:
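
The barrier I have in mind would go into drain_local_stock, something
along these lines (untested sketch, smp_mb__before_clear_bit is just the
first primitive that comes to mind):

	static void drain_local_stock(struct work_struct *dummy)
	{
		struct memcg_stock_pcp *stock = &__get_cpu_var(memcg_stock);

		drain_stock(stock);	/* resets stock->cached to NULL */
		/* make the NULL store visible before the flag is cleared */
		smp_mb__before_clear_bit();
		clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
	}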

> 
> We have this loop in drain_all_stock():
> 
> 	for_each_online_cpu(cpu) {
> 		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> 		struct mem_cgroup *mem;
> 
> 		mem = stock->cached;
> 		if (!mem || !stock->nr_pages)
> 			continue;
> 		if (!mem_cgroup_same_or_subtree(root_mem, mem))
> 			continue;
> 		if (!test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
> 			if (cpu == curcpu)
> 				drain_local_stock(&stock->work);
> 			else
> 				schedule_work_on(cpu, &stock->work);
> 		}
> 	}
> 
> The only thing that stabilizes stock->cached is the knowledge that
> there are still pages accounted to the memcg.

Yes you are right we have to set FLUSHING_CACHED_CHARGE before nr_pages
check (and do the appropriate cleanup on the continue paths). This looks
quite ugly, though.

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f4ec4e7..eca46141 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2179,17 +2179,23 @@ static void drain_all_stock(struct mem_cgroup *root_mem, bool sync)
 		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
 		struct mem_cgroup *mem;
 
+		/* Try to lock the cache */
+		if(test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
+			continue;
+
 		mem = stock->cached;
 		if (!mem || !stock->nr_pages)
-			continue;
+			goto unlock_cache;
 		if (!mem_cgroup_same_or_subtree(root_mem, mem))
-			continue;
-		if (!test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
-			if (cpu == curcpu)
-				drain_local_stock(&stock->work);
-			else
-				schedule_work_on(cpu, &stock->work);
-		}
+			goto unlock_cache;
+
+		if (cpu == curcpu)
+			drain_local_stock(&stock->work);
+		else
+			schedule_work_on(cpu, &stock->work);
+		continue;
+unlock_cache:
+		clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);

                ^^^^^
		need a barrier?
 	}
 
 	if (!sync)
 
> Without the mutex serializing this code, can't there be a concurrent
> execution that leads to stock->cached being drained, becoming empty
> and freed by someone else between the stock->nr_pages check and the
> ancestor check, resulting in use after free?
> 
> What makes stock->cached safe to dereference?

We are using FLUSHING_CACHED_CHARGE as a lock for local draining. I
guess it should be sufficient.

The mutex which was used previously made async draining exclusive, so a
root_mem which potentially has many relevant caches had to back off
because another mem wanted to clear the cache on the same CPU.

I will think about this tomorrow (with fresh eyes). I think we should be
able to do without the mutex.

Anyway thanks for the really good report!

> 
> [ 2313.442944] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
> [ 2313.443935] IP: [<ffffffff81083b70>] css_is_ancestor+0x20/0x70
> [ 2313.443935] PGD 4ae7a067 PUD 4adc4067 PMD 0
> [ 2313.443935] Oops: 0000 [#1] PREEMPT SMP
> [ 2313.443935] CPU 0
> [ 2313.443935] Pid: 19677, comm: rmdir Tainted: G        W   3.0.0-mm1-00188-gf38d32b #35 ECS MCP61M-M3/MCP61M-M3
> [ 2313.443935] RIP: 0010:[<ffffffff81083b70>]  [<ffffffff81083b70>] css_is_ancestor+0x20/0x70
> [ 2313.443935] RSP: 0018:ffff880077b09c88  EFLAGS: 00010202
> [ 2313.443935] RAX: ffff8800781bb310 RBX: 0000000000000000 RCX: 000000000000003e
> [ 2313.443935] RDX: 0000000000000000 RSI: ffff8800779f7c00 RDI: 0000000000000000
> [ 2313.443935] RBP: ffff880077b09c98 R08: ffffffff818a4e88 R09: 0000000000000000
> [ 2313.443935] R10: 0000000000000000 R11: dead000000100100 R12: ffff8800779f7c00
> [ 2313.443935] R13: ffff8800779f7c00 R14: 0000000000000000 R15: ffff88007bc0eb80
> [ 2313.443935] FS:  00007f5d689ec720(0000) GS:ffff88007bc00000(0000) knlGS:0000000000000000
> [ 2313.443935] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 2313.443935] CR2: 0000000000000018 CR3: 000000004ad57000 CR4: 00000000000006f0
> [ 2313.443935] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 2313.443935] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 2313.443935] Process rmdir (pid: 19677, threadinfo ffff880077b08000, task ffff8800781bb310)
> [ 2313.443935] Stack:
> [ 2313.443935]  ffffffff818a4e88 000000000000eb80 ffff880077b09ca8 ffffffff810feba3
> [ 2313.443935]  ffff880077b09d08 ffffffff810feccf ffff880077b09cf8 0000000000000001
> [ 2313.443935]  ffff88007bd0eb80 0000000000000001 ffff880077af2000 0000000000000000
> [ 2313.443935] Call Trace:
> [ 2313.443935]  [<ffffffff810feba3>] mem_cgroup_same_or_subtree+0x33/0x40
> [ 2313.443935]  [<ffffffff810feccf>] drain_all_stock+0x11f/0x170
> [ 2313.443935]  [<ffffffff81103211>] mem_cgroup_force_empty+0x231/0x6d0
> [ 2313.443935]  [<ffffffff81111872>] ? path_put+0x22/0x30
> [ 2313.443935]  [<ffffffff8111c925>] ? __d_lookup+0xb5/0x170
> [ 2313.443935]  [<ffffffff811036c4>] mem_cgroup_pre_destroy+0x14/0x20
> [ 2313.443935]  [<ffffffff81080559>] cgroup_rmdir+0xb9/0x500
> [ 2313.443935]  [<ffffffff81063990>] ? abort_exclusive_wait+0xb0/0xb0
> [ 2313.443935]  [<ffffffff81114d26>] vfs_rmdir+0x86/0xe0
> [ 2313.443935]  [<ffffffff811233d3>] ? mnt_want_write+0x43/0x80
> [ 2313.443935]  [<ffffffff81114e7b>] do_rmdir+0xfb/0x110
> [ 2313.443935]  [<ffffffff81114ea6>] sys_rmdir+0x16/0x20
> [ 2313.443935]  [<ffffffff8154d76b>] system_call_fastpath+0x16/0x1b
> [ 2313.443935] Code: b7 42 0a 5d c3 66 0f 1f 44 00 00 55 48 89 e5 48 83 ec 10 48 89 5d f0 4c 89 65 f8 66 66 66 66 90 48 89 fb 49 89 f4 e8 10 85 00 00
> [ 2313.443935]  8b 43 18 49 8b 54 24 18 48 85 d2 74 05 48 85 c0 75 15 31 db
> 

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic


* Re: [PATCH 4/4] memcg: get rid of percpu_charge_mutex lock
  2011-08-08 21:47       ` Michal Hocko
@ 2011-08-08 23:19         ` Johannes Weiner
  -1 siblings, 0 replies; 41+ messages in thread
From: Johannes Weiner @ 2011-08-08 23:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Balbir Singh, KAMEZAWA Hiroyuki, Andrew Morton, linux-kernel

On Mon, Aug 08, 2011 at 11:47:04PM +0200, Michal Hocko wrote:
> On Mon 08-08-11 20:47:38, Johannes Weiner wrote:
> [...]
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -2071,7 +2071,6 @@ struct memcg_stock_pcp {
> > >  #define FLUSHING_CACHED_CHARGE	(0)
> > >  };
> > >  static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
> > > -static DEFINE_MUTEX(percpu_charge_mutex);
> > >  
> > >  /*
> > >   * Try to consume stocked charge on this cpu. If success, one page is consumed
> > > @@ -2178,7 +2177,8 @@ static void drain_all_stock(struct mem_cgroup *root_mem, bool sync)
> > >  
> > >  	for_each_online_cpu(cpu) {
> > >  		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> > > -		if (test_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
> > > +		if (mem_cgroup_same_or_subtree(root_mem, stock->cached) &&
> > > +				test_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
> > >  			flush_work(&stock->work);
> > >  	}
> > >  out:
> > 
> > This hunk triggers a crash for me, as the draining is already done and
> > stock->cached reset to NULL when dereferenced here.  Oops is attached.
> 
> Thanks for catching this. We are racing a synchronous drain from
> force_empty against an async drain from reclaim, I guess.

It's at the end of a benchmark where several tasks delete the cgroups.
There is no reclaim going on anymore, it must be several sync drains
from force_empty of different memcgs.

> The sync side checked whether it should wait for the work while the
> cache got drained and set to NULL.  First of all we must not dereference the
> cached mem without FLUSHING_CACHED_CHARGE bit test. We have to be
> sure that there is some draining on that cache. stock->cached is set
> to NULL before we clear the bit (I guess we need to add a barrier
> into drain_local_stock). So we should see mem either as NULL or
> still valid (I have to think some more about "still valid" part -
> maybe we will need rcu_read_lock).
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f4ec4e7..626c916 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2197,8 +2197,10 @@ static void drain_all_stock(struct mem_cgroup *root_mem, bool sync)
>  
>  	for_each_online_cpu(cpu) {
>  		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> -		if (mem_cgroup_same_or_subtree(root_mem, stock->cached) &&
> -				test_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
> +		struct mem_cgroup *mem = stock->cached;
> +		if (test_bit(FLUSHING_CACHED_CHARGE, &stock->flags) &&
> +				 mem && mem_cgroup_same_or_subtree(root_mem, mem)
> +				)
>  			flush_work(&stock->work);
>  	}
>  out:

This ordering makes sure that mem is a sensible pointer, but it still
does not pin the object, *mem, which could go away the femtosecond
after the test_bit succeeds.

> > We have this loop in drain_all_stock():
> > 
> > 	for_each_online_cpu(cpu) {
> > 		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> > 		struct mem_cgroup *mem;
> > 
> > 		mem = stock->cached;
> > 		if (!mem || !stock->nr_pages)
> > 			continue;
> > 		if (!mem_cgroup_same_or_subtree(root_mem, mem))
> > 			continue;
> > 		if (!test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
> > 			if (cpu == curcpu)
> > 				drain_local_stock(&stock->work);
> > 			else
> > 				schedule_work_on(cpu, &stock->work);
> > 		}
> > 	}
> > 
> > The only thing that stabilizes stock->cached is the knowledge that
> > there are still pages accounted to the memcg.
> 
> Yes you are right we have to set FLUSHING_CACHED_CHARGE before nr_pages
> check (and do the appropriate cleanup on the continue paths). This looks
> quite ugly, though.
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f4ec4e7..eca46141 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2179,17 +2179,23 @@ static void drain_all_stock(struct mem_cgroup *root_mem, bool sync)
>  		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
>  		struct mem_cgroup *mem;
>  
> +		/* Try to lock the cache */
> +		if(test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
> +			continue;
> +
>  		mem = stock->cached;
>  		if (!mem || !stock->nr_pages)
> -			continue;
> +			goto unlock_cache;
>  		if (!mem_cgroup_same_or_subtree(root_mem, mem))
> -			continue;
> -		if (!test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
> -			if (cpu == curcpu)
> -				drain_local_stock(&stock->work);
> -			else
> -				schedule_work_on(cpu, &stock->work);
> -		}
> +			goto unlock_cache;

So one thread locks the cache, recognizes stock->cached is not in the
right hierarchy and unlocks again.  While the cache was locked, a
concurrent drainer with the right hierarchy skipped the stock because
it was locked.  That doesn't sound right.

But yes, we probably need exclusive access to the stock state.

> +
> +		if (cpu == curcpu)
> +			drain_local_stock(&stock->work);
> +		else
> +			schedule_work_on(cpu, &stock->work);
> +		continue;
> +unlock_cache:
> +		clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
> 
>                 ^^^^^
> 		need a barrier?
>  	}
>  
>  	if (!sync)
>  
> > Without the mutex serializing this code, can't there be a concurrent
> > execution that leads to stock->cached being drained, becoming empty
> > and freed by someone else between the stock->nr_pages check and the
> > ancestor check, resulting in use after free?
> > 
> > What makes stock->cached safe to dereference?
> 
> We are using FLUSHING_CACHED_CHARGE as a lock for local draining. I
> guess it should be sufficient.
> 
> The mutex which was used previously made the async draining exclusive,
> so a root_mem that potentially has many relevant caches had to back off
> because another mem wanted to clear the cache on the same CPU.

It's now replaced by what is essentially a per-stock bit-spinlock that
is always trylocked.

Would it make sense to promote it to a real spinlock?  Draining a
stock is pretty fast, there should be minimal lock hold times, but we
would still avoid that tiny race window where we would skip otherwise
correct stocks just because they are locked.
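
I am thinking of something along these lines (completely untested
sketch; "lock" would be a new spinlock_t member of memcg_stock_pcp and
drain_local_stock() would have to drop it once the stock is drained):

	for_each_online_cpu(cpu) {
		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);

		spin_lock(&stock->lock);	/* hypothetical per-stock lock */
		if (!stock->cached || !stock->nr_pages ||
		    !mem_cgroup_same_or_subtree(root_mem, stock->cached)) {
			spin_unlock(&stock->lock);
			continue;
		}
		if (cpu == curcpu)
			drain_local_stock(&stock->work);	/* unlocks */
		else
			schedule_work_on(cpu, &stock->work);	/* worker unlocks */
	}

That would keep the checks and the decision to drain in one critical
section, at the cost of asymmetric lock/unlock.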

> I will think about this tomorrow (with fresh eyes). I think we should be
> able to be without mutex.

The problem is that we have a multi-op atomic section, so we can not
go lockless.  We can read the stock state just fine, and order
accesses to different members so that we get a coherent image.  But
there is still nothing that pins the charge to the memcg, and thus
nothing that stabilizes *stock->cached.

I agree that we can probably do better than a global lock, though.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/4] memcg: get rid of percpu_charge_mutex lock
  2011-08-08 23:19         ` Johannes Weiner
@ 2011-08-09  7:26           ` Michal Hocko
  -1 siblings, 0 replies; 41+ messages in thread
From: Michal Hocko @ 2011-08-09  7:26 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Balbir Singh, KAMEZAWA Hiroyuki, Andrew Morton, linux-kernel

On Tue 09-08-11 01:19:12, Johannes Weiner wrote:
> On Mon, Aug 08, 2011 at 11:47:04PM +0200, Michal Hocko wrote:
> > On Mon 08-08-11 20:47:38, Johannes Weiner wrote:
> > [...]
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -2071,7 +2071,6 @@ struct memcg_stock_pcp {
> > > >  #define FLUSHING_CACHED_CHARGE	(0)
> > > >  };
> > > >  static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
> > > > -static DEFINE_MUTEX(percpu_charge_mutex);
> > > >  
> > > >  /*
> > > >   * Try to consume stocked charge on this cpu. If success, one page is consumed
> > > > @@ -2178,7 +2177,8 @@ static void drain_all_stock(struct mem_cgroup *root_mem, bool sync)
> > > >  
> > > >  	for_each_online_cpu(cpu) {
> > > >  		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> > > > -		if (test_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
> > > > +		if (mem_cgroup_same_or_subtree(root_mem, stock->cached) &&
> > > > +				test_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
> > > >  			flush_work(&stock->work);
> > > >  	}
> > > >  out:
> > > 
> > > This hunk triggers a crash for me, as the draining is already done and
> > > stock->cached reset to NULL when dereferenced here.  Oops is attached.
> > 
> > Thanks for catching this. We are racing synchronous drain from
> > force_empty and async drain from reclaim, I guess.
> 
> It's at the end of a benchmark where several tasks delete the cgroups.
> There is no reclaim going on anymore, it must be several sync drains
> from force_empty of different memcgs.

I am afraid it is worse than that. The sync. drainer can kick itself quite
trivially just by doing the local draining. Then stock->cached is guaranteed
to be NULL when we reach this...
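
I.e., roughly this flow (simplified):

	drain_all_stock(root_mem, true)
	  drain_local_stock()			/* cpu == curcpu in the first loop */
	    drain_stock()			/* stock->cached = NULL */
	  ...
	  /* sync wait loop */
	  mem_cgroup_same_or_subtree(root_mem, stock->cached /* NULL */)
	    css_is_ancestor()			/* NULL pointer dereference */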

[...]
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index f4ec4e7..626c916 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2197,8 +2197,10 @@ static void drain_all_stock(struct mem_cgroup *root_mem, bool sync)
> >  
> >  	for_each_online_cpu(cpu) {
> >  		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> > -		if (mem_cgroup_same_or_subtree(root_mem, stock->cached) &&
> > -				test_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
> > +		struct mem_cgroup *mem = stock->cached;
> > +		if (test_bit(FLUSHING_CACHED_CHARGE, &stock->flags) &&
> > +				 mem && mem_cgroup_same_or_subtree(root_mem, mem)
> > +				)
> >  			flush_work(&stock->work);
> >  	}
> >  out:
> 
> This ordering makes sure that mem is a sensible pointer, but it still
> does not pin the object, *mem, which could go away the femtosecond
> after the test_bit succeeds.

Yes there are possible races
	CPU0			CPU1				CPU2
mem = stock->cached
			stock->cached = NULL
			clear_bit
							    test_and_set_bit
test_bit()
<preempted>		...
			mem_cgroup_destroy
use after free

The race is not very probable because we do quite a bunch of work
before we can deallocate mem. mem_cgroup_destroy is called after
synchronize_rcu, so we can close the race by taking rcu_read_lock, right?
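
Something like this in the sync wait loop is what I mean (completely
untested, just to illustrate the rcu part):

	for_each_online_cpu(cpu) {
		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
		struct mem_cgroup *mem;
		bool wait = false;

		rcu_read_lock();
		mem = stock->cached;
		if (test_bit(FLUSHING_CACHED_CHARGE, &stock->flags) &&
		    mem && mem_cgroup_same_or_subtree(root_mem, mem))
			wait = true;
		rcu_read_unlock();

		if (wait)
			flush_work(&stock->work);
	}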


[...]
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index f4ec4e7..eca46141 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2179,17 +2179,23 @@ static void drain_all_stock(struct mem_cgroup *root_mem, bool sync)
> >  		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> >  		struct mem_cgroup *mem;
> >  
> > +		/* Try to lock the cache */
> > +		if(test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
> > +			continue;
> > +
> >  		mem = stock->cached;
> >  		if (!mem || !stock->nr_pages)
> > -			continue;
> > +			goto unlock_cache;
> >  		if (!mem_cgroup_same_or_subtree(root_mem, mem))
> > -			continue;
> > -		if (!test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
> > -			if (cpu == curcpu)
> > -				drain_local_stock(&stock->work);
> > -			else
> > -				schedule_work_on(cpu, &stock->work);
> > -		}
> > +			goto unlock_cache;
> 
> So one thread locks the cache, recognizes stock->cached is not in the
> right hierarchy and unlocks again.  While the cache was locked, a
> concurrent drainer with the right hierarchy skipped the stock because
> it was locked.  That doesn't sound right.

You are right. We have to make it less parallel.

> 
> But yes, we probably need exclusive access to the stock state.
> 
> > +
> > +		if (cpu == curcpu)
> > +			drain_local_stock(&stock->work);
> > +		else
> > +			schedule_work_on(cpu, &stock->work);
> > +		continue;
> > +unlock_cache:
> > +		clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
> > 
> >                 ^^^^^
> > 		need a barrier?
> >  	}
> >  
> >  	if (!sync)
> >  
> > > Without the mutex serializing this code, can't there be a concurrent
> > > execution that leads to stock->cached being drained, becoming empty
> > > and freed by someone else between the stock->nr_pages check and the
> > > ancestor check, resulting in use after free?
> > > 
> > > What makes stock->cached safe to dereference?
> > 
> > We are using FLUSHING_CACHED_CHARGE as a lock for local draining. I
> > guess it should be sufficient.
> > 
> > The mutex which was used previously made the async draining exclusive,
> > so a root_mem that potentially has many relevant caches had to back off
> > because another mem wanted to clear the cache on the same CPU.
> 
> It's now replaced by what is essentially a per-stock bit-spinlock that
> is always trylocked.

Yes.

> 
> Would it make sense to promote it to a real spinlock?  Draining a
> stock is pretty fast, there should be minimal lock hold times, but we
> would still avoid that tiny race window where we would skip otherwise
> correct stocks just because they are locked.

Yes, a per-stock spinlock will be the easiest way, because if we tried
other games with atomics we would end up with a more complicated
refill_stock which can race with draining as well.

> > I will think about this tomorrow (with fresh eyes). I think we should be
> > able to be without mutex.
> 
> The problem is that we have a multi-op atomic section, so we can not
> go lockless.  We can read the stock state just fine, and order
> accesses to different members so that we get a coherent image.  But
> there is still nothing that pins the charge to the memcg, and thus
> nothing that stabilizes *stock->cached.
> 
> I agree that we can probably do better than a global lock, though.

Fully agreed. I will send a patch after I give it some testing.

Thanks!
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH RFC] memcg: fix drain_all_stock crash
  2011-08-09  7:26           ` Michal Hocko
@ 2011-08-09  9:31             ` Michal Hocko
  -1 siblings, 0 replies; 41+ messages in thread
From: Michal Hocko @ 2011-08-09  9:31 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-mm, Balbir Singh, KAMEZAWA Hiroyuki, Andrew Morton, linux-kernel

What do you think about the half baked patch below? I didn't manage to
test it yet but I guess it should help. I hate the asymmetry of the
drain_lock locking (it is acquired in one place but released somewhere
else). I will think about a nicer way to do it.
Maybe I should also split the rcu part into a separate patch.

What do you think?
---
From 26c2cdc55aa14ec4a54e9c8e2c8b9072c7cb8e28 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.cz>
Date: Tue, 9 Aug 2011 10:53:28 +0200
Subject: [PATCH] memcg: fix drain_all_stock crash

8521fc50 (memcg: get rid of percpu_charge_mutex lock) introduced a crash
in sync mode when we are about to check whether we have to wait for the
work because we are calling mem_cgroup_same_or_subtree without checking
FLUSHING_CACHED_CHARGE before so we can dereference already cleaned
cache (the simplest case would be when we drain the local cache).

BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
IP: [<ffffffff81083b70>] css_is_ancestor+0x20/0x70
PGD 4ae7a067 PUD 4adc4067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
CPU 0
Pid: 19677, comm: rmdir Tainted: G        W   3.0.0-mm1-00188-gf38d32b #35 ECS MCP61M-M3/MCP61M-M3
RIP: 0010:[<ffffffff81083b70>]  [<ffffffff81083b70>] css_is_ancestor+0x20/0x70
RSP: 0018:ffff880077b09c88  EFLAGS: 00010202
RAX: ffff8800781bb310 RBX: 0000000000000000 RCX: 000000000000003e
RDX: 0000000000000000 RSI: ffff8800779f7c00 RDI: 0000000000000000
RBP: ffff880077b09c98 R08: ffffffff818a4e88 R09: 0000000000000000
R10: 0000000000000000 R11: dead000000100100 R12: ffff8800779f7c00
R13: ffff8800779f7c00 R14: 0000000000000000 R15: ffff88007bc0eb80
FS:  00007f5d689ec720(0000) GS:ffff88007bc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000018 CR3: 000000004ad57000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process rmdir (pid: 19677, threadinfo ffff880077b08000, task ffff8800781bb310)
Stack:
 ffffffff818a4e88 000000000000eb80 ffff880077b09ca8 ffffffff810feba3
 ffff880077b09d08 ffffffff810feccf ffff880077b09cf8 0000000000000001
 ffff88007bd0eb80 0000000000000001 ffff880077af2000 0000000000000000
Call Trace:
 [<ffffffff810feba3>] mem_cgroup_same_or_subtree+0x33/0x40
 [<ffffffff810feccf>] drain_all_stock+0x11f/0x170
 [<ffffffff81103211>] mem_cgroup_force_empty+0x231/0x6d0
 [<ffffffff81111872>] ? path_put+0x22/0x30
 [<ffffffff8111c925>] ? __d_lookup+0xb5/0x170
 [<ffffffff811036c4>] mem_cgroup_pre_destroy+0x14/0x20
 [<ffffffff81080559>] cgroup_rmdir+0xb9/0x500
 [<ffffffff81063990>] ? abort_exclusive_wait+0xb0/0xb0
 [<ffffffff81114d26>] vfs_rmdir+0x86/0xe0
 [<ffffffff811233d3>] ? mnt_want_write+0x43/0x80
 [<ffffffff81114e7b>] do_rmdir+0xfb/0x110
 [<ffffffff81114ea6>] sys_rmdir+0x16/0x20
 [<ffffffff8154d76b>] system_call_fastpath+0x16/0x1b

Testing FLUSHING_CACHED_CHARGE before dereferencing is still not enough
because then we still might see mem == NULL so we have to check it
before dereferencing.
We have to do all stock checking under FLUSHING_CACHED_CHARGE bit lock
so it is much easier to use a spin_lock instead. Let's also add a flag
(under_drain) that draining in progress so that concurrent callers do
not have to wait on the lock pointlessly.

Finally we do not make sure that the mem still exists. It could have
been removed in the meantime:
	CPU0			CPU1			     CPU2
mem=stock->cached
stock->cached=NULL
			      clear_bit
							test_and_set_bit
test_bit()		      ...
<preempted>		mem_cgroup_destroy
use after free

`...' is actually quite a bunch of work to do so the race is not very
probable. The important thing, though, is that cgroup_subsys->destroy
(mem_cgroup_destroy) is called after synchronize_rcu so we can protect
by calling rcu_read_lock when dereferencing cached mem.

TODO:
- check if under_drain needs some memory barriers
- check the hotplug path (can we wait on spinlock?)
- better changelog
- do some testing

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Johannes Weiner <jweiner@redhat.com>
---
 mm/memcontrol.c |   67 +++++++++++++++++++++++++++++++++++++++++-------------
 1 files changed, 51 insertions(+), 16 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f4ec4e7..e34f9fd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2087,8 +2087,8 @@ struct memcg_stock_pcp {
 	struct mem_cgroup *cached; /* this never be root cgroup */
 	unsigned int nr_pages;
 	struct work_struct work;
-	unsigned long flags;
-#define FLUSHING_CACHED_CHARGE	(0)
+	spinlock_t drain_lock; /* protects from parallel draining */
+	bool under_drain;
 };
 static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
 
@@ -2114,6 +2114,7 @@ static bool consume_stock(struct mem_cgroup *mem)
 
 /*
  * Returns stocks cached in percpu to res_counter and reset cached information.
+ * Do not call this directly - use drain_local_stock instead.
  */
 static void drain_stock(struct memcg_stock_pcp *stock)
 {
@@ -2133,12 +2134,16 @@ static void drain_stock(struct memcg_stock_pcp *stock)
 /*
  * This must be called under preempt disabled or must be called by
  * a thread which is pinned to local cpu.
+ * Parameter is not used.
+ * Assumes stock->drain_lock held.
  */
 static void drain_local_stock(struct work_struct *dummy)
 {
 	struct memcg_stock_pcp *stock = &__get_cpu_var(memcg_stock);
 	drain_stock(stock);
-	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
+
+	stock->under_drain = false;
+	spin_unlock(&stock->drain_lock);
 }
 
 /*
@@ -2150,7 +2155,9 @@ static void refill_stock(struct mem_cgroup *mem, unsigned int nr_pages)
 	struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock);
 
 	if (stock->cached != mem) { /* reset if necessary */
-		drain_stock(stock);
+		spin_lock(&stock->drain_lock);
+		stock->under_drain = true;
+		drain_local_stock(NULL);
 		stock->cached = mem;
 	}
 	stock->nr_pages += nr_pages;
@@ -2179,17 +2186,27 @@ static void drain_all_stock(struct mem_cgroup *root_mem, bool sync)
 		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
 		struct mem_cgroup *mem;
 
+		/*
+		 * make sure we are not waiting when somebody already drains
+		 * the cache.
+		 */
+		if (!spin_trylock(&stock->drain_lock)) {
+			if (stock->under_drain)
+				continue;
+			spin_lock(&stock->drain_lock);
+		}
 		mem = stock->cached;
-		if (!mem || !stock->nr_pages)
+		if (!mem || !stock->nr_pages ||
+				!mem_cgroup_same_or_subtree(root_mem, mem)) {
+			spin_unlock(&stock->drain_lock);
 			continue;
-		if (!mem_cgroup_same_or_subtree(root_mem, mem))
-			continue;
-		if (!test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
-			if (cpu == curcpu)
-				drain_local_stock(&stock->work);
-			else
-				schedule_work_on(cpu, &stock->work);
 		}
+
+		stock->under_drain = true;
+		if (cpu == curcpu)
+			drain_local_stock(&stock->work);
+		else
+			schedule_work_on(cpu, &stock->work);
 	}
 
 	if (!sync)
@@ -2197,8 +2214,20 @@ static void drain_all_stock(struct mem_cgroup *root_mem, bool sync)
 
 	for_each_online_cpu(cpu) {
 		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
-		if (mem_cgroup_same_or_subtree(root_mem, stock->cached) &&
-				test_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
+		struct mem_cgroup *mem;
+		bool wait_for_drain = false;
+
+		/*
+		 * we have to be careful about parallel group destroying
+		 * (mem_cgroup_destroy) which is deferred until after synchronize_rcu
+		 */
+		rcu_read_lock();
+		mem = stock->cached;
+		wait_for_drain = stock->under_drain &&
+			mem && mem_cgroup_same_or_subtree(root_mem, mem);
+		rcu_read_unlock();
+
+		if (wait_for_drain)
 			flush_work(&stock->work);
 	}
 out:
@@ -2278,8 +2307,12 @@ static int __cpuinit memcg_cpu_hotplug_callback(struct notifier_block *nb,
 	for_each_mem_cgroup_all(iter)
 		mem_cgroup_drain_pcp_counter(iter, cpu);
 
-	stock = &per_cpu(memcg_stock, cpu);
-	drain_stock(stock);
+	if (!spin_trylock(&stock->drain_lock)) {
+		if (stock->under_drain)
+			return NOTIFY_OK;
+		spin_lock(&stock->drain_lock);
+	}
+	drain_local_stock(NULL);
 	return NOTIFY_OK;
 }
 
@@ -5068,6 +5101,8 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 			struct memcg_stock_pcp *stock =
 						&per_cpu(memcg_stock, cpu);
 			INIT_WORK(&stock->work, drain_local_stock);
+			stock->under_drain = false;
+			spin_lock_init(&stock->drain_lock);
 		}
 		hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
 	} else {
-- 
1.7.5.4


-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC] memcg: fix drain_all_stock crash
  2011-08-09  9:31             ` Michal Hocko
@ 2011-08-09  9:32               ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 41+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-09  9:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, linux-mm, Balbir Singh, Andrew Morton, linux-kernel

On Tue, 9 Aug 2011 11:31:50 +0200
Michal Hocko <mhocko@suse.cz> wrote:

> What do you think about the half baked patch below? I didn't manage to
> test it yet but I guess it should help. I hate the asymmetry of the
> drain_lock locking (it is acquired in one place but released somewhere
> else). I will think about a nicer way to do it.
> Maybe I should also split the rcu part into a separate patch.
> 
> What do you think?


I'd like to revert 8521fc50 first and consider total design change
rather than ad-hoc fix.

Personally, I don't like to have spin-lock in per-cpu area.


Thanks,
-Kame

> ---
> From 26c2cdc55aa14ec4a54e9c8e2c8b9072c7cb8e28 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.cz>
> Date: Tue, 9 Aug 2011 10:53:28 +0200
> Subject: [PATCH] memcg: fix drain_all_stock crash
> 
> 8521fc50 (memcg: get rid of percpu_charge_mutex lock) introduced a crash
> in sync mode when we are about to check whether we have to wait for the
> work because we are calling mem_cgroup_same_or_subtree without checking
> FLUSHING_CACHED_CHARGE before so we can dereference already cleaned
> cache (the simplest case would be when we drain the local cache).
> 
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
> IP: [<ffffffff81083b70>] css_is_ancestor+0x20/0x70
> PGD 4ae7a067 PUD 4adc4067 PMD 0
> Oops: 0000 [#1] PREEMPT SMP
> CPU 0
> Pid: 19677, comm: rmdir Tainted: G        W   3.0.0-mm1-00188-gf38d32b #35 ECS MCP61M-M3/MCP61M-M3
> RIP: 0010:[<ffffffff81083b70>]  [<ffffffff81083b70>] css_is_ancestor+0x20/0x70
> RSP: 0018:ffff880077b09c88  EFLAGS: 00010202
> RAX: ffff8800781bb310 RBX: 0000000000000000 RCX: 000000000000003e
> RDX: 0000000000000000 RSI: ffff8800779f7c00 RDI: 0000000000000000
> RBP: ffff880077b09c98 R08: ffffffff818a4e88 R09: 0000000000000000
> R10: 0000000000000000 R11: dead000000100100 R12: ffff8800779f7c00
> R13: ffff8800779f7c00 R14: 0000000000000000 R15: ffff88007bc0eb80
> FS:  00007f5d689ec720(0000) GS:ffff88007bc00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000000000000018 CR3: 000000004ad57000 CR4: 00000000000006f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process rmdir (pid: 19677, threadinfo ffff880077b08000, task ffff8800781bb310)
> Stack:
>  ffffffff818a4e88 000000000000eb80 ffff880077b09ca8 ffffffff810feba3
>  ffff880077b09d08 ffffffff810feccf ffff880077b09cf8 0000000000000001
>  ffff88007bd0eb80 0000000000000001 ffff880077af2000 0000000000000000
> Call Trace:
>  [<ffffffff810feba3>] mem_cgroup_same_or_subtree+0x33/0x40
>  [<ffffffff810feccf>] drain_all_stock+0x11f/0x170
>  [<ffffffff81103211>] mem_cgroup_force_empty+0x231/0x6d0
>  [<ffffffff81111872>] ? path_put+0x22/0x30
>  [<ffffffff8111c925>] ? __d_lookup+0xb5/0x170
>  [<ffffffff811036c4>] mem_cgroup_pre_destroy+0x14/0x20
>  [<ffffffff81080559>] cgroup_rmdir+0xb9/0x500
>  [<ffffffff81063990>] ? abort_exclusive_wait+0xb0/0xb0
>  [<ffffffff81114d26>] vfs_rmdir+0x86/0xe0
>  [<ffffffff811233d3>] ? mnt_want_write+0x43/0x80
>  [<ffffffff81114e7b>] do_rmdir+0xfb/0x110
>  [<ffffffff81114ea6>] sys_rmdir+0x16/0x20
>  [<ffffffff8154d76b>] system_call_fastpath+0x16/0x1b
> 
> Testing FLUSHING_CACHED_CHARGE before dereferencing is still not enough
> because then we still might see mem == NULL so we have to check it
> before dereferencing.
> We have to do all stock checking under FLUSHING_CACHED_CHARGE bit lock
> so it is much easier to use a spin_lock instead. Let's also add a flag
> (under_drain) that draining in progress so that concurrent callers do
> not have to wait on the lock pointlessly.
> 
> Finally we do not make sure that the mem still exists. It could have
> been removed in the meantime:
> 	CPU0			CPU1			     CPU2
> mem=stock->cached
> stock->cached=NULL
> 			      clear_bit
> 							test_and_set_bit
> test_bit()		      ...
> <preempted>		mem_cgroup_destroy
> use after free
> 
> `...' is actually quite a bunch of work to do so the race is not very
> probable. The important thing, though, is that cgroup_subsys->destroy
> (mem_cgroup_destroy) is called after synchronize_rcu so we can protect
> by calling rcu_read_lock when dereferencing cached mem.
> 
> TODO:
> - check if under_drain needs some memory barriers
> - check the hotplug path (can we wait on spinlock?)
> - better changelog
> - do some testing
> 
> Signed-off-by: Michal Hocko <mhocko@suse.cz>
> Reported-by: Johannes Weiner <jweiner@redhat.com>
> ---
>  mm/memcontrol.c |   67 +++++++++++++++++++++++++++++++++++++++++-------------
>  1 files changed, 51 insertions(+), 16 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index f4ec4e7..e34f9fd 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2087,8 +2087,8 @@ struct memcg_stock_pcp {
>  	struct mem_cgroup *cached; /* this never be root cgroup */
>  	unsigned int nr_pages;
>  	struct work_struct work;
> -	unsigned long flags;
> -#define FLUSHING_CACHED_CHARGE	(0)
> +	spinlock_t drain_lock; /* protects from parallel draining */
> +	bool under_drain;
>  };
>  static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
>  
> @@ -2114,6 +2114,7 @@ static bool consume_stock(struct mem_cgroup *mem)
>  
>  /*
>   * Returns stocks cached in percpu to res_counter and reset cached information.
> + * Do not call this directly - use drain_local_stock instead.
>   */
>  static void drain_stock(struct memcg_stock_pcp *stock)
>  {
> @@ -2133,12 +2134,16 @@ static void drain_stock(struct memcg_stock_pcp *stock)
>  /*
>   * This must be called under preempt disabled or must be called by
>   * a thread which is pinned to local cpu.
> + * Parameter is not used.
> + * Assumes stock->drain_lock held.
>   */
>  static void drain_local_stock(struct work_struct *dummy)
>  {
>  	struct memcg_stock_pcp *stock = &__get_cpu_var(memcg_stock);
>  	drain_stock(stock);
> -	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
> +
> +	stock->under_drain = false;
> +	spin_unlock(&stock->drain_lock);
>  }
>  
>  /*
> @@ -2150,7 +2155,9 @@ static void refill_stock(struct mem_cgroup *mem, unsigned int nr_pages)
>  	struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock);
>  
>  	if (stock->cached != mem) { /* reset if necessary */
> -		drain_stock(stock);
> +		spin_lock(&stock->drain_lock);
> +		stock->under_drain = true;
> +		drain_local_stock(NULL);
>  		stock->cached = mem;
>  	}
>  	stock->nr_pages += nr_pages;
> @@ -2179,17 +2186,27 @@ static void drain_all_stock(struct mem_cgroup *root_mem, bool sync)
>  		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
>  		struct mem_cgroup *mem;
>  
> +		/*
> +		 * make sure we are not waiting when somebody already drains
> +		 * the cache.
> +		 */
> +		if (!spin_trylock(&stock->drain_lock)) {
> +			if (stock->under_drain)
> +				continue;
> +			spin_lock(&stock->drain_lock);
> +		}
>  		mem = stock->cached;
> -		if (!mem || !stock->nr_pages)
> +		if (!mem || !stock->nr_pages ||
> +				!mem_cgroup_same_or_subtree(root_mem, mem)) {
> +			spin_unlock(&stock->drain_lock);
>  			continue;
> -		if (!mem_cgroup_same_or_subtree(root_mem, mem))
> -			continue;
> -		if (!test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
> -			if (cpu == curcpu)
> -				drain_local_stock(&stock->work);
> -			else
> -				schedule_work_on(cpu, &stock->work);
>  		}
> +
> +		stock->under_drain = true;
> +		if (cpu == curcpu)
> +			drain_local_stock(&stock->work);
> +		else
> +			schedule_work_on(cpu, &stock->work);
>  	}
>  
>  	if (!sync)
> @@ -2197,8 +2214,20 @@ static void drain_all_stock(struct mem_cgroup *root_mem, bool sync)
>  
>  	for_each_online_cpu(cpu) {
>  		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> -		if (mem_cgroup_same_or_subtree(root_mem, stock->cached) &&
> -				test_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
> +		struct mem_cgroup *mem;
> +		bool wait_for_drain = false;
> +
> +		/*
> +		 * we have to be careful about parallel group destroying
> +		 * (mem_cgroup_destroy) which is deferred until after synchronize_rcu
> +		 */
> +		rcu_read_lock();
> +		mem = stock->cached;
> +		wait_for_drain = stock->under_drain &&
> +			mem && mem_cgroup_same_or_subtree(root_mem, mem);
> +		rcu_read_unlock();
> +
> +		if (wait_for_drain)
>  			flush_work(&stock->work);
>  	}
>  out:
> @@ -2278,8 +2307,12 @@ static int __cpuinit memcg_cpu_hotplug_callback(struct notifier_block *nb,
>  	for_each_mem_cgroup_all(iter)
>  		mem_cgroup_drain_pcp_counter(iter, cpu);
>  
> -	stock = &per_cpu(memcg_stock, cpu);
> -	drain_stock(stock);
> +	if (!spin_trylock(&stock->drain_lock)) {
> +		if (stock->under_drain)
> +			return NOTIFY_OK;
> +		spin_lock(&stock->drain_lock);
> +	}
> +	drain_local_stock(NULL);
>  	return NOTIFY_OK;
>  }
>  
> @@ -5068,6 +5101,8 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  			struct memcg_stock_pcp *stock =
>  						&per_cpu(memcg_stock, cpu);
>  			INIT_WORK(&stock->work, drain_local_stock);
> +			stock->under_drain = false;
> +			spin_lock_init(&stock->drain_lock);
>  		}
>  		hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
>  	} else {
> -- 
> 1.7.5.4
> 
> 
> -- 
> Michal Hocko
> SUSE Labs
> SUSE LINUX s.r.o.
> Lihovarska 1060/12
> 190 00 Praha 9    
> Czech Republic
> 


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC] memcg: fix drain_all_stock crash
  2011-08-09  9:32               ` KAMEZAWA Hiroyuki
@ 2011-08-09  9:45                 ` Michal Hocko
  -1 siblings, 0 replies; 41+ messages in thread
From: Michal Hocko @ 2011-08-09  9:45 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Johannes Weiner, linux-mm, Balbir Singh, Andrew Morton, linux-kernel

On Tue 09-08-11 18:32:16, KAMEZAWA Hiroyuki wrote:
> On Tue, 9 Aug 2011 11:31:50 +0200
> Michal Hocko <mhocko@suse.cz> wrote:
> 
> > What do you think about the half baked patch below? I didn't manage to
> > test it yet but I guess it should help. I hate the asymmetry of the
> > drain_lock locking (it is acquired in one place but released somewhere
> > else). I will think about a nicer way to do it.
> > Maybe I should also split the rcu part into a separate patch.
> > 
> > What do you think?
> 
> 
> I'd like to revert 8521fc50 first and consider total design change
> rather than ad-hoc fix.

Agreed. The revert should go into 3.0 stable as well. Although the global
mutex is buggy, we have had that behavior for a long time without any
reports. We should address it but it can wait for 3.2.

> Personally, I don't like to have spin-lock in per-cpu area.

spinlock is not that different from what we already have with the bit
lock.

> 
> 
> Thanks,
> -Kame
> 
> > ---
> > From 26c2cdc55aa14ec4a54e9c8e2c8b9072c7cb8e28 Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mhocko@suse.cz>
> > Date: Tue, 9 Aug 2011 10:53:28 +0200
> > Subject: [PATCH] memcg: fix drain_all_stock crash
> > 
> > 8521fc50 (memcg: get rid of percpu_charge_mutex lock) introduced a crash
> > in sync mode when we are about to check whether we have to wait for the
> > work because we are calling mem_cgroup_same_or_subtree without checking
> > FLUSHING_CACHED_CHARGE before so we can dereference already cleaned
> > cache (the simplest case would be when we drain the local cache).
> > 
> > BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
> > IP: [<ffffffff81083b70>] css_is_ancestor+0x20/0x70
> > PGD 4ae7a067 PUD 4adc4067 PMD 0
> > Oops: 0000 [#1] PREEMPT SMP
> > CPU 0
> > Pid: 19677, comm: rmdir Tainted: G        W   3.0.0-mm1-00188-gf38d32b #35 ECS MCP61M-M3/MCP61M-M3
> > RIP: 0010:[<ffffffff81083b70>]  [<ffffffff81083b70>] css_is_ancestor+0x20/0x70
> > RSP: 0018:ffff880077b09c88  EFLAGS: 00010202
> > RAX: ffff8800781bb310 RBX: 0000000000000000 RCX: 000000000000003e
> > RDX: 0000000000000000 RSI: ffff8800779f7c00 RDI: 0000000000000000
> > RBP: ffff880077b09c98 R08: ffffffff818a4e88 R09: 0000000000000000
> > R10: 0000000000000000 R11: dead000000100100 R12: ffff8800779f7c00
> > R13: ffff8800779f7c00 R14: 0000000000000000 R15: ffff88007bc0eb80
> > FS:  00007f5d689ec720(0000) GS:ffff88007bc00000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > CR2: 0000000000000018 CR3: 000000004ad57000 CR4: 00000000000006f0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Process rmdir (pid: 19677, threadinfo ffff880077b08000, task ffff8800781bb310)
> > Stack:
> >  ffffffff818a4e88 000000000000eb80 ffff880077b09ca8 ffffffff810feba3
> >  ffff880077b09d08 ffffffff810feccf ffff880077b09cf8 0000000000000001
> >  ffff88007bd0eb80 0000000000000001 ffff880077af2000 0000000000000000
> > Call Trace:
> >  [<ffffffff810feba3>] mem_cgroup_same_or_subtree+0x33/0x40
> >  [<ffffffff810feccf>] drain_all_stock+0x11f/0x170
> >  [<ffffffff81103211>] mem_cgroup_force_empty+0x231/0x6d0
> >  [<ffffffff81111872>] ? path_put+0x22/0x30
> >  [<ffffffff8111c925>] ? __d_lookup+0xb5/0x170
> >  [<ffffffff811036c4>] mem_cgroup_pre_destroy+0x14/0x20
> >  [<ffffffff81080559>] cgroup_rmdir+0xb9/0x500
> >  [<ffffffff81063990>] ? abort_exclusive_wait+0xb0/0xb0
> >  [<ffffffff81114d26>] vfs_rmdir+0x86/0xe0
> >  [<ffffffff811233d3>] ? mnt_want_write+0x43/0x80
> >  [<ffffffff81114e7b>] do_rmdir+0xfb/0x110
> >  [<ffffffff81114ea6>] sys_rmdir+0x16/0x20
> >  [<ffffffff8154d76b>] system_call_fastpath+0x16/0x1b
> > 
> > Testing FLUSHING_CACHED_CHARGE before dereferencing is still not enough
> > because then we still might see mem == NULL so we have to check it
> > before dereferencing.
> > We have to do all stock checking under FLUSHING_CACHED_CHARGE bit lock
> > so it is much easier to use a spin_lock instead. Let's also add a flag
> > (under_drain) that draining in progress so that concurrent callers do
> > not have to wait on the lock pointlessly.
> > 
> > Finally we do not make sure that the mem still exists. It could have
> > been removed in the meantime:
> > 	CPU0			CPU1			     CPU2
> > mem=stock->cached
> > stock->cached=NULL
> > 			      clear_bit
> > 							test_and_set_bit
> > test_bit()		      ...
> > <preempted>		mem_cgroup_destroy
> > use after free
> > 
> > `...' is actually quite a bunch of work to do so the race is not very
> > probable. The important thing, though, is that cgroup_subsys->destroy
> > (mem_cgroup_destroy) is called after synchronize_rcu so we can protect
> > by calling rcu_read_lock when dereferencing cached mem.
> > 
> > TODO:
> > - check if under_drain needs some memory barriers
> > - check the hotplug path (can we wait on spinlock?)
> > - better changelog
> > - do some testing
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.cz>
> > Reported-by: Johannes Weiner <jweiner@redhat.com>
> > ---
> >  mm/memcontrol.c |   67 +++++++++++++++++++++++++++++++++++++++++-------------
> >  1 files changed, 51 insertions(+), 16 deletions(-)
> > 
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index f4ec4e7..e34f9fd 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2087,8 +2087,8 @@ struct memcg_stock_pcp {
> >  	struct mem_cgroup *cached; /* this never be root cgroup */
> >  	unsigned int nr_pages;
> >  	struct work_struct work;
> > -	unsigned long flags;
> > -#define FLUSHING_CACHED_CHARGE	(0)
> > +	spinlock_t drain_lock; /* protects from parallel draining */
> > +	bool under_drain;
> >  };
> >  static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
> >  
> > @@ -2114,6 +2114,7 @@ static bool consume_stock(struct mem_cgroup *mem)
> >  
> >  /*
> >   * Returns stocks cached in percpu to res_counter and reset cached information.
> > + * Do not call this directly - use drain_local_stock instead.
> >   */
> >  static void drain_stock(struct memcg_stock_pcp *stock)
> >  {
> > @@ -2133,12 +2134,16 @@ static void drain_stock(struct memcg_stock_pcp *stock)
> >  /*
> >   * This must be called under preempt disabled or must be called by
> >   * a thread which is pinned to local cpu.
> > + * Parameter is not used.
> > + * Assumes stock->drain_lock held.
> >   */
> >  static void drain_local_stock(struct work_struct *dummy)
> >  {
> >  	struct memcg_stock_pcp *stock = &__get_cpu_var(memcg_stock);
> >  	drain_stock(stock);
> > -	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
> > +
> > +	stock->under_drain = false;
> > +	spin_unlock(&stock->drain_lock);
> >  }
> >  
> >  /*
> > @@ -2150,7 +2155,9 @@ static void refill_stock(struct mem_cgroup *mem, unsigned int nr_pages)
> >  	struct memcg_stock_pcp *stock = &get_cpu_var(memcg_stock);
> >  
> >  	if (stock->cached != mem) { /* reset if necessary */
> > -		drain_stock(stock);
> > +		spin_lock(&stock->drain_lock);
> > +		stock->under_drain = true;
> > +		drain_local_stock(NULL);
> >  		stock->cached = mem;
> >  	}
> >  	stock->nr_pages += nr_pages;
> > @@ -2179,17 +2186,27 @@ static void drain_all_stock(struct mem_cgroup *root_mem, bool sync)
> >  		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> >  		struct mem_cgroup *mem;
> >  
> > +		/*
> > +		 * make sure we are not waiting when somebody already drains
> > +		 * the cache.
> > +		 */
> > +		if (!spin_trylock(&stock->drain_lock)) {
> > +			if (stock->under_drain)
> > +				continue;
> > +			spin_lock(&stock->drain_lock);
> > +		}
> >  		mem = stock->cached;
> > -		if (!mem || !stock->nr_pages)
> > +		if (!mem || !stock->nr_pages ||
> > +				!mem_cgroup_same_or_subtree(root_mem, mem)) {
> > +			spin_unlock(&stock->drain_lock);
> >  			continue;
> > -		if (!mem_cgroup_same_or_subtree(root_mem, mem))
> > -			continue;
> > -		if (!test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
> > -			if (cpu == curcpu)
> > -				drain_local_stock(&stock->work);
> > -			else
> > -				schedule_work_on(cpu, &stock->work);
> >  		}
> > +
> > +		stock->under_drain = true;
> > +		if (cpu == curcpu)
> > +			drain_local_stock(&stock->work);
> > +		else
> > +			schedule_work_on(cpu, &stock->work);
> >  	}
> >  
> >  	if (!sync)
> > @@ -2197,8 +2214,20 @@ static void drain_all_stock(struct mem_cgroup *root_mem, bool sync)
> >  
> >  	for_each_online_cpu(cpu) {
> >  		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
> > -		if (mem_cgroup_same_or_subtree(root_mem, stock->cached) &&
> > -				test_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
> > +		struct mem_cgroup *mem;
> > +		bool wait_for_drain = false;
> > +
> > +		/*
> > +		 * we have to be careful about parallel group destroying
> > +		 * (mem_cgroup_destroy) which is derefered after sychronize_rcu
> > +		 */
> > +		rcu_read_lock();
> > +		mem = stock->cached;
> > +		wait_for_drain = stock->under_drain &&
> > +			mem && mem_cgroup_same_or_subtree(root_mem, mem);
> > +		rcu_read_unlock();
> > +
> > +		if (wait_for_drain)
> >  			flush_work(&stock->work);
> >  	}
> >  out:
> > @@ -2278,8 +2307,12 @@ static int __cpuinit memcg_cpu_hotplug_callback(struct notifier_block *nb,
> >  	for_each_mem_cgroup_all(iter)
> >  		mem_cgroup_drain_pcp_counter(iter, cpu);
> >  
> > -	stock = &per_cpu(memcg_stock, cpu);
> > -	drain_stock(stock);
> > +	if (!spin_trylock(&stock->drain_lock)) {
> > +		if (stock->under_drain)
> > +			return NOTIFY_OK;
> > +		spin_lock(&stock->drain_lock);
> > +	}
> > +	drain_local_stock(NULL);
> >  	return NOTIFY_OK;
> >  }
> >  
> > @@ -5068,6 +5101,8 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> >  			struct memcg_stock_pcp *stock =
> >  						&per_cpu(memcg_stock, cpu);
> >  			INIT_WORK(&stock->work, drain_local_stock);
> > +			stock->under_drain = false;
> > +			spin_lock_init(&stock->drain_lock);
> >  		}
> >  		hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
> >  	} else {
> > -- 
> > 1.7.5.4
> > 
> > 
> > -- 
> > Michal Hocko
> > SUSE Labs
> > SUSE LINUX s.r.o.
> > Lihovarska 1060/12
> > 190 00 Praha 9    
> > Czech Republic
> > 
> 

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC] memcg: fix drain_all_stock crash
  2011-08-09  9:45                 ` Michal Hocko
@ 2011-08-09  9:53                   ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 41+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-09  9:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, linux-mm, Balbir Singh, Andrew Morton, linux-kernel

On Tue, 9 Aug 2011 11:45:03 +0200
Michal Hocko <mhocko@suse.cz> wrote:

> On Tue 09-08-11 18:32:16, KAMEZAWA Hiroyuki wrote:
> > On Tue, 9 Aug 2011 11:31:50 +0200
> > Michal Hocko <mhocko@suse.cz> wrote:
> > 
> > > What do you think about the half backed patch bellow? I didn't manage to
> > > test it yet but I guess it should help. I hate asymmetry of drain_lock
> > > locking (it is acquired somewhere else than it is released which is
> > > not). I will think about a nicer way how to do it.
> > > Maybe I should also split the rcu part in a separate patch.
> > > 
> > > What do you think?
> > 
> > 
> > I'd like to revert 8521fc50 first and consider total design change
> > rather than ad-hoc fix.
> 
> Agreed. Revert should go into 3.0 stable as well. Although the global
> mutex is buggy we have that behavior for a long time without any reports.
> We should address it but it can wait for 3.2.
> 

What "buggy" means here ? "problematic" or "cause OOps ?"

> > Personally, I don't like to have spin-lock in per-cpu area.
> 
> spinlock is not that different from what we already have with the bit
> lock.

Maybe. The best would be a lockless style... but the pointer in the
per-cpu cache is the problem.
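
A condensed sketch of that difficulty (illustration only, not a proposal):

	/* a remote CPU could claim the cached pointer atomically... */
	struct mem_cgroup *mem = xchg(&stock->cached, NULL);
	/* ...but nr_pages is a separate field which the owning CPU may
	 * still be updating, so the (pointer, count) pair cannot be taken
	 * in a single lockless step without further tricks. */
	unsigned int nr_pages = stock->nr_pages;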

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC] memcg: fix drain_all_stock crash
  2011-08-09 10:09                     ` Michal Hocko
@ 2011-08-09 10:07                       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 41+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-09 10:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, linux-mm, Balbir Singh, Andrew Morton, linux-kernel

On Tue, 9 Aug 2011 12:09:44 +0200
Michal Hocko <mhocko@suse.cz> wrote:

> On Tue 09-08-11 18:53:13, KAMEZAWA Hiroyuki wrote:
> > On Tue, 9 Aug 2011 11:45:03 +0200
> > Michal Hocko <mhocko@suse.cz> wrote:
> > 
> > > On Tue 09-08-11 18:32:16, KAMEZAWA Hiroyuki wrote:
> > > > On Tue, 9 Aug 2011 11:31:50 +0200
> > > > Michal Hocko <mhocko@suse.cz> wrote:
> > > > 
> > > > > What do you think about the half backed patch bellow? I didn't manage to
> > > > > test it yet but I guess it should help. I hate asymmetry of drain_lock
> > > > > locking (it is acquired somewhere else than it is released which is
> > > > > not). I will think about a nicer way how to do it.
> > > > > Maybe I should also split the rcu part in a separate patch.
> > > > > 
> > > > > What do you think?
> > > > 
> > > > 
> > > > I'd like to revert 8521fc50 first and consider total design change
> > > > rather than ad-hoc fix.
> > > 
> > > Agreed. Revert should go into 3.0 stable as well. Although the global
> > > mutex is buggy we have that behavior for a long time without any reports.
> > > We should address it but it can wait for 3.2.
> 
> I will send the revert request to Linus.
> 
> > What "buggy" means here ? "problematic" or "cause OOps ?"
> 
> I have described that in an earlier email. Consider pathological case
> when CPU0 wants to async. drain a memcg which has a lot of cached charges while
> CPU1 is already draining so it holds the mutex. CPU0 backs off so it has
> to reclaim although we could prevent from it by getting rid of cached
> charges. This is not critical though.
> 

That problem should be fixed by background reclaim.
I'll do it after fixing numascan (and the dirty-ratio problem...).

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC] memcg: fix drain_all_stock crash
  2011-08-09  9:53                   ` KAMEZAWA Hiroyuki
@ 2011-08-09 10:09                     ` Michal Hocko
  -1 siblings, 0 replies; 41+ messages in thread
From: Michal Hocko @ 2011-08-09 10:09 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Johannes Weiner, linux-mm, Balbir Singh, Andrew Morton, linux-kernel

On Tue 09-08-11 18:53:13, KAMEZAWA Hiroyuki wrote:
> On Tue, 9 Aug 2011 11:45:03 +0200
> Michal Hocko <mhocko@suse.cz> wrote:
> 
> > On Tue 09-08-11 18:32:16, KAMEZAWA Hiroyuki wrote:
> > > On Tue, 9 Aug 2011 11:31:50 +0200
> > > Michal Hocko <mhocko@suse.cz> wrote:
> > > 
> > > > What do you think about the half backed patch bellow? I didn't manage to
> > > > test it yet but I guess it should help. I hate asymmetry of drain_lock
> > > > locking (it is acquired somewhere else than it is released which is
> > > > not). I will think about a nicer way how to do it.
> > > > Maybe I should also split the rcu part in a separate patch.
> > > > 
> > > > What do you think?
> > > 
> > > 
> > > I'd like to revert 8521fc50 first and consider total design change
> > > rather than ad-hoc fix.
> > 
> > Agreed. Revert should go into 3.0 stable as well. Although the global
> > mutex is buggy we have that behavior for a long time without any reports.
> > We should address it but it can wait for 3.2.

I will send the revert request to Linus.

> What "buggy" means here ? "problematic" or "cause OOps ?"

I have described that in an earlier email. Consider the pathological case
where CPU0 wants to asynchronously drain a memcg which has a lot of cached
charges while CPU1 is already draining and therefore holds the mutex. CPU0
backs off, so it has to reclaim although we could have prevented that by
getting rid of the cached charges. This is not critical though.
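
To make the back-off concrete, this is roughly what happens with the
global mutex (a condensed sketch, not the complete function):

	/* drain_all_stock_async() style back-off, condensed */
	if (!mutex_trylock(&percpu_charge_mutex))
		return;	/* CPU1 holds it, so CPU0 gives up; direct reclaim follows */
	/* ... otherwise schedule the per-cpu drain works ... */
	mutex_unlock(&percpu_charge_mutex);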

> 
> > > Personally, I don't like to have spin-lock in per-cpu area.
> > 
> > spinlock is not that different from what we already have with the bit
> > lock.
> 
> maybe. The best is lockless style...but pointer in percpu cache is problem..
> 
> Thanks,
> -Kame

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC] memcg: fix drain_all_stock crash
  2011-08-09 10:07                       ` KAMEZAWA Hiroyuki
@ 2011-08-09 11:46                         ` Michal Hocko
  -1 siblings, 0 replies; 41+ messages in thread
From: Michal Hocko @ 2011-08-09 11:46 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Johannes Weiner, linux-mm, Balbir Singh, Andrew Morton, linux-kernel

On Tue 09-08-11 19:07:25, KAMEZAWA Hiroyuki wrote:
> On Tue, 9 Aug 2011 12:09:44 +0200
> Michal Hocko <mhocko@suse.cz> wrote:
> 
> > On Tue 09-08-11 18:53:13, KAMEZAWA Hiroyuki wrote:
> > > On Tue, 9 Aug 2011 11:45:03 +0200
> > > Michal Hocko <mhocko@suse.cz> wrote:
> > > 
> > > > On Tue 09-08-11 18:32:16, KAMEZAWA Hiroyuki wrote:
> > > > > On Tue, 9 Aug 2011 11:31:50 +0200
> > > > > Michal Hocko <mhocko@suse.cz> wrote:
> > > > > 
> > > > > > What do you think about the half backed patch bellow? I didn't manage to
> > > > > > test it yet but I guess it should help. I hate asymmetry of drain_lock
> > > > > > locking (it is acquired somewhere else than it is released which is
> > > > > > not). I will think about a nicer way how to do it.
> > > > > > Maybe I should also split the rcu part in a separate patch.
> > > > > > 
> > > > > > What do you think?
> > > > > 
> > > > > 
> > > > > I'd like to revert 8521fc50 first and consider total design change
> > > > > rather than ad-hoc fix.
> > > > 
> > > > Agreed. Revert should go into 3.0 stable as well. Although the global
> > > > mutex is buggy we have that behavior for a long time without any reports.
> > > > We should address it but it can wait for 3.2.
> > 
> > I will send the revert request to Linus.
> > 
> > > What "buggy" means here ? "problematic" or "cause OOps ?"
> > 
> > I have described that in an earlier email. Consider pathological case
> > when CPU0 wants to async. drain a memcg which has a lot of cached charges while
> > CPU1 is already draining so it holds the mutex. CPU0 backs off so it has
> > to reclaim although we could prevent from it by getting rid of cached
> > charges. This is not critical though.
> > 
> 
> That problem should be fixed by background reclaim.

How? Do you plan to rework locking or the charge caching completely?

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC] memcg: fix drain_all_stock crash
  2011-08-09 11:46                         ` Michal Hocko
@ 2011-08-09 23:54                           ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 41+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-09 23:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, linux-mm, Balbir Singh, Andrew Morton, linux-kernel

On Tue, 9 Aug 2011 13:46:42 +0200
Michal Hocko <mhocko@suse.cz> wrote:

> On Tue 09-08-11 19:07:25, KAMEZAWA Hiroyuki wrote:
> > On Tue, 9 Aug 2011 12:09:44 +0200
> > Michal Hocko <mhocko@suse.cz> wrote:
> > 
> > > On Tue 09-08-11 18:53:13, KAMEZAWA Hiroyuki wrote:
> > > > On Tue, 9 Aug 2011 11:45:03 +0200
> > > > Michal Hocko <mhocko@suse.cz> wrote:
> > > > 
> > > > > On Tue 09-08-11 18:32:16, KAMEZAWA Hiroyuki wrote:
> > > > > > On Tue, 9 Aug 2011 11:31:50 +0200
> > > > > > Michal Hocko <mhocko@suse.cz> wrote:
> > > > > > 
> > > > > > > What do you think about the half backed patch bellow? I didn't manage to
> > > > > > > test it yet but I guess it should help. I hate asymmetry of drain_lock
> > > > > > > locking (it is acquired somewhere else than it is released which is
> > > > > > > not). I will think about a nicer way how to do it.
> > > > > > > Maybe I should also split the rcu part in a separate patch.
> > > > > > > 
> > > > > > > What do you think?
> > > > > > 
> > > > > > 
> > > > > > I'd like to revert 8521fc50 first and consider total design change
> > > > > > rather than ad-hoc fix.
> > > > > 
> > > > > Agreed. Revert should go into 3.0 stable as well. Although the global
> > > > > mutex is buggy we have that behavior for a long time without any reports.
> > > > > We should address it but it can wait for 3.2.
> > > 
> > > I will send the revert request to Linus.
> > > 
> > > > What "buggy" means here ? "problematic" or "cause OOps ?"
> > > 
> > > I have described that in an earlier email. Consider pathological case
> > > when CPU0 wants to async. drain a memcg which has a lot of cached charges while
> > > CPU1 is already draining so it holds the mutex. CPU0 backs off so it has
> > > to reclaim although we could prevent from it by getting rid of cached
> > > charges. This is not critical though.
> > > 
> > 
> > That problem should be fixed by background reclaim.
> 
> How? Do you plan to rework locking or the charge caching completely?
> 

From your description, the problem is not the lock itself but that a task
may go into _unnecessary_ direct reclaim even if there are remaining
charges in the per-cpu stocks, which causes latency.

In (all) my automatic background reclaim tests, no direct reclaim happens
if background reclaim is enabled. And as I said before, we may be able to
add a flag not to cache any more. It would be set by some condition... e.g.
when usage gets near the limit.
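
A rough sketch of that idea -- the no_pcp_cache flag is made up here just
to illustrate, it does not exist in struct mem_cgroup:

	/* hypothetical check at the top of refill_stock() */
	if (mem->no_pcp_cache) {	/* set when usage nears the limit */
		/* give the charge straight back instead of stocking it;
		 * the real thing would have to handle memsw as well */
		res_counter_uncharge(&mem->res, nr_pages * PAGE_SIZE);
		return;
	}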

Thanks,
-Kame








^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC] memcg: fix drain_all_stock crash
  2011-08-09 23:54                           ` KAMEZAWA Hiroyuki
@ 2011-08-10  7:29                             ` Michal Hocko
  -1 siblings, 0 replies; 41+ messages in thread
From: Michal Hocko @ 2011-08-10  7:29 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Johannes Weiner, linux-mm, Balbir Singh, Andrew Morton, linux-kernel

On Wed 10-08-11 08:54:37, KAMEZAWA Hiroyuki wrote:
> On Tue, 9 Aug 2011 13:46:42 +0200
> Michal Hocko <mhocko@suse.cz> wrote:
> 
> > On Tue 09-08-11 19:07:25, KAMEZAWA Hiroyuki wrote:
> > > On Tue, 9 Aug 2011 12:09:44 +0200
> > > Michal Hocko <mhocko@suse.cz> wrote:
> > > 
> > > > On Tue 09-08-11 18:53:13, KAMEZAWA Hiroyuki wrote:
> > > > > On Tue, 9 Aug 2011 11:45:03 +0200
> > > > > Michal Hocko <mhocko@suse.cz> wrote:
> > > > > 
> > > > > > On Tue 09-08-11 18:32:16, KAMEZAWA Hiroyuki wrote:
> > > > > > > On Tue, 9 Aug 2011 11:31:50 +0200
> > > > > > > Michal Hocko <mhocko@suse.cz> wrote:
> > > > > > > 
> > > > > > > > What do you think about the half backed patch bellow? I didn't manage to
> > > > > > > > test it yet but I guess it should help. I hate asymmetry of drain_lock
> > > > > > > > locking (it is acquired somewhere else than it is released which is
> > > > > > > > not). I will think about a nicer way how to do it.
> > > > > > > > Maybe I should also split the rcu part in a separate patch.
> > > > > > > > 
> > > > > > > > What do you think?
> > > > > > > 
> > > > > > > 
> > > > > > > I'd like to revert 8521fc50 first and consider total design change
> > > > > > > rather than ad-hoc fix.
> > > > > > 
> > > > > > Agreed. Revert should go into 3.0 stable as well. Although the global
> > > > > > mutex is buggy we have that behavior for a long time without any reports.
> > > > > > We should address it but it can wait for 3.2.
> > > > 
> > > > I will send the revert request to Linus.
> > > > 
> > > > > What "buggy" means here ? "problematic" or "cause OOps ?"
> > > > 
> > > > I have described that in an earlier email. Consider pathological case
> > > > when CPU0 wants to async. drain a memcg which has a lot of cached charges while
> > > > CPU1 is already draining so it holds the mutex. CPU0 backs off so it has
> > > > to reclaim although we could prevent from it by getting rid of cached
> > > > charges. This is not critical though.
> > > > 
> > > 
> > > That problem should be fixed by background reclaim.
> > 
> > How? Do you plan to rework locking or the charge caching completely?
> > 
> 
> From your description, the problem is not the lock itself but a task
> may go into _unnecessary_ direct-reclaim even if there are remaining
> chages on per-cpu stocks, which cause latency.

The problem is partly the lock because it prevents parallel async
reclaimers. This is too restrictive. If we are going to rely on async
draining we will have to deal with the above problem.

> 
> In (all) my automatic background reclaim tests, no direct reclaim happens
> if background reclaim is enabled. 

Yes, I haven't seen this problem yet, but I guess there is a non-zero
chance that some workload triggers it.

> And as I said before, we may be able to add a flag not to cache
> more. It's set by some condition ....as usage is near to the limit.
> 
> Thanks,
> -Kame

Thanks
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [patch] memcg: pin execution to current cpu while draining stock
  2011-07-25  1:16     ` KAMEZAWA Hiroyuki
@ 2011-08-17 19:49       ` Johannes Weiner
  -1 siblings, 0 replies; 41+ messages in thread
From: Johannes Weiner @ 2011-08-17 19:49 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Michal Hocko, linux-mm, Balbir Singh, Andrew Morton, linux-kernel

Commit d1a05b6 'memcg: do not try to drain per-cpu caches without
pages' added a drain_local_stock() call to a preemptible section.

The draining task looks up the cpu-local stock twice to set the
draining-flag, then to drain the stock and clear the flag again.  If
the task is migrated to a different CPU in between, no one will clear
the flag on the first stock and it will be forever undrainable.  Its
charge cannot be recovered and the cgroup cannot be deleted anymore.

Properly pin the task to the executing CPU while draining stocks.

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
---
 mm/memcontrol.c |    9 ++-------
 1 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 697a1d5..e9b1206 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2085,13 +2085,7 @@ static void drain_all_stock(struct mem_cgroup *root_mem, bool sync)
 
 	/* Notify other cpus that system-wide "drain" is running */
 	get_online_cpus();
-	/*
-	 * Get a hint for avoiding draining charges on the current cpu,
-	 * which must be exhausted by our charging.  It is not required that
-	 * this be a precise check, so we use raw_smp_processor_id() instead of
-	 * getcpu()/putcpu().
-	 */
-	curcpu = raw_smp_processor_id();
+	curcpu = get_cpu();
 	for_each_online_cpu(cpu) {
 		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
 		struct mem_cgroup *mem;
@@ -2108,6 +2102,7 @@ static void drain_all_stock(struct mem_cgroup *root_mem, bool sync)
 				schedule_work_on(cpu, &stock->work);
 		}
 	}
+	put_cpu();
 
 	if (!sync)
 		goto out;
-- 
1.7.6

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [patch] memcg: pin execution to current cpu while draining stock
  2011-08-17 19:49       ` Johannes Weiner
@ 2011-08-18  0:24         ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 41+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-08-18  0:24 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, linux-mm, Balbir Singh, Andrew Morton, linux-kernel

On Wed, 17 Aug 2011 21:49:27 +0200
Johannes Weiner <jweiner@redhat.com> wrote:

> Commit d1a05b6 'memcg: do not try to drain per-cpu caches without
> pages' added a drain_local_stock() call to a preemptible section.
> 
> The draining task looks up the cpu-local stock twice to set the
> draining-flag, then to drain the stock and clear the flag again.  If
> the task is migrated to a different CPU in between, noone will clear
> the flag on the first stock and it will be forever undrainable.  Its
> charge can not be recovered and the cgroup can not be deleted anymore.
> 
> Properly pin the task to the executing CPU while draining stocks.
> 
> Signed-off-by: Johannes Weiner <jweiner@redhat.com>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com
> Cc: Michal Hocko <mhocko@suse.cz>

Thanks. I think Shaohua Li reported this.

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
==

http://www.spinics.net/lists/linux-mm/msg22635.html

I get below warning:
BUG: using smp_processor_id() in preemptible [00000000] code: bash/739
caller is drain_local_stock+0x1a/0x55
Pid: 739, comm: bash Tainted: G        W   3.0.0+ #255
Call Trace:
 [<ffffffff813435c6>] debug_smp_processor_id+0xc2/0xdc
 [<ffffffff8114ae9b>] drain_local_stock+0x1a/0x55
 [<ffffffff8114b076>] drain_all_stock+0x98/0x13a
 [<ffffffff8114f04c>] mem_cgroup_force_empty+0xa3/0x27a
 [<ffffffff8114ff1d>] ? sys_close+0x38/0x138
 [<ffffffff811a7631>] ? environ_read+0x1d/0x159
 [<ffffffff8114f253>] mem_cgroup_force_empty_write+0x17/0x19
 [<ffffffff810c72fb>] cgroup_file_write+0xa8/0xba
 [<ffffffff811522ce>] vfs_write+0xb3/0x138
 [<ffffffff81152416>] sys_write+0x4a/0x71
 [<ffffffff8114ffd5>] ? sys_close+0xf0/0x138
 [<ffffffff8176deab>] system_call_fastpath+0x16/0x1b

drain_local_stock() should be run with preempt disabled.
==
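
For reference, drain_local_stock() at that point looked roughly like
this (reconstructed sketch, not a verbatim copy of mm/memcontrol.c):

	static void drain_local_stock(struct work_struct *dummy)
	{
		/* the per-cpu access below assumes preemption is disabled */
		struct memcg_stock_pcp *stock = &__get_cpu_var(memcg_stock);

		drain_stock(stock);
		clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
	}

Calling it from a preemptible context triggers the smp_processor_id()
debug check above, and it may also drain (and clear the flag of) a
different CPU's stock than the one the caller marked.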

Andrew, could you pull this one, too?
http://www.spinics.net/lists/linux-mm/msg22636.html

Thanks,
-Kame

> ---
>  mm/memcontrol.c |    9 ++-------
>  1 files changed, 2 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 697a1d5..e9b1206 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2085,13 +2085,7 @@ static void drain_all_stock(struct mem_cgroup *root_mem, bool sync)
>  
>  	/* Notify other cpus that system-wide "drain" is running */
>  	get_online_cpus();
> -	/*
> -	 * Get a hint for avoiding draining charges on the current cpu,
> -	 * which must be exhausted by our charging.  It is not required that
> -	 * this be a precise check, so we use raw_smp_processor_id() instead of
> -	 * getcpu()/putcpu().
> -	 */
> -	curcpu = raw_smp_processor_id();
> +	curcpu = get_cpu();
>  	for_each_online_cpu(cpu) {
>  		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
>  		struct mem_cgroup *mem;
> @@ -2108,6 +2102,7 @@ static void drain_all_stock(struct mem_cgroup *root_mem, bool sync)
>  				schedule_work_on(cpu, &stock->work);
>  		}
>  	}
> +	put_cpu();
>  
>  	if (!sync)
>  		goto out;
> -- 
> 1.7.6


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [patch] memcg: pin execution to current cpu while draining stock
  2011-08-17 19:49       ` Johannes Weiner
@ 2011-08-18  5:56         ` Michal Hocko
  -1 siblings, 0 replies; 41+ messages in thread
From: Michal Hocko @ 2011-08-18  5:56 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: KAMEZAWA Hiroyuki, linux-mm, Balbir Singh, Andrew Morton, linux-kernel

On Wed 17-08-11 21:49:27, Johannes Weiner wrote:
> Commit d1a05b6 'memcg: do not try to drain per-cpu caches without
> pages' added a drain_local_stock() call to a preemptible section.
> 
> The draining task looks up the cpu-local stock twice: once to set the
> draining flag, and again to drain the stock and clear the flag.  If
> the task is migrated to a different CPU in between, no one will clear
> the flag on the first stock and it stays undrainable forever.  Its
> charge cannot be recovered and the cgroup cannot be deleted anymore.
> 
> Properly pin the task to the executing CPU while draining stocks.
> 
> Signed-off-by: Johannes Weiner <jweiner@redhat.com>
> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Cc: Michal Hocko <mhocko@suse.cz>

My fault, I didn't realize that drain_local_stock() needs to run with
preemption disabled. Sorry about that, and good work, Johannes.

Acked-by: Michal Hocko <mhocko@suse.cz>
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2011-08-18  5:56 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-07-22 12:43 [PATCH 0/4 v2] memcg: cleanup per-cpu charge caches Michal Hocko
2011-07-21  7:38 ` [PATCH 1/4] memcg: do not try to drain per-cpu caches without pages Michal Hocko
2011-07-25  1:16   ` KAMEZAWA Hiroyuki
2011-07-25  1:16     ` KAMEZAWA Hiroyuki
2011-08-17 19:49     ` [patch] memcg: pin execution to current cpu while draining stock Johannes Weiner
2011-08-17 19:49       ` Johannes Weiner
2011-08-18  0:24       ` KAMEZAWA Hiroyuki
2011-08-18  0:24         ` KAMEZAWA Hiroyuki
2011-08-18  5:56       ` Michal Hocko
2011-08-18  5:56         ` Michal Hocko
2011-07-22 10:24 ` [PATCH 2/4] memcg: unify sync and async per-cpu charge cache draining Michal Hocko
2011-07-22 10:27 ` [PATCH 3/4] memcg: add mem_cgroup_same_or_subtree helper Michal Hocko
2011-07-22 11:20 ` [PATCH 4/4] memcg: get rid of percpu_charge_mutex lock Michal Hocko
2011-07-25  1:18   ` KAMEZAWA Hiroyuki
2011-07-25  1:18     ` KAMEZAWA Hiroyuki
2011-08-08 18:47   ` Johannes Weiner
2011-08-08 18:47     ` Johannes Weiner
2011-08-08 21:47     ` Michal Hocko
2011-08-08 21:47       ` Michal Hocko
2011-08-08 23:19       ` Johannes Weiner
2011-08-08 23:19         ` Johannes Weiner
2011-08-09  7:26         ` Michal Hocko
2011-08-09  7:26           ` Michal Hocko
2011-08-09  9:31           ` [PATCH RFC] memcg: fix drain_all_stock crash Michal Hocko
2011-08-09  9:31             ` Michal Hocko
2011-08-09  9:32             ` KAMEZAWA Hiroyuki
2011-08-09  9:32               ` KAMEZAWA Hiroyuki
2011-08-09  9:45               ` Michal Hocko
2011-08-09  9:45                 ` Michal Hocko
2011-08-09  9:53                 ` KAMEZAWA Hiroyuki
2011-08-09  9:53                   ` KAMEZAWA Hiroyuki
2011-08-09 10:09                   ` Michal Hocko
2011-08-09 10:09                     ` Michal Hocko
2011-08-09 10:07                     ` KAMEZAWA Hiroyuki
2011-08-09 10:07                       ` KAMEZAWA Hiroyuki
2011-08-09 11:46                       ` Michal Hocko
2011-08-09 11:46                         ` Michal Hocko
2011-08-09 23:54                         ` KAMEZAWA Hiroyuki
2011-08-09 23:54                           ` KAMEZAWA Hiroyuki
2011-08-10  7:29                           ` Michal Hocko
2011-08-10  7:29                             ` Michal Hocko
