* 3.13-rc breaks MEMCG_SWAP
From: Hugh Dickins @ 2013-12-16  8:36 UTC
  To: Michal Hocko
  Cc: Li Zefan, Johannes Weiner, Tejun Heo, Andrew Morton,
	KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel

CONFIG_MEMCG_SWAP is broken in 3.13-rc.  Try something like this:

mkdir -p /tmp/tmpfs /tmp/memcg
mount -t tmpfs -o size=1G tmpfs /tmp/tmpfs
mount -t cgroup -o memory memcg /tmp/memcg
mkdir /tmp/memcg/old
echo 512M >/tmp/memcg/old/memory.limit_in_bytes
echo $$ >/tmp/memcg/old/tasks
cp /dev/zero /tmp/tmpfs/zero 2>/dev/null
echo $$ >/tmp/memcg/tasks
rmdir /tmp/memcg/old
sleep 1	# let rmdir work complete
mkdir /tmp/memcg/new
umount /tmp/tmpfs
dmesg | grep WARNING
rmdir /tmp/memcg/new
umount /tmp/memcg

Shows lots of WARNING: CPU: 1 PID: 1006 at kernel/res_counter.c:91
                           res_counter_uncharge_locked+0x1f/0x2f()
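
That warning is the usage-underflow guard in res_counter_uncharge_locked(),
firing because we end up uncharging a memsw counter that was never charged.
Quoting the 3.13-era kernel/res_counter.c from memory, so treat the exact
text as approximate:

	/* approximate, quoted from memory of 3.13-rc */
	void res_counter_uncharge_locked(struct res_counter *counter,
					 unsigned long val)
	{
		if (WARN_ON(counter->usage < val))
			val = counter->usage;

		counter->usage -= val;
	}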

Breakage comes from 34c00c319ce7 ("memcg: convert to use cgroup id").

The lifetime of a cgroup id is different from the lifetime of the
css id it replaced: memsw's css_get()s do nothing to hold on to the
old cgroup id, it soon gets recycled to a new cgroup, which then
mysteriously inherits the old's swap, without any charge for it.
(I thought memsw's particular need had been discussed and was
well understood when 34c00c319ce7 went in, but apparently not.)
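
To spell out the sequence (a sketch of my understanding, using the
function names as they stand in 3.13-rc):

/*
 * memcg M is assigned cgroup id N:
 *   swapout:  swap_cgroup_record(ent, N);      // memsw stays charged to M
 *   rmdir M:  id N is freed and soon recycled, although M's css is
 *             still pinned by its outstanding memsw charges
 * memcg M' is created and handed the same id N:
 *   swapoff:  id = swap_cgroup_record(ent, 0); // returns N
 *             memcg = mem_cgroup_lookup(id);   // finds M', not M
 *             res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
 *             // M' was never charged: usage underflows and warns
 */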

The right thing to do at this stage would be to revert that and its
associated commits; but I imagine to do so would be unwelcome to
the cgroup guys, going against their general direction; and I've
no idea how embedded that css_id removal has become by now.

Perhaps some creative refcounting can rescue memsw while still
using cgroup id?

But if not, I've looked up and updated a patch I prepared eighteen
months ago (when I too misunderstood how that memsw refcounting
worked, and mistakenly thought something like this necessary):
to scan the swap_cgroup arrays reassigning id when reparented.
This works fine in the testing I've done on it.

I've not given enough thought to the races around mem_cgroup_lookup():
maybe it's okay as I have it, maybe it needs more (perhaps restoring
the extra css_gets and css_puts that I removed, though that would be
a little sad).  And I've made almost no attempt to optimize the scan
of all swap areas, beyond not doing it if the memcg is using no swap.

I've kept in the various things that patch was doing, and not thought
through what their interdependencies are: it should probably be split
up e.g. some swap_cgroup locking changes in page_cgroup.c, to make the
locking easier or more efficient when reassigning.  But I'll certainly
not spend time on that if you decide to rescue memsw by some other way.

Hugh
---

 include/linux/page_cgroup.h |    1 
 mm/memcontrol.c             |   74 ++++++++++++++-------------
 mm/page_cgroup.c            |   90 +++++++++++++++++++++++-----------
 3 files changed, 101 insertions(+), 64 deletions(-)

--- 3.13-rc4/include/linux/page_cgroup.h	2012-09-30 16:47:46.000000000 -0700
+++ linux/include/linux/page_cgroup.h	2013-12-15 14:34:36.304485959 -0800
@@ -111,6 +111,7 @@ extern unsigned short swap_cgroup_cmpxch
 					unsigned short old, unsigned short new);
 extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id);
 extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent);
+extern long swap_cgroup_reassign(unsigned short old, unsigned short new);
 extern int swap_cgroup_swapon(int type, unsigned long max_pages);
 extern void swap_cgroup_swapoff(int type);
 #else
--- 3.13-rc4/mm/memcontrol.c	2013-12-15 13:15:41.634280121 -0800
+++ linux/mm/memcontrol.c	2013-12-15 14:34:36.308485960 -0800
@@ -873,10 +873,8 @@ static long mem_cgroup_read_stat(struct
 	return val;
 }
 
-static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg,
-					 bool charge)
+static void mem_cgroup_swap_statistics(struct mem_cgroup *memcg, long val)
 {
-	int val = (charge) ? 1 : -1;
 	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_SWAP], val);
 }
 
@@ -2871,8 +2869,8 @@ struct mem_cgroup *try_get_mem_cgroup_fr
 			memcg = NULL;
 	} else if (PageSwapCache(page)) {
 		ent.val = page_private(page);
-		id = lookup_swap_cgroup_id(ent);
 		rcu_read_lock();
+		id = lookup_swap_cgroup_id(ent);
 		memcg = mem_cgroup_lookup(id);
 		if (memcg && !css_tryget(&memcg->css))
 			memcg = NULL;
@@ -4238,15 +4236,11 @@ __mem_cgroup_uncharge_common(struct page
 	 */
 
 	unlock_page_cgroup(pc);
-	/*
-	 * even after unlock, we have memcg->res.usage here and this memcg
-	 * will never be freed, so it's safe to call css_get().
-	 */
+
 	memcg_check_events(memcg, page);
-	if (do_swap_account && ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT) {
-		mem_cgroup_swap_statistics(memcg, true);
-		css_get(&memcg->css);
-	}
+	if (do_swap_account && ctype == MEM_CGROUP_CHARGE_TYPE_SWAPOUT)
+		mem_cgroup_swap_statistics(memcg, 1);
+
 	/*
 	 * Migration does not charge the res_counter for the
 	 * replacement page, so leave it alone when phasing out the
@@ -4356,8 +4350,7 @@ mem_cgroup_uncharge_swapcache(struct pag
 	memcg = __mem_cgroup_uncharge_common(page, ctype, false);
 
 	/*
-	 * record memcg information,  if swapout && memcg != NULL,
-	 * css_get() was called in uncharge().
+	 * record memcg information
 	 */
 	if (do_swap_account && swapout && memcg)
 		swap_cgroup_record(ent, mem_cgroup_id(memcg));
@@ -4377,8 +4370,8 @@ void mem_cgroup_uncharge_swap(swp_entry_
 	if (!do_swap_account)
 		return;
 
-	id = swap_cgroup_record(ent, 0);
 	rcu_read_lock();
+	id = swap_cgroup_record(ent, 0);
 	memcg = mem_cgroup_lookup(id);
 	if (memcg) {
 		/*
@@ -4387,8 +4380,7 @@ void mem_cgroup_uncharge_swap(swp_entry_
 		 */
 		if (!mem_cgroup_is_root(memcg))
 			res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
-		mem_cgroup_swap_statistics(memcg, false);
-		css_put(&memcg->css);
+		mem_cgroup_swap_statistics(memcg, -1);
 	}
 	rcu_read_unlock();
 }
@@ -4415,31 +4407,40 @@ static int mem_cgroup_move_swap_account(
 	old_id = mem_cgroup_id(from);
 	new_id = mem_cgroup_id(to);
 
-	if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) {
-		mem_cgroup_swap_statistics(from, false);
-		mem_cgroup_swap_statistics(to, true);
-		/*
-		 * This function is only called from task migration context now.
-		 * It postpones res_counter and refcount handling till the end
-		 * of task migration(mem_cgroup_clear_mc()) for performance
-		 * improvement. But we cannot postpone css_get(to)  because if
-		 * the process that has been moved to @to does swap-in, the
-		 * refcount of @to might be decreased to 0.
-		 *
-		 * We are in attach() phase, so the cgroup is guaranteed to be
-		 * alive, so we can just call css_get().
-		 */
-		css_get(&to->css);
+	if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id)
 		return 0;
-	}
+
 	return -EINVAL;
 }
+
+static void mem_cgroup_reparent_swap(struct mem_cgroup *memcg)
+{
+	if (do_swap_account &&
+	    res_counter_read_u64(&memcg->memsw, RES_USAGE) >
+	    res_counter_read_u64(&memcg->kmem, RES_USAGE)) {
+		struct mem_cgroup *parent;
+		long reassigned;
+
+		parent = parent_mem_cgroup(memcg);
+		if (!parent)
+			parent = root_mem_cgroup;
+		reassigned = swap_cgroup_reassign(mem_cgroup_id(memcg),
+						  mem_cgroup_id(parent));
+
+		mem_cgroup_swap_statistics(memcg, -reassigned);
+		mem_cgroup_swap_statistics(parent, reassigned);
+	}
+}
 #else
 static inline int mem_cgroup_move_swap_account(swp_entry_t entry,
 				struct mem_cgroup *from, struct mem_cgroup *to)
 {
 	return -EINVAL;
 }
+
+static inline void mem_cgroup_reparent_swap(struct mem_cgroup *memcg)
+{
+}
 #endif
 
 /*
@@ -5017,6 +5018,7 @@ static int mem_cgroup_force_empty(struct
 	}
 	lru_add_drain();
 	mem_cgroup_reparent_charges(memcg);
+	/* but mem_cgroup_force_empty does not mem_cgroup_reparent_swap */
 
 	return 0;
 }
@@ -6348,6 +6350,7 @@ static void mem_cgroup_css_offline(struc
 
 	mem_cgroup_invalidate_reclaim_iterators(memcg);
 	mem_cgroup_reparent_charges(memcg);
+	mem_cgroup_reparent_swap(memcg);
 	mem_cgroup_destroy_all_caches(memcg);
 	vmpressure_cleanup(&memcg->vmpressure);
 }
@@ -6702,7 +6705,6 @@ static void __mem_cgroup_clear_mc(void)
 {
 	struct mem_cgroup *from = mc.from;
 	struct mem_cgroup *to = mc.to;
-	int i;
 
 	/* we must uncharge all the leftover precharges from mc.to */
 	if (mc.precharge) {
@@ -6724,8 +6726,8 @@ static void __mem_cgroup_clear_mc(void)
 			res_counter_uncharge(&mc.from->memsw,
 						PAGE_SIZE * mc.moved_swap);
 
-		for (i = 0; i < mc.moved_swap; i++)
-			css_put(&mc.from->css);
+		mem_cgroup_swap_statistics(from, -mc.moved_swap);
+		mem_cgroup_swap_statistics(to, mc.moved_swap);
 
 		if (!mem_cgroup_is_root(mc.to)) {
 			/*
--- 3.13-rc4/mm/page_cgroup.c	2013-02-18 15:58:34.000000000 -0800
+++ linux/mm/page_cgroup.c	2013-12-15 14:34:36.312485960 -0800
@@ -322,7 +322,8 @@ void __meminit pgdat_page_cgroup_init(st
 
 #ifdef CONFIG_MEMCG_SWAP
 
-static DEFINE_MUTEX(swap_cgroup_mutex);
+static DEFINE_SPINLOCK(swap_cgroup_lock);
+
 struct swap_cgroup_ctrl {
 	struct page **map;
 	unsigned long length;
@@ -353,14 +354,11 @@ struct swap_cgroup {
 /*
  * allocate buffer for swap_cgroup.
  */
-static int swap_cgroup_prepare(int type)
+static int swap_cgroup_prepare(struct swap_cgroup_ctrl *ctrl)
 {
 	struct page *page;
-	struct swap_cgroup_ctrl *ctrl;
 	unsigned long idx, max;
 
-	ctrl = &swap_cgroup_ctrl[type];
-
 	for (idx = 0; idx < ctrl->length; idx++) {
 		page = alloc_page(GFP_KERNEL | __GFP_ZERO);
 		if (!page)
@@ -407,18 +405,17 @@ unsigned short swap_cgroup_cmpxchg(swp_e
 {
 	struct swap_cgroup_ctrl *ctrl;
 	struct swap_cgroup *sc;
-	unsigned long flags;
 	unsigned short retval;
 
 	sc = lookup_swap_cgroup(ent, &ctrl);
 
-	spin_lock_irqsave(&ctrl->lock, flags);
+	spin_lock(&ctrl->lock);
 	retval = sc->id;
 	if (retval == old)
 		sc->id = new;
 	else
 		retval = 0;
-	spin_unlock_irqrestore(&ctrl->lock, flags);
+	spin_unlock(&ctrl->lock);
 	return retval;
 }
 
@@ -435,14 +432,13 @@ unsigned short swap_cgroup_record(swp_en
 	struct swap_cgroup_ctrl *ctrl;
 	struct swap_cgroup *sc;
 	unsigned short old;
-	unsigned long flags;
 
 	sc = lookup_swap_cgroup(ent, &ctrl);
 
-	spin_lock_irqsave(&ctrl->lock, flags);
+	spin_lock(&ctrl->lock);
 	old = sc->id;
 	sc->id = id;
-	spin_unlock_irqrestore(&ctrl->lock, flags);
+	spin_unlock(&ctrl->lock);
 
 	return old;
 }
@@ -451,19 +447,60 @@ unsigned short swap_cgroup_record(swp_en
  * lookup_swap_cgroup_id - lookup mem_cgroup id tied to swap entry
  * @ent: swap entry to be looked up.
  *
- * Returns CSS ID of mem_cgroup at success. 0 at failure. (0 is invalid ID)
+ * Returns ID of mem_cgroup at success. 0 at failure. (0 is invalid ID)
  */
 unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
 {
 	return lookup_swap_cgroup(ent, NULL)->id;
 }
 
+/**
+ * swap_cgroup_reassign - assign all old entries to new (before old is freed).
+ * @old: id of emptied memcg whose entries are now to be reassigned
+ * @new: id of parent memcg to which those entries are to be assigned
+ *
+ * Returns number of entries reassigned, for debugging or for statistics.
+ */
+long swap_cgroup_reassign(unsigned short old, unsigned short new)
+{
+	long reassigned = 0;
+	int type;
+
+	for (type = 0; type < MAX_SWAPFILES; type++) {
+		struct swap_cgroup_ctrl *ctrl = &swap_cgroup_ctrl[type];
+		unsigned long idx;
+
+		for (idx = 0; idx < ACCESS_ONCE(ctrl->length); idx++) {
+			struct swap_cgroup *sc, *scend;
+
+			spin_lock(&swap_cgroup_lock);
+			if (idx >= ACCESS_ONCE(ctrl->length))
+				goto unlock;
+			sc = page_address(ctrl->map[idx]);
+			for (scend = sc + SC_PER_PAGE; sc < scend; sc++) {
+				if (sc->id != old)
+					continue;
+				spin_lock(&ctrl->lock);
+				if (sc->id == old) {
+					sc->id = new;
+					reassigned++;
+				}
+				spin_unlock(&ctrl->lock);
+			}
+unlock:
+			spin_unlock(&swap_cgroup_lock);
+			cond_resched();
+		}
+	}
+	return reassigned;
+}
+
 int swap_cgroup_swapon(int type, unsigned long max_pages)
 {
 	void *array;
 	unsigned long array_size;
 	unsigned long length;
-	struct swap_cgroup_ctrl *ctrl;
+	struct swap_cgroup_ctrl ctrl;
 
 	if (!do_swap_account)
 		return 0;
@@ -475,23 +512,20 @@ int swap_cgroup_swapon(int type, unsigne
 	if (!array)
 		goto nomem;
 
-	ctrl = &swap_cgroup_ctrl[type];
-	mutex_lock(&swap_cgroup_mutex);
-	ctrl->length = length;
-	ctrl->map = array;
-	spin_lock_init(&ctrl->lock);
-	if (swap_cgroup_prepare(type)) {
-		/* memory shortage */
-		ctrl->map = NULL;
-		ctrl->length = 0;
-		mutex_unlock(&swap_cgroup_mutex);
-		vfree(array);
+	ctrl.length = length;
+	ctrl.map = array;
+	spin_lock_init(&ctrl.lock);
+
+	if (swap_cgroup_prepare(&ctrl))
 		goto nomem;
-	}
-	mutex_unlock(&swap_cgroup_mutex);
+
+	spin_lock(&swap_cgroup_lock);
+	swap_cgroup_ctrl[type] = ctrl;
+	spin_unlock(&swap_cgroup_lock);
 
 	return 0;
 nomem:
+	vfree(array);
 	printk(KERN_INFO "couldn't allocate enough memory for swap_cgroup.\n");
 	printk(KERN_INFO
 		"swap_cgroup can be disabled by swapaccount=0 boot option\n");
@@ -507,13 +541,13 @@ void swap_cgroup_swapoff(int type)
 	if (!do_swap_account)
 		return;
 
-	mutex_lock(&swap_cgroup_mutex);
+	spin_lock(&swap_cgroup_lock);
 	ctrl = &swap_cgroup_ctrl[type];
 	map = ctrl->map;
 	length = ctrl->length;
 	ctrl->map = NULL;
 	ctrl->length = 0;
-	mutex_unlock(&swap_cgroup_mutex);
+	spin_unlock(&swap_cgroup_lock);
 
 	if (map) {
 		for (i = 0; i < length; i++) {


* Re: 3.13-rc breaks MEMCG_SWAP
From: Li Zefan @ 2013-12-16  9:36 UTC
  To: Hugh Dickins
  Cc: Michal Hocko, Johannes Weiner, Tejun Heo, Andrew Morton,
	KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel

On 2013/12/16 16:36, Hugh Dickins wrote:
> CONFIG_MEMCG_SWAP is broken in 3.13-rc.  Try something like this:
> 
> mkdir -p /tmp/tmpfs /tmp/memcg
> mount -t tmpfs -o size=1G tmpfs /tmp/tmpfs
> mount -t cgroup -o memory memcg /tmp/memcg
> mkdir /tmp/memcg/old
> echo 512M >/tmp/memcg/old/memory.limit_in_bytes
> echo $$ >/tmp/memcg/old/tasks
> cp /dev/zero /tmp/tmpfs/zero 2>/dev/null
> echo $$ >/tmp/memcg/tasks
> rmdir /tmp/memcg/old
> sleep 1	# let rmdir work complete
> mkdir /tmp/memcg/new
> umount /tmp/tmpfs
> dmesg | grep WARNING
> rmdir /tmp/memcg/new
> umount /tmp/memcg
> 
> Shows lots of WARNING: CPU: 1 PID: 1006 at kernel/res_counter.c:91
>                            res_counter_uncharge_locked+0x1f/0x2f()
> 
> Breakage comes from 34c00c319ce7 ("memcg: convert to use cgroup id").
> 
> The lifetime of a cgroup id is different from the lifetime of the
> css id it replaced: memsw's css_get()s do nothing to hold on to the
> old cgroup id, it soon gets recycled to a new cgroup, which then
> mysteriously inherits the old's swap, without any charge for it.
> (I thought memsw's particular need had been discussed and was
> well understood when 34c00c319ce7 went in, but apparently not.)
> 
> The right thing to do at this stage would be to revert that and its
> associated commits; but I imagine to do so would be unwelcome to
> the cgroup guys, going against their general direction; and I've
> no idea how embedded that css_id removal has become by now.
> 
> Perhaps some creative refcounting can rescue memsw while still
> using cgroup id?
> 

Sorry for the breakage.

I think we can keep the cgroup->id until the last css reference is
dropped and the css is scheduled to be destroyed.
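
Something like the sketch below: purely hypothetical placement, written
before the real fix, and the exact hook may well differ.

/* hypothetical: recycle the id at css release time, not at rmdir time */
static void css_release(struct percpu_ref *ref)
{
	struct cgroup_subsys_state *css =
		container_of(ref, struct cgroup_subsys_state, refcnt);
	struct cgroup *cgrp = css->cgroup;

	/* deferred from cgroup destruction, so swap records stay valid */
	idr_remove(&cgrp->root->cgroup_idr, cgrp->id);

	/* ... existing release work ... */
}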

I'll cook a fix tomorrow.



* Re: 3.13-rc breaks MEMCG_SWAP
From: Michal Hocko @ 2013-12-16  9:49 UTC
  To: Hugh Dickins
  Cc: Li Zefan, Johannes Weiner, Tejun Heo, Andrew Morton,
	KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel

On Mon 16-12-13 00:36:05, Hugh Dickins wrote:
> CONFIG_MEMCG_SWAP is broken in 3.13-rc.  Try something like this:
> 
> mkdir -p /tmp/tmpfs /tmp/memcg
> mount -t tmpfs -o size=1G tmpfs /tmp/tmpfs
> mount -t cgroup -o memory memcg /tmp/memcg
> mkdir /tmp/memcg/old
> echo 512M >/tmp/memcg/old/memory.limit_in_bytes
> echo $$ >/tmp/memcg/old/tasks
> cp /dev/zero /tmp/tmpfs/zero 2>/dev/null
> echo $$ >/tmp/memcg/tasks
> rmdir /tmp/memcg/old
> sleep 1	# let rmdir work complete
> mkdir /tmp/memcg/new
> umount /tmp/tmpfs
> dmesg | grep WARNING
> rmdir /tmp/memcg/new
> umount /tmp/memcg
> 
> Shows lots of WARNING: CPU: 1 PID: 1006 at kernel/res_counter.c:91
>                            res_counter_uncharge_locked+0x1f/0x2f()

I noticed these warnings as well last week, while I was working on a fix
for the memsw charge vs. css_offline race. I thought the warning was
caused by something I had introduced in those patches, but I didn't get
to look closer.

> Breakage comes from 34c00c319ce7 ("memcg: convert to use cgroup id").

Thanks for finding this out! I thought the change was safe because
idr_remove is called after css_offline, so all the charges should already
be re-parented. That was obviously incorrect, as is now clear: I didn't
realize that some portion of the charges can still be sitting in swap.

> The lifetime of a cgroup id is different from the lifetime of the
> css id it replaced: memsw's css_get()s do nothing to hold on to the
> old cgroup id, it soon gets recycled to a new cgroup, which then
> mysteriously inherits the old's swap, without any charge for it.
> (I thought memsw's particular need had been discussed and was
> well understood when 34c00c319ce7 went in, but apparently not.)
> 
> The right thing to do at this stage would be to revert that and its
> associated commits; but I imagine to do so would be unwelcome to
> the cgroup guys, going against their general direction; and I've
> no idea how embedded that css_id removal has become by now.
> 
> Perhaps some creative refcounting can rescue memsw while still
> using cgroup id?

I am afraid this will not work.
 
> But if not, I've looked up and updated a patch I prepared eighteen
> months ago (when I too misunderstood how that memsw refcounting
> worked, and mistakenly thought something like this necessary):
> to scan the swap_cgroup arrays reassigning id when reparented.
> This works fine in the testing I've done on it.

Yes, something like that should work. We do make sure that RES_USAGE on
memcg->res is zero, but we never check the memcg->memsw counter.

> I've not given enough thought to the races around mem_cgroup_lookup():

mem_cgroup_lookup() with css_tryget() within the same RCU read section
should be sufficient. But let me think about it some more.
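
For the record, the ordering in question, as the patch now has it in
try_get_mem_cgroup_from_page():

	rcu_read_lock();
	id = lookup_swap_cgroup_id(ent);
	memcg = mem_cgroup_lookup(id);
	if (memcg && !css_tryget(&memcg->css))
		memcg = NULL;
	rcu_read_unlock();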

> maybe it's okay as I have it, maybe it needs more (perhaps restoring
> the extra css_gets and css_puts that I removed, though that would be
> a little sad).  And I've made almost no attempt to optimize the scan
> of all swap areas, beyond not doing it if the memcg is using no swap.
> 
> I've kept in the various things that patch was doing, and not thought
> through what their interdependencies are: it should probably be split
> up e.g. some swap_cgroup locking changes in page_cgroup.c, to make the
> locking easier or more efficient when reassigning.  But I'll certainly
> not spend time on that if you decide to rescue memsw by some other way.

The patch looks good from a quick glance. I will spend more time on the
full review later today.

Thanks a lot Hugh!
 
-- 
Michal Hocko
SUSE Labs

> +		}
> +	}
> +	return reassigned;
> +}
> +
>  int swap_cgroup_swapon(int type, unsigned long max_pages)
>  {
>  	void *array;
>  	unsigned long array_size;
>  	unsigned long length;
> -	struct swap_cgroup_ctrl *ctrl;
> +	struct swap_cgroup_ctrl ctrl;
>  
>  	if (!do_swap_account)
>  		return 0;
> @@ -475,23 +512,20 @@ int swap_cgroup_swapon(int type, unsigne
>  	if (!array)
>  		goto nomem;
>  
> -	ctrl = &swap_cgroup_ctrl[type];
> -	mutex_lock(&swap_cgroup_mutex);
> -	ctrl->length = length;
> -	ctrl->map = array;
> -	spin_lock_init(&ctrl->lock);
> -	if (swap_cgroup_prepare(type)) {
> -		/* memory shortage */
> -		ctrl->map = NULL;
> -		ctrl->length = 0;
> -		mutex_unlock(&swap_cgroup_mutex);
> -		vfree(array);
> +	ctrl.length = length;
> +	ctrl.map = array;
> +	spin_lock_init(&ctrl.lock);
> +
> +	if (swap_cgroup_prepare(&ctrl))
>  		goto nomem;
> -	}
> -	mutex_unlock(&swap_cgroup_mutex);
> +
> +	spin_lock(&swap_cgroup_lock);
> +	swap_cgroup_ctrl[type] = ctrl;
> +	spin_unlock(&swap_cgroup_lock);
>  
>  	return 0;
>  nomem:
> +	vfree(array);
>  	printk(KERN_INFO "couldn't allocate enough memory for swap_cgroup.\n");
>  	printk(KERN_INFO
>  		"swap_cgroup can be disabled by swapaccount=0 boot option\n");
> @@ -507,13 +541,13 @@ void swap_cgroup_swapoff(int type)
>  	if (!do_swap_account)
>  		return;
>  
> -	mutex_lock(&swap_cgroup_mutex);
> +	spin_lock(&swap_cgroup_lock);
>  	ctrl = &swap_cgroup_ctrl[type];
>  	map = ctrl->map;
>  	length = ctrl->length;
>  	ctrl->map = NULL;
>  	ctrl->length = 0;
> -	mutex_unlock(&swap_cgroup_mutex);
> +	spin_unlock(&swap_cgroup_lock);
>  
>  	if (map) {
>  		for (i = 0; i < length; i++) {

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: 3.13-rc breaks MEMCG_SWAP
  2013-12-16  9:36   ` Li Zefan
@ 2013-12-16  9:53     ` Michal Hocko
  -1 siblings, 0 replies; 54+ messages in thread
From: Michal Hocko @ 2013-12-16  9:53 UTC (permalink / raw)
  To: Li Zefan
  Cc: Hugh Dickins, Johannes Weiner, Tejun Heo, Andrew Morton,
	KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel

On Mon 16-12-13 17:36:09, Li Zefan wrote:
> On 2013/12/16 16:36, Hugh Dickins wrote:
> > CONFIG_MEMCG_SWAP is broken in 3.13-rc.  Try something like this:
> > 
> > mkdir -p /tmp/tmpfs /tmp/memcg
> > mount -t tmpfs -o size=1G tmpfs /tmp/tmpfs
> > mount -t cgroup -o memory memcg /tmp/memcg
> > mkdir /tmp/memcg/old
> > echo 512M >/tmp/memcg/old/memory.limit_in_bytes
> > echo $$ >/tmp/memcg/old/tasks
> > cp /dev/zero /tmp/tmpfs/zero 2>/dev/null
> > echo $$ >/tmp/memcg/tasks
> > rmdir /tmp/memcg/old
> > sleep 1	# let rmdir work complete
> > mkdir /tmp/memcg/new
> > umount /tmp/tmpfs
> > dmesg | grep WARNING
> > rmdir /tmp/memcg/new
> > umount /tmp/memcg
> > 
> > Shows lots of WARNING: CPU: 1 PID: 1006 at kernel/res_counter.c:91
> >                            res_counter_uncharge_locked+0x1f/0x2f()
> > 
> > Breakage comes from 34c00c319ce7 ("memcg: convert to use cgroup id").
> > 
> > The lifetime of a cgroup id is different from the lifetime of the
> > css id it replaced: memsw's css_get()s do nothing to hold on to the
> > old cgroup id, it soon gets recycled to a new cgroup, which then
> > mysteriously inherits the old's swap, without any charge for it.
> > (I thought memsw's particular need had been discussed and was
> > well understood when 34c00c319ce7 went in, but apparently not.)
> > 
> > The right thing to do at this stage would be to revert that and its
> > associated commits; but I imagine to do so would be unwelcome to
> > the cgroup guys, going against their general direction; and I've
> > no idea how embedded that css_id removal has become by now.
> > 
> > Perhaps some creative refcounting can rescue memsw while still
> > using cgroup id?
> > 
> 
> Sorry for the breakage.
> 
> I think we can keep the cgroup->id until the last css reference is
> dropped and the css is scheduled to be destroyed.

How would this work? The task which pushed the memory to the swap is
still alive (living in a different group) and the swap will be there
after the last reference to css as well.
 
> I'll cook a fix tomorrow.
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: 3.13-rc breaks MEMCG_SWAP
  2013-12-16  9:53     ` Michal Hocko
@ 2013-12-16 10:40       ` Michal Hocko
  -1 siblings, 0 replies; 54+ messages in thread
From: Michal Hocko @ 2013-12-16 10:40 UTC (permalink / raw)
  To: Li Zefan
  Cc: Hugh Dickins, Johannes Weiner, Tejun Heo, Andrew Morton,
	KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel

On Mon 16-12-13 10:53:45, Michal Hocko wrote:
> On Mon 16-12-13 17:36:09, Li Zefan wrote:
> > On 2013/12/16 16:36, Hugh Dickins wrote:
> > > CONFIG_MEMCG_SWAP is broken in 3.13-rc.  Try something like this:
> > > 
> > > mkdir -p /tmp/tmpfs /tmp/memcg
> > > mount -t tmpfs -o size=1G tmpfs /tmp/tmpfs
> > > mount -t cgroup -o memory memcg /tmp/memcg
> > > mkdir /tmp/memcg/old
> > > echo 512M >/tmp/memcg/old/memory.limit_in_bytes
> > > echo $$ >/tmp/memcg/old/tasks
> > > cp /dev/zero /tmp/tmpfs/zero 2>/dev/null
> > > echo $$ >/tmp/memcg/tasks
> > > rmdir /tmp/memcg/old
> > > sleep 1	# let rmdir work complete
> > > mkdir /tmp/memcg/new
> > > umount /tmp/tmpfs
> > > dmesg | grep WARNING
> > > rmdir /tmp/memcg/new
> > > umount /tmp/memcg
> > > 
> > > Shows lots of WARNING: CPU: 1 PID: 1006 at kernel/res_counter.c:91
> > >                            res_counter_uncharge_locked+0x1f/0x2f()
> > > 
> > > Breakage comes from 34c00c319ce7 ("memcg: convert to use cgroup id").
> > > 
> > > The lifetime of a cgroup id is different from the lifetime of the
> > > css id it replaced: memsw's css_get()s do nothing to hold on to the
> > > old cgroup id, it soon gets recycled to a new cgroup, which then
> > > mysteriously inherits the old's swap, without any charge for it.
> > > (I thought memsw's particular need had been discussed and was
> > > well understood when 34c00c319ce7 went in, but apparently not.)
> > > 
> > > The right thing to do at this stage would be to revert that and its
> > > associated commits; but I imagine to do so would be unwelcome to
> > > the cgroup guys, going against their general direction; and I've
> > > no idea how embedded that css_id removal has become by now.
> > > 
> > > Perhaps some creative refcounting can rescue memsw while still
> > > using cgroup id?
> > > 
> > 
> > Sorry for the breakage.
> > 
> > I think we can keep the cgroup->id until the last css reference is
> > dropped and the css is scheduled to be destroyed.
> 
> How would this work? The task which pushed the memory to the swap is
> still alive (living in a different group) and the swap will be there
> after the last reference to css as well.

Or did you mean to get css reference in swap_cgroup_record and release
it in __mem_cgroup_try_charge_swapin?
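
To make the question concrete, something like this (only a sketch,
untested; css_get()/css_put() and the record helpers as used in Hugh's
patch above):

	/*
	 * Hypothetical variant: pin the css for as long as its id is
	 * recorded in the swap_cgroup array, so that the id cannot be
	 * recycled under us.
	 */
	if (do_swap_account && swapout && memcg) {
		css_get(&memcg->css);		/* pin at swap-out */
		swap_cgroup_record(ent, mem_cgroup_id(memcg));
	}

	/* ... and in __mem_cgroup_try_charge_swapin(), once the record
	 * has been consumed: */
	css_put(&memcg->css);			/* unpin at swap-in */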

That would prevent the warning (assuming idr_remove would move to
css_free[1]) but I am not sure this is the right thing to do. memsw charges
will be accounted to the parent already (assuming there is one) without
anybody to uncharge them because all uncharges would fall back to the
root memcg after css_offline.

Hugh's approach seems much better.

---
[1] Is this even possible? I cannot say I understand the comment above
idr_remove in cgroup_destroy_css_killed 100%, but it suggests we cannot
postpone it until later.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: 3.13-rc breaks MEMCG_SWAP
  2013-12-16  8:36 ` Hugh Dickins
@ 2013-12-16 16:20   ` Michal Hocko
  -1 siblings, 0 replies; 54+ messages in thread
From: Michal Hocko @ 2013-12-16 16:20 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Li Zefan, Johannes Weiner, Tejun Heo, Andrew Morton,
	KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel

On Mon 16-12-13 00:36:05, Hugh Dickins wrote:
[...]

OK, I went through the patch and it looks good except for the suspicious
ctrl->lock handling in swap_cgroup_reassign (see below). I am just
suggesting that it be split into 4 parts:

* swap_cgroup_mutex -> swap_cgroup_lock
* swapon cleanup
* drop irqsave when taking ctrl->lock
* mem_cgroup_reparent_swap

but I will leave the split up to you. Just make sure that the fix is a
separate patch, please.

[...]
> --- 3.13-rc4/mm/page_cgroup.c	2013-02-18 15:58:34.000000000 -0800
> +++ linux/mm/page_cgroup.c	2013-12-15 14:34:36.312485960 -0800
> @@ -322,7 +322,8 @@ void __meminit pgdat_page_cgroup_init(st
>  
>  #ifdef CONFIG_MEMCG_SWAP
>  
> -static DEFINE_MUTEX(swap_cgroup_mutex);
> +static DEFINE_SPINLOCK(swap_cgroup_lock);
> +

This one is worth a separate patch IMO.

>  struct swap_cgroup_ctrl {
>  	struct page **map;
>  	unsigned long length;
> @@ -353,14 +354,11 @@ struct swap_cgroup {
>  /*
>   * allocate buffer for swap_cgroup.
>   */
> -static int swap_cgroup_prepare(int type)
> +static int swap_cgroup_prepare(struct swap_cgroup_ctrl *ctrl)
>  {
>  	struct page *page;
> -	struct swap_cgroup_ctrl *ctrl;
>  	unsigned long idx, max;
>  
> -	ctrl = &swap_cgroup_ctrl[type];
> -
>  	for (idx = 0; idx < ctrl->length; idx++) {
>  		page = alloc_page(GFP_KERNEL | __GFP_ZERO);
>  		if (!page)

This with swap_cgroup_swapon should be in a separate patch as a cleanup.

> @@ -407,18 +405,17 @@ unsigned short swap_cgroup_cmpxchg(swp_e
>  {
>  	struct swap_cgroup_ctrl *ctrl;
>  	struct swap_cgroup *sc;
> -	unsigned long flags;
>  	unsigned short retval;
>  
>  	sc = lookup_swap_cgroup(ent, &ctrl);
>  
> -	spin_lock_irqsave(&ctrl->lock, flags);
> +	spin_lock(&ctrl->lock);
>  	retval = sc->id;
>  	if (retval == old)
>  		sc->id = new;
>  	else
>  		retval = 0;
> -	spin_unlock_irqrestore(&ctrl->lock, flags);
> +	spin_unlock(&ctrl->lock);
>  	return retval;
>  }
>  
> @@ -435,14 +432,13 @@ unsigned short swap_cgroup_record(swp_en
>  	struct swap_cgroup_ctrl *ctrl;
>  	struct swap_cgroup *sc;
>  	unsigned short old;
> -	unsigned long flags;
>  
>  	sc = lookup_swap_cgroup(ent, &ctrl);
>  
> -	spin_lock_irqsave(&ctrl->lock, flags);
> +	spin_lock(&ctrl->lock);
>  	old = sc->id;
>  	sc->id = id;
> -	spin_unlock_irqrestore(&ctrl->lock, flags);
> +	spin_unlock(&ctrl->lock);
>  
>  	return old;
>  }

I would prefer these two in a separate patch as well. I have no idea why
these were IRQ aware as this was never needed AFAICS.
e9e58a4ec3b10 is not very specific...

> @@ -451,19 +447,60 @@ unsigned short swap_cgroup_record(swp_en
>   * lookup_swap_cgroup_id - lookup mem_cgroup id tied to swap entry
>   * @ent: swap entry to be looked up.
>   *
> - * Returns CSS ID of mem_cgroup at success. 0 at failure. (0 is invalid ID)
> + * Returns ID of mem_cgroup at success. 0 at failure. (0 is invalid ID)
>   */
>  unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
>  {
>  	return lookup_swap_cgroup(ent, NULL)->id;
>  }
>  
> +/**
> + * swap_cgroup_reassign - assign all old entries to new (before old is freed).
> + * @old: id of emptied memcg whose entries are now to be reassigned
> + * @new: id of parent memcg to which those entries are to be assigned
> + *
> + * Returns number of entries reassigned, for debugging or for statistics.
> + */
> +long swap_cgroup_reassign(unsigned short old, unsigned short new)
> +{
> +	long reassigned = 0;
> +	int type;
> +
> +	for (type = 0; type < MAX_SWAPFILES; type++) {
> +		struct swap_cgroup_ctrl *ctrl = &swap_cgroup_ctrl[type];
> +		unsigned long idx;
> +
> +		for (idx = 0; idx < ACCESS_ONCE(ctrl->length); idx++) {
> +			struct swap_cgroup *sc, *scend;
> +
> +			spin_lock(&swap_cgroup_lock);
> +			if (idx >= ACCESS_ONCE(ctrl->length))
> +				goto unlock;
> +			sc = page_address(ctrl->map[idx]);
> +			for (scend = sc + SC_PER_PAGE; sc < scend; sc++) {
> +				if (sc->id != old)
> +					continue;

Is this safe? What prevents a race in which the id is set back to old?

> +				spin_lock(&ctrl->lock);
> +				if (sc->id == old) {

Also it seems that the compiler is free to optimize this test away, no?
You need ACCESS_ONCE here as well, I guess.
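
To be explicit about the pattern I mean, a stand-alone userspace
analogue (ACCESS_ONCE defined exactly as in the kernel; the unlocked
test is only an optimization, the re-check under the lock decides):

#include <stdio.h>
#include <pthread.h>

#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* reassign every entry matching old to new, as swap_cgroup_reassign does */
static long reassign(unsigned short *ids, long n,
		     unsigned short old, unsigned short new)
{
	long reassigned = 0;
	long i;

	for (i = 0; i < n; i++) {
		/* unlocked fast path: racy, so force a real load */
		if (ACCESS_ONCE(ids[i]) != old)
			continue;
		pthread_mutex_lock(&lock);
		/* re-check under the lock before modifying the entry */
		if (ids[i] == old) {
			ids[i] = new;
			reassigned++;
		}
		pthread_mutex_unlock(&lock);
	}
	return reassigned;
}

int main(void)
{
	unsigned short ids[] = { 3, 5, 3, 0, 3 };

	printf("reassigned %ld\n", reassign(ids, 5, 3, 9));
	return 0;
}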

> +					sc->id = new;
> +					reassigned++;
> +				}
> +				spin_unlock(&ctrl->lock);
> +			}
> +unlock:
> +			spin_unlock(&swap_cgroup_lock);
> +			cond_resched();
> +		}
> +	}
> +	return reassigned;
> +}
> +
>  int swap_cgroup_swapon(int type, unsigned long max_pages)
>  {
>  	void *array;
>  	unsigned long array_size;
>  	unsigned long length;
> -	struct swap_cgroup_ctrl *ctrl;
> +	struct swap_cgroup_ctrl ctrl;
>  
>  	if (!do_swap_account)
>  		return 0;
[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: 3.13-rc breaks MEMCG_SWAP
  2013-12-16 10:40       ` Michal Hocko
@ 2013-12-16 16:35         ` Tejun Heo
  -1 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2013-12-16 16:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Li Zefan, Hugh Dickins, Johannes Weiner, Andrew Morton,
	KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel

On Mon, Dec 16, 2013 at 11:40:42AM +0100, Michal Hocko wrote:
> > How would this work? The task which pushed the memory to the swap is
> > still alive (living in a different group) and the swap will be there
> > after the last reference to css as well.
> 
> Or did you mean to get css reference in swap_cgroup_record and release
> it in __mem_cgroup_try_charge_swapin?
> 
> That would prevent the warning (assuming idr_remove would move to
> css_free[1]) but I am not sure this is the right thing to do. memsw charges
> will be accounted to the parent already (assuming there is one) without
> anybody to uncharge them because all uncharges would fall back to the
> root memcg after css_offline.
> 
> Hugh's approach seems much better.

Hmmm... I think it's reasonable for css's to expect cgrp->id to not be
recycled before all css refs are gone.  If Hugh's patches are
something desirable independent of cgrp->id issues, great, but if
not, let's first try to get it right from cgroup core.  Is it enough
for css_from_id() to return NULL after offline until all refs are
gone?  That should be an easy fix.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: 3.13-rc breaks MEMCG_SWAP
  2013-12-16 10:40       ` Michal Hocko
@ 2013-12-16 16:41         ` Johannes Weiner
  -1 siblings, 0 replies; 54+ messages in thread
From: Johannes Weiner @ 2013-12-16 16:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Li Zefan, Hugh Dickins, Tejun Heo, Andrew Morton,
	KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel

On Mon, Dec 16, 2013 at 11:40:42AM +0100, Michal Hocko wrote:
> On Mon 16-12-13 10:53:45, Michal Hocko wrote:
> > On Mon 16-12-13 17:36:09, Li Zefan wrote:
> > > On 2013/12/16 16:36, Hugh Dickins wrote:
> > > > CONFIG_MEMCG_SWAP is broken in 3.13-rc.  Try something like this:
> > > > 
> > > > mkdir -p /tmp/tmpfs /tmp/memcg
> > > > mount -t tmpfs -o size=1G tmpfs /tmp/tmpfs
> > > > mount -t cgroup -o memory memcg /tmp/memcg
> > > > mkdir /tmp/memcg/old
> > > > echo 512M >/tmp/memcg/old/memory.limit_in_bytes
> > > > echo $$ >/tmp/memcg/old/tasks
> > > > cp /dev/zero /tmp/tmpfs/zero 2>/dev/null
> > > > echo $$ >/tmp/memcg/tasks
> > > > rmdir /tmp/memcg/old
> > > > sleep 1	# let rmdir work complete
> > > > mkdir /tmp/memcg/new
> > > > umount /tmp/tmpfs
> > > > dmesg | grep WARNING
> > > > rmdir /tmp/memcg/new
> > > > umount /tmp/memcg
> > > > 
> > > > Shows lots of WARNING: CPU: 1 PID: 1006 at kernel/res_counter.c:91
> > > >                            res_counter_uncharge_locked+0x1f/0x2f()
> > > > 
> > > > Breakage comes from 34c00c319ce7 ("memcg: convert to use cgroup id").
> > > > 
> > > > The lifetime of a cgroup id is different from the lifetime of the
> > > > css id it replaced: memsw's css_get()s do nothing to hold on to the
> > > > old cgroup id, it soon gets recycled to a new cgroup, which then
> > > > mysteriously inherits the old's swap, without any charge for it.
> > > > (I thought memsw's particular need had been discussed and was
> > > > well understood when 34c00c319ce7 went in, but apparently not.)
> > > > 
> > > > The right thing to do at this stage would be to revert that and its
> > > > associated commits; but I imagine to do so would be unwelcome to
> > > > the cgroup guys, going against their general direction; and I've
> > > > no idea how embedded that css_id removal has become by now.
> > > > 
> > > > Perhaps some creative refcounting can rescue memsw while still
> > > > using cgroup id?
> > > > 
> > > 
> > > Sorry for the breakage.
> > > 
> > > I think we can keep the cgroup->id until the last css reference is
> > > dropped and the css is scheduled to be destroyed.
> > 
> > How would this work? The task which pushed the memory to the swap is
> > still alive (living in a different group) and the swap will be there
> > after the last reference to css as well.
> 
> Or did you mean to get css reference in swap_cgroup_record and release
> it in __mem_cgroup_try_charge_swapin?

We already do that, swap records hold a css reference.  We do the put
in mem_cgroup_uncharge_swap().

It really strikes me as odd that we recycle the cgroup ID while there
are still references to the cgroup in circulation.
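
Reduced to a toy (the table here is made up, but it has the same shape
as the cgroup id namespace plus a swap record keyed by that id):

#include <stdio.h>

/* toy "id -> object" table; the slot index plays the role of cgrp->id */
static const char *table[4];

static int id_alloc(const char *obj)
{
	int id;

	for (id = 1; id < 4; id++) {	/* 0 is the invalid id */
		if (!table[id]) {
			table[id] = obj;
			return id;
		}
	}
	return -1;
}

int main(void)
{
	/* swap-out records the old group's id, as swap_cgroup_record does */
	int record = id_alloc("old memcg");

	table[record] = NULL;		/* rmdir: the id is recycled early */
	id_alloc("new memcg");		/* a new group picks up the same id */

	/* the stale swap record now resolves to the wrong group, which
	 * "mysteriously inherits the old's swap" */
	printf("record %d -> %s\n", record, table[record]);
	return 0;
}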

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: 3.13-rc breaks MEMCG_SWAP
  2013-12-16 16:41         ` Johannes Weiner
@ 2013-12-16 17:15           ` Michal Hocko
  -1 siblings, 0 replies; 54+ messages in thread
From: Michal Hocko @ 2013-12-16 17:15 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Li Zefan, Hugh Dickins, Tejun Heo, Andrew Morton,
	KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel

On Mon 16-12-13 11:41:54, Johannes Weiner wrote:
> On Mon, Dec 16, 2013 at 11:40:42AM +0100, Michal Hocko wrote:
> > On Mon 16-12-13 10:53:45, Michal Hocko wrote:
> > > On Mon 16-12-13 17:36:09, Li Zefan wrote:
> > > > On 2013/12/16 16:36, Hugh Dickins wrote:
> > > > > CONFIG_MEMCG_SWAP is broken in 3.13-rc.  Try something like this:
> > > > > 
> > > > > mkdir -p /tmp/tmpfs /tmp/memcg
> > > > > mount -t tmpfs -o size=1G tmpfs /tmp/tmpfs
> > > > > mount -t cgroup -o memory memcg /tmp/memcg
> > > > > mkdir /tmp/memcg/old
> > > > > echo 512M >/tmp/memcg/old/memory.limit_in_bytes
> > > > > echo $$ >/tmp/memcg/old/tasks
> > > > > cp /dev/zero /tmp/tmpfs/zero 2>/dev/null
> > > > > echo $$ >/tmp/memcg/tasks
> > > > > rmdir /tmp/memcg/old
> > > > > sleep 1	# let rmdir work complete
> > > > > mkdir /tmp/memcg/new
> > > > > umount /tmp/tmpfs
> > > > > dmesg | grep WARNING
> > > > > rmdir /tmp/memcg/new
> > > > > umount /tmp/memcg
> > > > > 
> > > > > Shows lots of WARNING: CPU: 1 PID: 1006 at kernel/res_counter.c:91
> > > > >                            res_counter_uncharge_locked+0x1f/0x2f()
> > > > > 
> > > > > Breakage comes from 34c00c319ce7 ("memcg: convert to use cgroup id").
> > > > > 
> > > > > The lifetime of a cgroup id is different from the lifetime of the
> > > > > css id it replaced: memsw's css_get()s do nothing to hold on to the
> > > > > old cgroup id, it soon gets recycled to a new cgroup, which then
> > > > > mysteriously inherits the old's swap, without any charge for it.
> > > > > (I thought memsw's particular need had been discussed and was
> > > > > well understood when 34c00c319ce7 went in, but apparently not.)
> > > > > 
> > > > > The right thing to do at this stage would be to revert that and its
> > > > > associated commits; but I imagine to do so would be unwelcome to
> > > > > the cgroup guys, going against their general direction; and I've
> > > > > no idea how embedded that css_id removal has become by now.
> > > > > 
> > > > > Perhaps some creative refcounting can rescue memsw while still
> > > > > using cgroup id?
> > > > > 
> > > > 
> > > > Sorry for the breakage.
> > > > 
> > > > I think we can keep the cgroup->id until the last css reference is
> > > > dropped and the css is scheduled to be destroyed.
> > > 
> > > How would this work? The task which pushed the memory to the swap is
> > > still alive (living in a different group) and the swap will be there
> > > after the last reference to css as well.
> > 
> > Or did you mean to get css reference in swap_cgroup_record and release
> > it in __mem_cgroup_try_charge_swapin?
> 
> We already do that, swap records hold a css reference.  We do the put
> in mem_cgroup_uncharge_swap().

Dohh! You are right, I totally missed that the css_get is buried in
__mem_cgroup_uncharge_common and the counterpart is in mem_cgroup_uncharge_swap
(which is less unexpected).

> It really strikes me as odd that we recycle the cgroup ID while there
> are still references to the cgroup in circulation.

That is true, but even with this fixed I still think that Hugh's
approach makes a lot of sense.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: 3.13-rc breaks MEMCG_SWAP
  2013-12-16 16:35         ` Tejun Heo
@ 2013-12-16 17:19           ` Michal Hocko
  -1 siblings, 0 replies; 54+ messages in thread
From: Michal Hocko @ 2013-12-16 17:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Hugh Dickins, Johannes Weiner, Andrew Morton,
	KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel

On Mon 16-12-13 11:35:30, Tejun Heo wrote:
> On Mon, Dec 16, 2013 at 11:40:42AM +0100, Michal Hocko wrote:
> > > How would this work? The task which pushed the memory to the swap is
> > > still alive (living in a different group) and the swap will be there
> > > after the last reference to css as well.
> > 
> > Or did you mean to get css reference in swap_cgroup_record and release
> > it in __mem_cgroup_try_charge_swapin?
> > 
> > That would prevent the warning (assuming idr_remove would move to
> > css_free[1]) but I am not sure this is the right thing to do. memsw charges
> > will be accounted to the parent already (assuming there is one) without
> > anybody to uncharge them because all uncharges would fallback to the
> > root memcg after css_offline.
> > 
> > Hugh's approach seems much better.
> 
> Hmmm... I think it's reasonable for css's to expect cgrp->id to not be
> recycled before all css refs are gone.  If Hugh's patches are
> something desirable independent of cgrp->id issues, great, but if
> not, let's first try to get it right from cgroup core.  Is it enough
> for css_from_id() to return NULL after offline until all refs are
> gone?  That should be an easy fix.

I have to think about it some more (the brain is not working anymore
today). But what we really need is that nobody gets the same id while
the css is alive. So css_from_id returning NULL doesn't seem to be
enough.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: 3.13-rc breaks MEMCG_SWAP
  2013-12-16 17:15           ` Michal Hocko
@ 2013-12-16 17:19             ` Tejun Heo
  -1 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2013-12-16 17:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Johannes Weiner, Li Zefan, Hugh Dickins, Andrew Morton,
	KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel

Hello, Michal, Johannes.

On Mon, Dec 16, 2013 at 06:15:27PM +0100, Michal Hocko wrote:
> > We already do that, swap records hold a css reference.  We do the put
> > in mem_cgroup_uncharge_swap().
> 
> Dohh! You are right I have totally missed that the css_get is burried in
> __mem_cgroup_uncharge_common and the counterpart is in mem_cgroup_uncharge_swap
> (which is less unexpected).
> 
> > It really strikes me as odd that we recycle the cgroup ID while there
> > are still references to the cgroup in circulation.
> 
> That is true but even with this fixed I still think that the Hugh's
> approach makes a lot of sense.

I thought about this a bit and I think the id really should be per
subsystem - ie. like css_id but just a dumb id as cgrp->id.  The
reason is that cgroup's lifetime and css's lifetime will soon be
decoupled.  ie. if a css is disabled and re-enabled on the same
cgroup, there can be two css's associated with a single cgroup.
cgroup_css() and css iterators should block accesses to css's which
are being drained but it does make sense for id lookup to work until
the css is actually released.

That said, for now, whatever works is fine and if Hugh's suggested
change is desirable anyway, that should do for now.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: 3.13-rc breaks MEMCG_SWAP
  2013-12-16 17:19           ` Michal Hocko
@ 2013-12-16 17:21             ` Tejun Heo
  -1 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2013-12-16 17:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Li Zefan, Hugh Dickins, Johannes Weiner, Andrew Morton,
	KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel

Hey,

On Mon, Dec 16, 2013 at 06:19:37PM +0100, Michal Hocko wrote:
> I have to think about it some more (the brain is not working anymore
> today). But what we really need is that nobody gets the same id while
> the css is alive. So css_from_id returning NULL doesn't seem to be
> enough.

Oh, I meant whether it's necessary to keep css_from_id() working
(ie. doing successful lookups) between offline and release, because
that's where lifetimes are coupled.  IOW, if it's enough for cgroup to
not recycle the ID until all css's are released && fail css_from_id()
lookup after the css is offlined, I can make a five liner quick fix.
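
Roughly this shape (a hypothetical sketch, not the actual change; the
lookup helper and the online test in the body are made-up placeholders):

	/*
	 * Keep the id allocated until the last css ref is released,
	 * but refuse lookups once the css has gone offline.
	 */
	struct cgroup_subsys_state *css_from_id(int id, struct cgroup_subsys *ss)
	{
		struct cgroup_subsys_state *css = __css_lookup(id, ss);

		if (css && !css_is_online(css))	/* made-up online test */
			return NULL;
		return css;
	}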

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: 3.13-rc breaks MEMCG_SWAP
  2013-12-16 17:21             ` Tejun Heo
@ 2013-12-17  1:41               ` Hugh Dickins
  -1 siblings, 0 replies; 54+ messages in thread
From: Hugh Dickins @ 2013-12-17  1:41 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Li Zefan, Hugh Dickins, Johannes Weiner,
	Andrew Morton, KAMEZAWA Hiroyuki, linux-mm, cgroups,
	linux-kernel

On Mon, 16 Dec 2013, Tejun Heo wrote:
> On Mon, Dec 16, 2013 at 06:19:37PM +0100, Michal Hocko wrote:
> > I have to think about it some more (the brain is not working anymore
> > today). But what we really need is that nobody gets the same id while
> > the css is alive. So css_from_id returning NULL doesn't seem to be
> > enough.
> 
> Oh, I meant whether it's necessary to keep css_from_id() working
> (ie. doing successful lookups) between offline and release, because
> that's where lifetimes are coupled.  IOW, if it's enough for cgroup to
> not recycle the ID until all css's are released && fail css_from_id()
> lookup after the css is offlined, I can make a five liner quick fix.

Don't take my word on it, I'm too fuzzy on this: but although it would
be good to refrain from recycling the ID until all css's are released,
I believe that it would not be good enough to fail css_from_id() once
the css is offlined - mem_cgroup_uncharge_swap() needs to uncharge the
hierarchy of the dead memcg (for example, when tmpfs file is removed).

Uncharging the dead memcg itself is presumably irrelevant, but it does
need to locate the right parent to uncharge, and NULL css_from_id()
would make that impossible.  It would be easy if we said those charges
migrate to root rather than to parent, but that's inconsistent with
what we have happily converged upon doing elsewhere (in the preferred
use_hierarchy case), and it would be a change in behaviour.
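
To spell out the dependency (a much-simplified sketch, not the actual
3.13 code - mem_cgroup_lookup() is the id-to-memcg mapping built on the
cgroup id lookup, and the css_put matches the reference the swap record
holds):

	static void sketch_uncharge_swap(swp_entry_t ent)
	{
		unsigned short id = swap_cgroup_record(ent, 0);
		struct mem_cgroup *memcg;

		rcu_read_lock();
		memcg = mem_cgroup_lookup(id);	/* NULL once id is gone */
		if (memcg) {
			/* res_counter_uncharge() walks up the parents,
			 * so the dead memcg's hierarchy gets uncharged */
			res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
			css_put(&memcg->css);
		}
		rcu_read_unlock();
		/* with a NULL lookup there is no way to reach the
		 * parent, and the memsw charge would simply leak */
	}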

I'm not nearly as enthusiastic for my patch as Michal is: I really
would prefer a five-liner from you or from Zefan.  I do think (and
this is probably what Michal likes) that my patch leaves MEMCG_SWAP
less surprising, and less likely to cause similar trouble in future;
but it's not how Kame chose to implement it, and it has those nasty
swap_cgroup array scans adding to the overhead of memcg removal -
we can layer on several different hacks/optimizations to reduce that
overhead, but I think it's debatable whether that will end up as an
improvement over what we have had until now.

Hugh

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: 3.13-rc breaks MEMCG_SWAP
  2013-12-16 16:20   ` Michal Hocko
@ 2013-12-17  2:26     ` Hugh Dickins
  -1 siblings, 0 replies; 54+ messages in thread
From: Hugh Dickins @ 2013-12-17  2:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Li Zefan, Johannes Weiner, Tejun Heo,
	Andrew Morton, KAMEZAWA Hiroyuki, linux-mm, cgroups,
	linux-kernel

On Mon, 16 Dec 2013, Michal Hocko wrote:
> On Mon 16-12-13 00:36:05, Hugh Dickins wrote:
> [...]
> 
> OK, I went through the patch and it looks good except for suspicious
> ctrl->lock handling in swap_cgroup_reassign (see below). I am just
> suggesting to split it into 4 parts

Thanks a lot for studying it, and responding so quickly.
As I remark in reply to Tejun, I'm not nearly so keen on this
approach as you are, and would prefer something short and sweet
from the cgroup end, at least for now; but let's go through your
comments, to keep both options open until we're surer.

> 
> * swap_cgroup_mutex -> swap_cgroup_lock
> * swapon cleanup
> * drop irqsave when taking ctrl->lock
> * mem_cgroup_reparent_swap
> 
> but I will leave the split up to you. Just make sure that the fix is a
> separate patch, please.

Yes indeed, some split like that, maybe even more pieces.  The difficult
part is the ordering: for going into 3.13, we'd prefer a small fix first,
with cleanup to follow in 3.14; but it will be hard to force myself not
to do the cleanup ones first, and until I get down to it I won't recall
how much of the cleanup was essential (to avoid lock ordering problems
or long lock hold times, perhaps).

> 
> [...]
> > --- 3.13-rc4/mm/page_cgroup.c	2013-02-18 15:58:34.000000000 -0800
> > +++ linux/mm/page_cgroup.c	2013-12-15 14:34:36.312485960 -0800
> > @@ -322,7 +322,8 @@ void __meminit pgdat_page_cgroup_init(st
> >  
> >  #ifdef CONFIG_MEMCG_SWAP
> >  
> > -static DEFINE_MUTEX(swap_cgroup_mutex);
> > +static DEFINE_SPINLOCK(swap_cgroup_lock);
> > +
> 
> This one is worth a separate patch IMO.

Agreed.

> 
> >  struct swap_cgroup_ctrl {
> >  	struct page **map;
> >  	unsigned long length;
> > @@ -353,14 +354,11 @@ struct swap_cgroup {
> >  /*
> >   * allocate buffer for swap_cgroup.
> >   */
> > -static int swap_cgroup_prepare(int type)
> > +static int swap_cgroup_prepare(struct swap_cgroup_ctrl *ctrl)
> >  {
> >  	struct page *page;
> > -	struct swap_cgroup_ctrl *ctrl;
> >  	unsigned long idx, max;
> >  
> > -	ctrl = &swap_cgroup_ctrl[type];
> > -
> >  	for (idx = 0; idx < ctrl->length; idx++) {
> >  		page = alloc_page(GFP_KERNEL | __GFP_ZERO);
> >  		if (!page)
> 
> This with swap_cgroup_swapon should be in a separate patch as a cleanup.

Agreed.

> 
> > @@ -407,18 +405,17 @@ unsigned short swap_cgroup_cmpxchg(swp_e
> >  {
> >  	struct swap_cgroup_ctrl *ctrl;
> >  	struct swap_cgroup *sc;
> > -	unsigned long flags;
> >  	unsigned short retval;
> >  
> >  	sc = lookup_swap_cgroup(ent, &ctrl);
> >  
> > -	spin_lock_irqsave(&ctrl->lock, flags);
> > +	spin_lock(&ctrl->lock);
> >  	retval = sc->id;
> >  	if (retval == old)
> >  		sc->id = new;
> >  	else
> >  		retval = 0;
> > -	spin_unlock_irqrestore(&ctrl->lock, flags);
> > +	spin_unlock(&ctrl->lock);
> >  	return retval;
> >  }
> >  
> > @@ -435,14 +432,13 @@ unsigned short swap_cgroup_record(swp_en
> >  	struct swap_cgroup_ctrl *ctrl;
> >  	struct swap_cgroup *sc;
> >  	unsigned short old;
> > -	unsigned long flags;
> >  
> >  	sc = lookup_swap_cgroup(ent, &ctrl);
> >  
> > -	spin_lock_irqsave(&ctrl->lock, flags);
> > +	spin_lock(&ctrl->lock);
> >  	old = sc->id;
> >  	sc->id = id;
> > -	spin_unlock_irqrestore(&ctrl->lock, flags);
> > +	spin_unlock(&ctrl->lock);
> >  
> >  	return old;
> >  }
> 
> I would prefer these two in a separate patch as well. I have no idea why
> these were IRQ aware as this was never needed AFAICS.
> e9e58a4ec3b10 is not very specific...

Agreed.  Yes, I couldn't work out any justification for the _irqsave
variants, and preferred to avoid that clutter rather than add to it.

> 
> > @@ -451,19 +447,60 @@ unsigned short swap_cgroup_record(swp_en
> >   * lookup_swap_cgroup_id - lookup mem_cgroup id tied to swap entry
> >   * @ent: swap entry to be looked up.
> >   *
> > - * Returns CSS ID of mem_cgroup at success. 0 at failure. (0 is invalid ID)
> > + * Returns ID of mem_cgroup at success. 0 at failure. (0 is invalid ID)
> >   */
> >  unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
> >  {
> >  	return lookup_swap_cgroup(ent, NULL)->id;
> >  }
> >  
> > +/**
> > + * swap_cgroup_reassign - assign all old entries to new (before old is freed).
> > + * @old: id of emptied memcg whose entries are now to be reassigned
> > + * @new: id of parent memcg to which those entries are to be assigned
> > + *
> > + * Returns number of entries reassigned, for debugging or for statistics.
> > + */
> > +long swap_cgroup_reassign(unsigned short old, unsigned short new)
> > +{
> > +	long reassigned = 0;
> > +	int type;
> > +
> > +	for (type = 0; type < MAX_SWAPFILES; type++) {
> > +		struct swap_cgroup_ctrl *ctrl = &swap_cgroup_ctrl[type];
> > +		unsigned long idx;
> > +
> > +		for (idx = 0; idx < ACCESS_ONCE(ctrl->length); idx++) {
> > +			struct swap_cgroup *sc, *scend;
> > +
> > +			spin_lock(&swap_cgroup_lock);
> > +			if (idx >= ACCESS_ONCE(ctrl->length))
> > +				goto unlock;
> > +			sc = page_address(ctrl->map[idx]);
> > +			for (scend = sc + SC_PER_PAGE; sc < scend; sc++) {
> > +				if (sc->id != old)
> > +					continue;
> 
> Is this safe? What prevents from race when id is set to old?

I am assuming that when this is called, we shall not be making any
new charges to old (if you see what I mean :) - this is called after
mem_cgroup_reparent_charges() (the main call - I've not yet wrapped
my head around Hannes's later safety-net call; perhaps these patches
would even make that one redundant - dunno).

Quite a lot is made simpler by the fact that we do not need to call
this from mem_cgroup_force_empty(), with all the races that would
entail: there was never any attempt to move these swap charges before,
so it's a pleasure not to have to deal with that possibility now.

> 
> > +				spin_lock(&ctrl->lock);
> > +				if (sc->id == old) {
> 
> Also it seems that compiler is free to optimize this test away, no?
> You need ACCESS_ONCE here as well, I guess.

Oh dear, you're asking me to read again through memory-barriers.txt
(Xmas 2013 edition), and come to a conclusion.  I think this pattern
of test outside spinlock, spinlock, test again inside spinlock is
used in very many places, without any ACCESS_ONCE.  I'll have to
go and search through the precedents.

I've probably brought this upon myself with the ACCESS_ONCE(ctrl->length)
a few lines above, which I added at a late stage to match the one above
it; but now I'm arguing that's unnecessary.

One of the problems with ACCESS_ONCE is that one easily falls into
a mistaken state in which it seems to be necessary everywhere;
but that illusion must be resisted.

The spinlock should make it unnecessary, but I'll have to muse on
semi-permeable membranes, osmosis, stuff like that.

> 
> > +					sc->id = new;
> > +					reassigned++;
> > +				}
> > +				spin_unlock(&ctrl->lock);
> > +			}
> > +unlock:
> > +			spin_unlock(&swap_cgroup_lock);
> > +			cond_resched();
> > +		}
> > +	}
> > +	return reassigned;
> > +}
> > +
> >  int swap_cgroup_swapon(int type, unsigned long max_pages)
> >  {
> >  	void *array;
> >  	unsigned long array_size;
> >  	unsigned long length;
> > -	struct swap_cgroup_ctrl *ctrl;
> > +	struct swap_cgroup_ctrl ctrl;
> >  
> >  	if (!do_swap_account)
> >  		return 0;
> [...]
> -- 
> Michal Hocko
> SUSE Labs

Thanks: let's see if Tejun and Zefan can come up with something
simpler than this at their end.  If we do all decide that this
swap_cgroup_reassign() kind of change is desirable, it would still
be better to make it later on as a cleanup than rush it in now.

Hugh

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: 3.13-rc breaks MEMCG_SWAP
  2013-12-17  1:41               ` Hugh Dickins
@ 2013-12-17  3:13                 ` Li Zefan
  -1 siblings, 0 replies; 54+ messages in thread
From: Li Zefan @ 2013-12-17  3:13 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Tejun Heo, Michal Hocko, Johannes Weiner, Andrew Morton,
	KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel

On 2013/12/17 9:41, Hugh Dickins wrote:
> On Mon, 16 Dec 2013, Tejun Heo wrote:
>> On Mon, Dec 16, 2013 at 06:19:37PM +0100, Michal Hocko wrote:
>>> I have to think about it some more (the brain is not working anymore
>>> today). But what we really need is that nobody gets the same id while
>>> the css is alive.

That's what I meant to do in my last reply.

But I'm confused by

"How would this work? .. the swap will be there
after the last reference to css as well."

>>> So css_from_id returning NULL doesn't seem to be
>>> enough.
>>
>> Oh, I meant whether it's necessary to keep css_from_id() working
>> (ie. doing successful lookups) between offline and release, because
>> that's where lifetimes are coupled.  IOW, if it's enough for cgroup to
>> not recycle the ID until all css's are released && fail css_from_id()
>> lookup after the css is offlined, I can make a five liner quick fix.
> 
> Don't take my word on it, I'm too fuzzy on this: but although it would
> be good to refrain from recycling the ID until all css's are released,
> I believe that it would not be good enough to fail css_from_id() once
> the css is offlined - mem_cgroup_uncharge_swap() needs to uncharge the
> hierarchy of the dead memcg (for example, when tmpfs file is removed).
> 
> Uncharging the dead memcg itself is presumably irrelevant, but it does
> need to locate the right parent to uncharge, and NULL css_from_id()
> would make that impossible.  It would be easy if we said those charges
> migrate to root rather than to parent, but that's inconsistent with
> what we have happily converged upon doing elsewhere (in the preferred
> use_hierarchy case), and it would be a change in behaviour.
> 
> I'm not nearly as enthusiastic for my patch as Michal is: I really
> would prefer a five-liner from you or from Zefan. 

I've come up with a fix. Though it's more than five lines, it mostly moves
a few lines from one place to another. I've tested it with your script.

============================

From: Li Zefan <lizefan@huawei.com>
Date: Tue, 17 Dec 2013 10:45:09 +0800
Subject: [PATCH] cgroup: don't recycle cgroup id until all csses have been destroyed

Hugh reported this bug:

> CONFIG_MEMCG_SWAP is broken in 3.13-rc.  Try something like this:
>
> mkdir -p /tmp/tmpfs /tmp/memcg
> mount -t tmpfs -o size=1G tmpfs /tmp/tmpfs
> mount -t cgroup -o memory memcg /tmp/memcg
> mkdir /tmp/memcg/old
> echo 512M >/tmp/memcg/old/memory.limit_in_bytes
> echo $$ >/tmp/memcg/old/tasks
> cp /dev/zero /tmp/tmpfs/zero 2>/dev/null
> echo $$ >/tmp/memcg/tasks
> rmdir /tmp/memcg/old
> sleep 1	# let rmdir work complete
> mkdir /tmp/memcg/new
> umount /tmp/tmpfs
> dmesg | grep WARNING
> rmdir /tmp/memcg/new
> umount /tmp/memcg
>
> Shows lots of WARNING: CPU: 1 PID: 1006 at kernel/res_counter.c:91
>                            res_counter_uncharge_locked+0x1f/0x2f()
>
> Breakage comes from 34c00c319ce7 ("memcg: convert to use cgroup id").
>
> The lifetime of a cgroup id is different from the lifetime of the
> css id it replaced: memsw's css_get()s do nothing to hold on to the
> old cgroup id, it soon gets recycled to a new cgroup, which then
> mysteriously inherits the old's swap, without any charge for it.

Instead of removing the cgroup id right after all the csses have been
offlined, we should do that only after the csses have been destroyed.

To make sure an invalid css pointer won't be returned after the css
is destroyed, make css_from_id() return NULL in this case.

Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
---
 kernel/cgroup.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index c36d906..769b5bb 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -868,6 +868,15 @@ static void cgroup_diput(struct dentry *dentry, struct inode *inode)
 		struct cgroup *cgrp = dentry->d_fsdata;
 
 		BUG_ON(!(cgroup_is_dead(cgrp)));
+
+		/*
+		 * We should remove the cgroup object from idr before its
+		 * grace period starts, so we won't be looking up a cgroup
+		 * while the cgroup is being freed.
+		 */
+		idr_remove(&cgrp->root->cgroup_idr, cgrp->id);
+		cgrp->id = -1;
+
 		call_rcu(&cgrp->rcu_head, cgroup_free_rcu);
 	} else {
 		struct cfent *cfe = __d_cfe(dentry);
@@ -4104,6 +4113,7 @@ static void css_release(struct percpu_ref *ref)
 	struct cgroup_subsys_state *css =
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
+	rcu_assign_pointer(css->cgroup->subsys[css->ss->subsys_id], NULL);
 	call_rcu(&css->rcu_head, css_free_rcu_fn);
 }
 
@@ -4545,14 +4555,6 @@ static void cgroup_destroy_css_killed(struct cgroup *cgrp)
 	/* delete this cgroup from parent->children */
 	list_del_rcu(&cgrp->sibling);
 
-	/*
-	 * We should remove the cgroup object from idr before its grace
-	 * period starts, so we won't be looking up a cgroup while the
-	 * cgroup is being freed.
-	 */
-	idr_remove(&cgrp->root->cgroup_idr, cgrp->id);
-	cgrp->id = -1;
-
 	dput(d);
 
 	set_bit(CGRP_RELEASABLE, &parent->flags);
-- 
1.8.0.2





^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: 3.13-rc breaks MEMCG_SWAP
  2013-12-17  3:13                 ` Li Zefan
@ 2013-12-17  7:09                   ` Hugh Dickins
  -1 siblings, 0 replies; 54+ messages in thread
From: Hugh Dickins @ 2013-12-17  7:09 UTC (permalink / raw)
  To: Li Zefan
  Cc: Hugh Dickins, Tejun Heo, Michal Hocko, Johannes Weiner,
	Andrew Morton, KAMEZAWA Hiroyuki, linux-mm, cgroups,
	linux-kernel

On Tue, 17 Dec 2013, Li Zefan wrote:
> On 2013/12/17 9:41, Hugh Dickins wrote:
> > On Mon, 16 Dec 2013, Tejun Heo wrote:
> >> On Mon, Dec 16, 2013 at 06:19:37PM +0100, Michal Hocko wrote:
> >>> I have to think about it some more (the brain is not working anymore
> >>> today). But what we really need is that nobody gets the same id while
> >>> the css is alive.
> 
> That's what I meant to do in my last reply.
> 
> But I'm confused by
> 
> "How would this work? .. the swap will be there
> after the last reference to css as well."
> 
> >>> So css_from_id returning NULL doesn't seem to be
> >>> enough.
> >>
> >> Oh, I meant whether it's necessary to keep css_from_id() working
> >> (ie. doing successful lookups) between offline and release, because
> >> that's where lifetimes are coupled.  IOW, if it's enough for cgroup to
> >> not recycle the ID until all css's are released && fail css_from_id()
> >> lookup after the css is offlined, I can make a five liner quick fix.
> > 
> > Don't take my word on it, I'm too fuzzy on this: but although it would
> > be good to refrain from recycling the ID until all css's are released,
> > I believe that it would not be good enough to fail css_from_id() once
> > the css is offlined - mem_cgroup_uncharge_swap() needs to uncharge the
> > hierarchy of the dead memcg (for example, when tmpfs file is removed).
> > 
> > Uncharging the dead memcg itself is presumably irrelevant, but it does
> > need to locate the right parent to uncharge, and NULL css_from_id()
> > would make that impossible.  It would be easy if we said those charges
> > migrate to root rather than to parent, but that's inconsistent with
> > what we have happily converged upon doing elsewhere (in the preferred
> > use_hierarchy case), and it would be a change in behaviour.
> > 
> > I'm not nearly as enthusiastic for my patch as Michal is: I really
> > would prefer a five-liner from you or from Zefan. 
> 
> I've come up with a fix. Though it's more than five lines, it mostly moves
> a few lines from one place to another. I've tested it with your script.

It seems to be working very well for me.  I'm inclined to forgive you for
taking more than five lines, given that there are almost as many -s as +s ;)

In my opinion, your patch is greatly preferable to mine - if there are
good things in mine, memcg can incorporate them at leisure later on,
but right now this seems a much better 3.13 solution.  I'm guessing
Tejun and Hannes will feel the same way: how about you, Michal?

Thank you!
Hugh

> 
> ============================
> 
> From: Li Zefan <lizefan@huawei.com>
> Date: Tue, 17 Dec 2013 10:45:09 +0800
> Subject: [PATCH] cgroup: don't recycle cgroup id until all csses have been destroyed
> 
> Hugh reported this bug:
> 
> > CONFIG_MEMCG_SWAP is broken in 3.13-rc.  Try something like this:
> >
> > mkdir -p /tmp/tmpfs /tmp/memcg
> > mount -t tmpfs -o size=1G tmpfs /tmp/tmpfs
> > mount -t cgroup -o memory memcg /tmp/memcg
> > mkdir /tmp/memcg/old
> > echo 512M >/tmp/memcg/old/memory.limit_in_bytes
> > echo $$ >/tmp/memcg/old/tasks
> > cp /dev/zero /tmp/tmpfs/zero 2>/dev/null
> > echo $$ >/tmp/memcg/tasks
> > rmdir /tmp/memcg/old
> > sleep 1	# let rmdir work complete
> > mkdir /tmp/memcg/new
> > umount /tmp/tmpfs
> > dmesg | grep WARNING
> > rmdir /tmp/memcg/new
> > umount /tmp/memcg
> >
> > Shows lots of WARNING: CPU: 1 PID: 1006 at kernel/res_counter.c:91
> >                            res_counter_uncharge_locked+0x1f/0x2f()
> >
> > Breakage comes from 34c00c319ce7 ("memcg: convert to use cgroup id").
> >
> > The lifetime of a cgroup id is different from the lifetime of the
> > css id it replaced: memsw's css_get()s do nothing to hold on to the
> > old cgroup id, it soon gets recycled to a new cgroup, which then
> > mysteriously inherits the old's swap, without any charge for it.
> 
> Instead of removing the cgroup id right after all the csses have been
> offlined, we should do that only after the csses have been destroyed.
> 
> To make sure an invalid css pointer won't be returned after the css
> is destroyed, make css_from_id() return NULL in this case.
> 
> Reported-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Li Zefan <lizefan@huawei.com>
> ---
>  kernel/cgroup.c | 18 ++++++++++--------
>  1 file changed, 10 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index c36d906..769b5bb 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -868,6 +868,15 @@ static void cgroup_diput(struct dentry *dentry, struct inode *inode)
>  		struct cgroup *cgrp = dentry->d_fsdata;
>  
>  		BUG_ON(!(cgroup_is_dead(cgrp)));
> +
> +		/*
> +		 * We should remove the cgroup object from idr before its
> +		 * grace period starts, so we won't be looking up a cgroup
> +		 * while the cgroup is being freed.
> +		 */
> +		idr_remove(&cgrp->root->cgroup_idr, cgrp->id);
> +		cgrp->id = -1;
> +
>  		call_rcu(&cgrp->rcu_head, cgroup_free_rcu);
>  	} else {
>  		struct cfent *cfe = __d_cfe(dentry);
> @@ -4104,6 +4113,7 @@ static void css_release(struct percpu_ref *ref)
>  	struct cgroup_subsys_state *css =
>  		container_of(ref, struct cgroup_subsys_state, refcnt);
>  
> +	rcu_assign_pointer(css->cgroup->subsys[css->ss->subsys_id], NULL);
>  	call_rcu(&css->rcu_head, css_free_rcu_fn);
>  }
>  
> @@ -4545,14 +4555,6 @@ static void cgroup_destroy_css_killed(struct cgroup *cgrp)
>  	/* delete this cgroup from parent->children */
>  	list_del_rcu(&cgrp->sibling);
>  
> -	/*
> -	 * We should remove the cgroup object from idr before its grace
> -	 * period starts, so we won't be looking up a cgroup while the
> -	 * cgroup is being freed.
> -	 */
> -	idr_remove(&cgrp->root->cgroup_idr, cgrp->id);
> -	cgrp->id = -1;
> -
>  	dput(d);
>  
>  	set_bit(CGRP_RELEASABLE, &parent->flags);
> -- 
> 1.8.0.2
> 
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: 3.13-rc breaks MEMCG_SWAP
  2013-12-17  2:26     ` Hugh Dickins
@ 2013-12-17 10:25       ` Michal Hocko
  -1 siblings, 0 replies; 54+ messages in thread
From: Michal Hocko @ 2013-12-17 10:25 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Li Zefan, Johannes Weiner, Tejun Heo, Andrew Morton,
	KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel

On Mon 16-12-13 18:26:18, Hugh Dickins wrote:
> On Mon, 16 Dec 2013, Michal Hocko wrote:
> > On Mon 16-12-13 00:36:05, Hugh Dickins wrote:
[...]
> > > +/**
> > > + * swap_cgroup_reassign - assign all old entries to new (before old is freed).
> > > + * @old: id of emptied memcg whose entries are now to be reassigned
> > > + * @new: id of parent memcg to which those entries are to be assigned
> > > + *
> > > + * Returns number of entries reassigned, for debugging or for statistics.
> > > + */
> > > +long swap_cgroup_reassign(unsigned short old, unsigned short new)
> > > +{
> > > +	long reassigned = 0;
> > > +	int type;
> > > +
> > > +	for (type = 0; type < MAX_SWAPFILES; type++) {
> > > +		struct swap_cgroup_ctrl *ctrl = &swap_cgroup_ctrl[type];
> > > +		unsigned long idx;
> > > +
> > > +		for (idx = 0; idx < ACCESS_ONCE(ctrl->length); idx++) {
> > > +			struct swap_cgroup *sc, *scend;
> > > +
> > > +			spin_lock(&swap_cgroup_lock);
> > > +			if (idx >= ACCESS_ONCE(ctrl->length))
> > > +				goto unlock;
> > > +			sc = page_address(ctrl->map[idx]);
> > > +			for (scend = sc + SC_PER_PAGE; sc < scend; sc++) {
> > > +				if (sc->id != old)
> > > +					continue;
> > 
> > Is this safe? What prevents from race when id is set to old?
> 
> I am assuming that when this is called, we shall not be making any
> new charges to old (if you see what I mean :) - this is called after
> mem_cgroup_reparent_charges() (the main call - I've not yet wrapped
> my head around Hannes's later safety-net call; perhaps these patches
> would even make that one redundant - dunno).

I have to think about this some more. I was playing with an alternative
fix for this race, and a few trace points showed me that the races are
quite common.

Anyway, wouldn't it be easier to take the lock for a batch of sc's and
then release it, followed by cond_resched()? Doing 1024 iterations
doesn't sound too bad to me (we would take the lock 2 times for 4k pages).
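
Something along these lines is what I mean - just a sketch, untested,
and SC_BATCH is a made-up constant (note that swap_cgroup_lock would
have to be dropped around cond_resched(), which cannot run under a
spinlock):

	#define SC_BATCH	1024

	sc = page_address(ctrl->map[idx]);
	for (scend = sc + SC_PER_PAGE; sc < scend; ) {
		struct swap_cgroup *batchend = min(sc + SC_BATCH, scend);

		/* one lock/unlock per batch instead of per matching entry */
		spin_lock(&ctrl->lock);
		for (; sc < batchend; sc++) {
			if (sc->id == old) {
				sc->id = new;
				reassigned++;
			}
		}
		spin_unlock(&ctrl->lock);
		cond_resched();
	}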

> Quite a lot is made simpler by the fact that we do not need to call
> this from mem_cgroup_force_empty(), with all the races that would
> entail: there was never any attempt to move these swap charges before,
> so it's a pleasure not to have to deal with that possibility now.
>
> > > +				spin_lock(&ctrl->lock);
> > > +				if (sc->id == old) {
> > 
> > Also it seems that compiler is free to optimize this test away, no?
> > You need ACCESS_ONCE here as well, I guess.
> 
> Oh dear, you're asking me to read again through memory-barriers.txt
> (Xmas 2013 edition), and come to a conclusion.  I think this pattern
> of test outside spinlock, spinlock, test again inside spinlock is
> used in very many places, without any ACCESS_ONCE.  I'll have to
> go and search through the precedents.
> 
> I've probably brought this upon myself with the ACCESS_ONCE(ctrl->length)
> a few lines above, which I added at a late stage to match the one above
> it; but now I'm arguing that's unnecessary.

Yeah, that triggered the red flag ;)

> One of the problems with ACCESS_ONCE is that one easily falls into
> a mistaken state in which it seems to be necessary everywhere;
> but that illusion must be resisted.
> 
> The spinlock should make it unnecessary, but I'll have to muse on
> semi-permeable membranes, osmosis, stuff like that.

OK, I have checked that and you are right: ACCESS_ONCE is not needed.
The compiler is not allowed to optimize across barrier(), which is part
of spin_lock(). So this should be ok.
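
So for the pattern in your patch, annotated:

	if (sc->id != old)		/* unlocked peek, may be stale */
		continue;
	spin_lock(&ctrl->lock);		/* implies barrier(): no caching past it */
	if (sc->id == old) {		/* the re-test really re-reads sc->id */
		sc->id = new;
		reassigned++;
	}
	spin_unlock(&ctrl->lock);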
[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: 3.13-rc breaks MEMCG_SWAP
  2013-12-17  3:13                 ` Li Zefan
@ 2013-12-17 12:29                   ` Tejun Heo
  -1 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2013-12-17 12:29 UTC (permalink / raw)
  To: Li Zefan
  Cc: Hugh Dickins, Michal Hocko, Johannes Weiner, Andrew Morton,
	KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel

Hello, Li.

On Tue, Dec 17, 2013 at 11:13:39AM +0800, Li Zefan wrote:
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index c36d906..769b5bb 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -868,6 +868,15 @@ static void cgroup_diput(struct dentry *dentry, struct inode *inode)
>  		struct cgroup *cgrp = dentry->d_fsdata;
>  
>  		BUG_ON(!(cgroup_is_dead(cgrp)));
> +
> +		/*
> +		 * We should remove the cgroup object from idr before its
> +		 * grace period starts, so we won't be looking up a cgroup
> +		 * while the cgroup is being freed.
> +		 */

Let's remove this comment and instead note that this is to be made
per-css.  I mixed up the lifetime rules of the cgroup and the css and
thought css_from_id() should fail once the css is confirmed to be
offline - hence the above comment.  It looks like we'll eventually
have to move cgrp->id to css->id (just a simple per-ss idr), as the
two objects' lifetime rules will be completely separate.  Other than
that, looks good to me.
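
A rough sketch of that direction (css_idr here is a hypothetical
field name, not something the tree has today):

	struct cgroup_subsys_state *css_from_id(int id, struct cgroup_subsys *ss)
	{
		/*
		 * Per-subsystem idr, populated when the css is created
		 * and emptied at css release, so a lookup succeeds for
		 * exactly as long as the css is alive.
		 */
		return idr_find(&ss->css_idr, id);
	}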

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: 3.13-rc breaks MEMCG_SWAP
  2013-12-17  3:13                 ` Li Zefan
@ 2013-12-17 12:48                   ` Michal Hocko
  -1 siblings, 0 replies; 54+ messages in thread
From: Michal Hocko @ 2013-12-17 12:48 UTC (permalink / raw)
  To: Li Zefan
  Cc: Hugh Dickins, Tejun Heo, Johannes Weiner, Andrew Morton,
	KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel

On Tue 17-12-13 11:13:39, Li Zefan wrote:
> On 2013/12/17 9:41, Hugh Dickins wrote:
> > On Mon, 16 Dec 2013, Tejun Heo wrote:
> >> On Mon, Dec 16, 2013 at 06:19:37PM +0100, Michal Hocko wrote:
> >>> I have to think about it some more (the brain is not working anymore
> >>> today). But what we really need is that nobody gets the same id while
> >>> the css is alive.
> 
> That's what I meant to do in my last reply.
> 
> But I'm confused by
> 
> "How would this work? .. the swap will be there
> after the last reference to css as well."

Sorry about the confusion.  Johannes correctly pointed out that a css
reference is taken when we record the memcg id, and released after the
record is removed.
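
That is, roughly (a sketch of the two sides from memory, so the
details may differ from the actual 3.13-rc code):

	/* swap-out side: the recorded id is backed by a css reference */
	swap_cgroup_record(ent, mem_cgroup_id(memcg));
	css_get(&memcg->css);

	/* swap-free side: clear the record, then drop the reference */
	id = swap_cgroup_record(ent, 0);
	memcg = mem_cgroup_lookup(id);
	if (memcg)
		css_put(&memcg->css);

So the css itself cannot go away while swap entries still point at it;
it was only the cgroup id that got recycled too early.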

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: 3.13-rc breaks MEMCG_SWAP
  2013-12-17  3:13                 ` Li Zefan
@ 2013-12-17 13:05                   ` Michal Hocko
  -1 siblings, 0 replies; 54+ messages in thread
From: Michal Hocko @ 2013-12-17 13:05 UTC (permalink / raw)
  To: Li Zefan
  Cc: Hugh Dickins, Tejun Heo, Johannes Weiner, Andrew Morton,
	KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel

On Tue 17-12-13 11:13:39, Li Zefan wrote:
[...]
> From: Li Zefan <lizefan@huawei.com>
> Date: Tue, 17 Dec 2013 10:45:09 +0800
> Subject: [PATCH] cgroup: don't recycle cgroup id until all csses' have been destroyed
> 
> Hugh reported this bug:
> 
> > CONFIG_MEMCG_SWAP is broken in 3.13-rc.  Try something like this:
> >
> > mkdir -p /tmp/tmpfs /tmp/memcg
> > mount -t tmpfs -o size=1G tmpfs /tmp/tmpfs
> > mount -t cgroup -o memory memcg /tmp/memcg
> > mkdir /tmp/memcg/old
> > echo 512M >/tmp/memcg/old/memory.limit_in_bytes
> > echo $$ >/tmp/memcg/old/tasks
> > cp /dev/zero /tmp/tmpfs/zero 2>/dev/null
> > echo $$ >/tmp/memcg/tasks
> > rmdir /tmp/memcg/old
> > sleep 1	# let rmdir work complete
> > mkdir /tmp/memcg/new
> > umount /tmp/tmpfs
> > dmesg | grep WARNING
> > rmdir /tmp/memcg/new
> > umount /tmp/memcg
> >
> > Shows lots of WARNING: CPU: 1 PID: 1006 at kernel/res_counter.c:91
> >                            res_counter_uncharge_locked+0x1f/0x2f()
> >
> > Breakage comes from 34c00c319ce7 ("memcg: convert to use cgroup id").
> >
> > The lifetime of a cgroup id is different from the lifetime of the
> > css id it replaced: memsw's css_get()s do nothing to hold on to the
> > old cgroup id, it soon gets recycled to a new cgroup, which then
> > mysteriously inherits the old's swap, without any charge for it.
> 
> Instead of removing cgroup id right after all the csses have been
> offlined, we should do that after csses have been destroyed.
> 
> To make sure an invalid css pointer won't be returned after the css
> is destroyed, make sure css_from_id() returns NULL in this case.

OK, so this will postpone idr_remove until css_free, and until then
mem_cgroup_lookup finds the correct memcg.  This will work as well;
it is basically the same thing we had with css_id, AFAIR.
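
For reference, the lookup in question - a from-memory sketch of the
3.13-rc code, so double-check the details:

	static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
	{
		struct cgroup_subsys_state *css;

		/* id 0 means "no cgroup" in the swap_cgroup array */
		if (!id)
			return NULL;
		css = css_from_id(id, &mem_cgroup_subsys);
		return mem_cgroup_from_css(css);
	}

With the patch, css_from_id() keeps resolving until css_free, and
afterwards returns NULL instead of pointing at a recycled id.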

For some reason I originally thought this wouldn't be possible,
because of the comment above idr_remove.

> Reported-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Li Zefan <lizefan@huawei.com>

Reviewed-by: Michal Hocko <mhocko@suse.cz>

> ---
>  kernel/cgroup.c | 18 ++++++++++--------
>  1 file changed, 10 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index c36d906..769b5bb 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -868,6 +868,15 @@ static void cgroup_diput(struct dentry *dentry, struct inode *inode)
>  		struct cgroup *cgrp = dentry->d_fsdata;
>  
>  		BUG_ON(!(cgroup_is_dead(cgrp)));
> +
> +		/*
> +		 * We should remove the cgroup object from idr before its
> +		 * grace period starts, so we won't be looking up a cgroup
> +		 * while the cgroup is being freed.
> +		 */
> +		idr_remove(&cgrp->root->cgroup_idr, cgrp->id);
> +		cgrp->id = -1;
> +
>  		call_rcu(&cgrp->rcu_head, cgroup_free_rcu);
>  	} else {
>  		struct cfent *cfe = __d_cfe(dentry);
> @@ -4104,6 +4113,7 @@ static void css_release(struct percpu_ref *ref)
>  	struct cgroup_subsys_state *css =
>  		container_of(ref, struct cgroup_subsys_state, refcnt);
>  
> +	rcu_assign_pointer(css->cgroup->subsys[css->ss->subsys_id], NULL);
>  	call_rcu(&css->rcu_head, css_free_rcu_fn);
>  }
>  
> @@ -4545,14 +4555,6 @@ static void cgroup_destroy_css_killed(struct cgroup *cgrp)
>  	/* delete this cgroup from parent->children */
>  	list_del_rcu(&cgrp->sibling);
>  
> -	/*
> -	 * We should remove the cgroup object from idr before its grace
> -	 * period starts, so we won't be looking up a cgroup while the
> -	 * cgroup is being freed.
> -	 */
> -	idr_remove(&cgrp->root->cgroup_idr, cgrp->id);
> -	cgrp->id = -1;
> -
>  	dput(d);
>  
>  	set_bit(CGRP_RELEASABLE, &parent->flags);
> -- 
> 1.8.0.2
> 
> 
> 
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: 3.13-rc breaks MEMCG_SWAP
  2013-12-17  7:09                   ` Hugh Dickins
@ 2013-12-17 13:11                     ` Michal Hocko
  -1 siblings, 0 replies; 54+ messages in thread
From: Michal Hocko @ 2013-12-17 13:11 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Li Zefan, Tejun Heo, Johannes Weiner, Andrew Morton,
	KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel

On Mon 16-12-13 23:09:23, Hugh Dickins wrote:
> On Tue, 17 Dec 2013, Li Zefan wrote:
> > On 2013/12/17 9:41, Hugh Dickins wrote:
> > > On Mon, 16 Dec 2013, Tejun Heo wrote:
> > >> On Mon, Dec 16, 2013 at 06:19:37PM +0100, Michal Hocko wrote:
> > >>> I have to think about it some more (the brain is not working anymore
> > >>> today). But what we really need is that nobody gets the same id while
> > >>> the css is alive.
> > 
> > That's what I meant to do in my last reply.
> > 
> > But I'm confused by
> > 
> > "How would this work? .. the swap will be there
> > after the last reference to css as well."
> > 
> > >>> So css_from_id returning NULL doesn't seem to be
> > >>> enough.
> > >>
> > >> Oh, I meant whether it's necessary to keep css_from_id() working
> > >> (ie. doing successful lookups) between offline and release, because
> > >> that's where lifetimes are coupled.  IOW, if it's enough for cgroup to
> > >> not recycle the ID until all css's are released && fail css_from_id()
> > >> lookup after the css is offlined, I can make a five liner quick fix.
> > > 
> > > Don't take my word on it, I'm too fuzzy on this: but although it would
> > > be good to refrain from recycling the ID until all css's are released,
> > > I believe that it would not be good enough to fail css_from_id() once
> > > the css is offlined - mem_cgroup_uncharge_swap() needs to uncharge the
> > > hierarchy of the dead memcg (for example, when tmpfs file is removed).
> > > 
> > > Uncharging the dead memcg itself is presumably irrelevant, but it does
> > > need to locate the right parent to uncharge, and NULL css_from_id()
> > > would make that impossible.  It would be easy if we said those charges
> > > migrate to root rather than to parent, but that's inconsistent with
> > > what we have happily converged upon doing elsewhere (in the preferred
> > > use_hierarchy case), and it would be a change in behaviour.
> > > 
> > > I'm not nearly as enthusiastic for my patch as Michal is: I really
> > > would prefer a five-liner from you or from Zefan. 
> > 
> > I've come up with a fix. Though it's more than five-line, it mostly moves
> > a few lines from one place to another. I've tested it with your script.
> 
> It seems to be working very well for me.  I'm inclined to forgive you for
> taking more than five lines, given that there are almost as many -s as +s ;)
> 
> In my opinion, your patch is greatly preferable to mine - if there are
> good things in mine, memcg can incorporate them at leisure later on,
> but right now this seems a much better 3.13 solution.  I'm guessing
> Tejun and Hannes will feel the same way: how about you, Michal?

OK, let's go with this for now, but I would like swap accounting to be
less tricky and confusing, and I believe explicit reparenting should
help there.

And sorry for distracting you from the css-based approach.  I had
totally misinterpreted the comment above idr_remove.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: 3.13-rc breaks MEMCG_SWAP
  2013-12-17 12:29                   ` Tejun Heo
  (?)
@ 2013-12-17 13:12                     ` Michal Hocko
  -1 siblings, 0 replies; 54+ messages in thread
From: Michal Hocko @ 2013-12-17 13:12 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Li Zefan, Hugh Dickins, Johannes Weiner, Andrew Morton,
	KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel

On Tue 17-12-13 07:29:26, Tejun Heo wrote:
> Hello, Li.
> 
> On Tue, Dec 17, 2013 at 11:13:39AM +0800, Li Zefan wrote:
> > diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> > index c36d906..769b5bb 100644
> > --- a/kernel/cgroup.c
> > +++ b/kernel/cgroup.c
> > @@ -868,6 +868,15 @@ static void cgroup_diput(struct dentry *dentry, struct inode *inode)
> >  		struct cgroup *cgrp = dentry->d_fsdata;
> >  
> >  		BUG_ON(!(cgroup_is_dead(cgrp)));
> > +
> > +		/*
> > +		 * We should remove the cgroup object from idr before its
> > +		 * grace period starts, so we won't be looking up a cgroup
> > +		 * while the cgroup is being freed.
> > +		 */
> 
> Let's remove this comment and instead comment that this is to be made
> per-css.  I mixed up the lifetime rules of the cgroup and css and
> thought css_from_id() should fail once css is confirmed to be offline,
> so the above comment.  It looks like we'll eventually have to move
> cgrp->id to css->id (just simple per-ss idr) as the two objects'
> lifetime rules will be completely separate.  Other than that, looks
> good to me.

Yeah, please remove it.  It made me think that idr_remove could not
be postponed, which is why I ruled out a css-based solution from the
beginning.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: 3.13-rc breaks MEMCG_SWAP
  2013-12-17  1:41               ` Hugh Dickins
@ 2013-12-17 13:14                 ` Michal Hocko
  -1 siblings, 0 replies; 54+ messages in thread
From: Michal Hocko @ 2013-12-17 13:14 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Tejun Heo, Li Zefan, Johannes Weiner, Andrew Morton,
	KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel

On Mon 16-12-13 17:41:38, Hugh Dickins wrote:
> On Mon, 16 Dec 2013, Tejun Heo wrote:
> > On Mon, Dec 16, 2013 at 06:19:37PM +0100, Michal Hocko wrote:
> > > I have to think about it some more (the brain is not working anymore
> > > today). But what we really need is that nobody gets the same id while
> > > the css is alive. So css_from_id returning NULL doesn't seem to be
> > > enough.
> > 
> > Oh, I meant whether it's necessary to keep css_from_id() working
> > (ie. doing successful lookups) between offline and release, because
> > that's where lifetimes are coupled.  IOW, if it's enough for cgroup to
> > not recycle the ID until all css's are released && fail css_from_id()
> > lookup after the css is offlined, I can make a five liner quick fix.
> 
> Don't take my word on it, I'm too fuzzy on this: but although it would
> be good to refrain from recycling the ID until all css's are released,
> I believe that it would not be good enough to fail css_from_id() once
> the css is offlined - mem_cgroup_uncharge_swap() needs to uncharge the
> hierarchy of the dead memcg (for example, when tmpfs file is removed).
> 
> Uncharging the dead memcg itself is presumably irrelevant, but it does
> need to locate the right parent to uncharge, and NULL css_from_id()
> would make that impossible. 

Exactly!
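
For context, the path Hugh describes, as a simplified sketch of the
3.13-rc code (do_swap_account check and RCU locking elided):

	void mem_cgroup_uncharge_swap(swp_entry_t ent)
	{
		unsigned short id = swap_cgroup_record(ent, 0);
		struct mem_cgroup *memcg = mem_cgroup_lookup(id);

		if (memcg) {
			/* memcg may be dead: res_counter_uncharge
			 * drops the charge up the hierarchy */
			if (!mem_cgroup_is_root(memcg))
				res_counter_uncharge(&memcg->memsw, PAGE_SIZE);
			mem_cgroup_swap_statistics(memcg, false);
			css_put(&memcg->css);
		}
	}

A NULL lookup here would strand the memsw charge in the dead memcg's
parents forever.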

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: 3.13-rc breaks MEMCG_SWAP
  2013-12-17 13:11                     ` Michal Hocko
@ 2013-12-17 13:14                       ` Tejun Heo
  -1 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2013-12-17 13:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, Li Zefan, Johannes Weiner, Andrew Morton,
	KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel

On Tue, Dec 17, 2013 at 02:11:19PM +0100, Michal Hocko wrote:
> And sorry for distracting you from the css based approach. I have
> totally misinterpreted the comment above idr_remove.

Heh, you actually interpreted it correctly.  I was the one confused
when moving the id to the cgroup.  I should have just converted css_id
to an idr-based per-subsys id in the first place.  Sorry about that. :)

-- 
tejun

^ permalink raw reply	[flat|nested] 54+ messages in thread

* [PATCH cgroup/for-3.13-fixes] cgroup: don't recycle cgroup id until all csses' have been destroyed
  2013-12-17  3:13                 ` Li Zefan
  (?)
@ 2013-12-17 13:15                   ` Tejun Heo
  -1 siblings, 0 replies; 54+ messages in thread
From: Tejun Heo @ 2013-12-17 13:15 UTC (permalink / raw)
  To: Li Zefan
  Cc: Hugh Dickins, Michal Hocko, Johannes Weiner, Andrew Morton,
	KAMEZAWA Hiroyuki, linux-mm, cgroups, linux-kernel

Hey,

I updated the comment myself and applied the patch to
cgroup/for-3.13-fixes.

Thanks!
-------- 8< --------
From c1a71504e9715812a2d15e7c03b5aa147ae70ded Mon Sep 17 00:00:00 2001
From: Li Zefan <lizefan@huawei.com>
Date: Tue, 17 Dec 2013 11:13:39 +0800

Hugh reported this bug:

> CONFIG_MEMCG_SWAP is broken in 3.13-rc.  Try something like this:
>
> mkdir -p /tmp/tmpfs /tmp/memcg
> mount -t tmpfs -o size=1G tmpfs /tmp/tmpfs
> mount -t cgroup -o memory memcg /tmp/memcg
> mkdir /tmp/memcg/old
> echo 512M >/tmp/memcg/old/memory.limit_in_bytes
> echo $$ >/tmp/memcg/old/tasks
> cp /dev/zero /tmp/tmpfs/zero 2>/dev/null
> echo $$ >/tmp/memcg/tasks
> rmdir /tmp/memcg/old
> sleep 1	# let rmdir work complete
> mkdir /tmp/memcg/new
> umount /tmp/tmpfs
> dmesg | grep WARNING
> rmdir /tmp/memcg/new
> umount /tmp/memcg
>
> Shows lots of WARNING: CPU: 1 PID: 1006 at kernel/res_counter.c:91
>                            res_counter_uncharge_locked+0x1f/0x2f()
>
> Breakage comes from 34c00c319ce7 ("memcg: convert to use cgroup id").
>
> The lifetime of a cgroup id is different from the lifetime of the
> css id it replaced: memsw's css_get()s do nothing to hold on to the
> old cgroup id, it soon gets recycled to a new cgroup, which then
> mysteriously inherits the old's swap, without any charge for it.

Instead of removing cgroup id right after all the csses have been
offlined, we should do that after csses have been destroyed.

To make sure an invalid css pointer won't be returned after the css
is destroyed, make sure css_from_id() returns NULL in this case.

tj: Updated comment to note planned changes for cgrp->id.

Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/cgroup.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index bcb1755..bc1dcab 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -890,6 +890,16 @@ static void cgroup_diput(struct dentry *dentry, struct inode *inode)
 		struct cgroup *cgrp = dentry->d_fsdata;
 
 		BUG_ON(!(cgroup_is_dead(cgrp)));
+
+		/*
+		 * XXX: cgrp->id is only used to look up css's.  As cgroup
+		 * and css's lifetimes will be decoupled, it should be made
+		 * per-subsystem and moved to css->id so that lookups are
+		 * successful until the target css is released.
+		 */
+		idr_remove(&cgrp->root->cgroup_idr, cgrp->id);
+		cgrp->id = -1;
+
 		call_rcu(&cgrp->rcu_head, cgroup_free_rcu);
 	} else {
 		struct cfent *cfe = __d_cfe(dentry);
@@ -4268,6 +4278,7 @@ static void css_release(struct percpu_ref *ref)
 	struct cgroup_subsys_state *css =
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
+	rcu_assign_pointer(css->cgroup->subsys[css->ss->subsys_id], NULL);
 	call_rcu(&css->rcu_head, css_free_rcu_fn);
 }
 
@@ -4733,14 +4744,6 @@ static void cgroup_destroy_css_killed(struct cgroup *cgrp)
 	/* delete this cgroup from parent->children */
 	list_del_rcu(&cgrp->sibling);
 
-	/*
-	 * We should remove the cgroup object from idr before its grace
-	 * period starts, so we won't be looking up a cgroup while the
-	 * cgroup is being freed.
-	 */
-	idr_remove(&cgrp->root->cgroup_idr, cgrp->id);
-	cgrp->id = -1;
-
 	dput(d);
 
 	set_bit(CGRP_RELEASABLE, &parent->flags);
-- 
1.8.4.2


^ permalink raw reply related	[flat|nested] 54+ messages in thread

Thread overview:
2013-12-16  8:36 3.13-rc breaks MEMCG_SWAP Hugh Dickins
2013-12-16  9:36 ` Li Zefan
2013-12-16  9:53   ` Michal Hocko
2013-12-16 10:40     ` Michal Hocko
2013-12-16 16:35       ` Tejun Heo
2013-12-16 17:19         ` Michal Hocko
2013-12-16 17:21           ` Tejun Heo
2013-12-17  1:41             ` Hugh Dickins
2013-12-17  3:13               ` Li Zefan
2013-12-17  7:09                 ` Hugh Dickins
2013-12-17 13:11                   ` Michal Hocko
2013-12-17 13:14                     ` Tejun Heo
2013-12-17 12:29                 ` Tejun Heo
2013-12-17 13:12                   ` Michal Hocko
2013-12-17 12:48                 ` Michal Hocko
2013-12-17 13:05                 ` Michal Hocko
2013-12-17 13:15                 ` [PATCH cgroup/for-3.13-fixes] cgroup: don't recycle cgroup id until all csses' have been destroyed Tejun Heo
2013-12-17 13:14               ` 3.13-rc breaks MEMCG_SWAP Michal Hocko
2013-12-16 16:41       ` Johannes Weiner
2013-12-16 17:15         ` Michal Hocko
2013-12-16 17:19           ` Tejun Heo
2013-12-16  9:49 ` Michal Hocko
2013-12-16 16:20 ` Michal Hocko
2013-12-17  2:26   ` Hugh Dickins
2013-12-17 10:25     ` Michal Hocko
